1. Installing the Required Tools and Libraries
On Ubuntu, first install a Python environment (Python 3.6+ recommended) and the libraries needed for NLP. Open a terminal and run the following commands:
sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip -y
pip3 install nltk spacy textblob gensim transformers
python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
python3 -m spacy download en_core_web_sm
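If the installation succeeded, all of these libraries should import without errors. A minimal sanity-check sketch (the loop is only an illustrative check, not part of any library):
import importlib
# Try importing each NLP package installed above and report the result
for pkg in ("nltk", "spacy", "textblob", "gensim", "transformers"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as exc:
        print(f"{pkg}: missing ({exc})")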
2. Text Preprocessing: The Foundation of NLP
Text preprocessing converts raw text into a format suitable for analysis. The main steps are tokenization, stop-word removal, and stemming/lemmatization:
Tokenization: use NLTK's word_tokenize or spaCy's tokenizer:
import nltk
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is fascinating."
tokens = word_tokenize(text)  # Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
Or use spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]  # Same output as above
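The punkt model downloaded earlier also lets NLTK split text into sentences; a quick sketch (the example sentence is arbitrary):
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize("NLP is fun. It has many applications.")
print(sentences)  # ['NLP is fun.', 'It has many applications.']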
Stop-word removal: filter out common function words with NLTK's stopwords corpus:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]  # Output: ['Natural', 'Language', 'Processing', 'fascinating', '.']
Stemming/lemmatization: use NLTK's PorterStemmer or spaCy's lemmatizer:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]  # Output: ['natur', 'languag', 'process', 'fascin', '.']
# spaCy lemmatization
lemmatized_tokens = [token.lemma_ for token in doc]  # Output: ['natural', 'language', 'processing', 'be', 'fascinate', '.']
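Putting these steps together, a small helper like the sketch below (the function name is made up, and choosing spaCy lemmas over NLTK stems is just one reasonable option) turns raw text into a cleaned token list:
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Tokenize and lemmatize with spaCy, then drop stop words and non-alphabetic tokens
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.is_alpha and tok.text.lower() not in stop_words]

print(preprocess("Natural Language Processing is fascinating."))
# Roughly ['natural', 'language', 'processing', 'fascinate']; exact lemmas may vary by model version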
3. Implementing Key NLP Tasks
Part-of-speech (POS) tagging identifies the part of speech of each word in the text (noun, verb, adjective, etc.). Use NLTK or spaCy:
# NLTK (pos_tag needs the 'averaged_perceptron_tagger' resource; download it once)
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)  # Output: [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('fascinating', 'JJ'), ('.', '.')]
# spaCy
for token in doc:
    print(token.text, token.pos_)  # Output: Natural NOUN, Language PROPN, Processing PROPN, is AUX, fascinating ADJ, . PUNCT
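These tags can drive simple filtering; for example, keeping only the noun-like tokens from the spaCy doc analyzed above (a sketch based on the POS output just shown):
nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
print(nouns)  # ['Natural', 'Language', 'Processing'] given the tags printed above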
Named entity recognition (NER) identifies entities in the text (person names, place names, organization names, etc.). Use NLTK or spaCy:
# NLTK (ne_chunk needs the 'maxent_ne_chunker' and 'words' resources)
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
nltk.download('maxent_ne_chunker')
nltk.download('words')
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
ner_tree = ne_chunk(pos_tags)  # Output: (S (ORGANIZATION Natural) (ORGANIZATION Language) (ORGANIZATION Processing) is fascinating .)
# spaCy
for ent in doc.ents:
    print(ent.text, ent.label_)  # If the text contains entities such as names or places, each one is printed with its type (e.g. "Apple" → ORG)
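The sample sentence used above contains no named entities, so doc.ents is empty. Trying a sentence that does (an illustrative example; labels may vary slightly with the model version):
ner_doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in ner_doc.ents:
    print(ent.text, ent.label_)
# Typically prints: Apple ORG, U.K. GPE, $1 billion MONEY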
Sentiment analysis determines the sentiment of a text (positive, negative, or neutral). Use TextBlob:
from textblob import TextBlob
blob = TextBlob("I love Python. It's amazing!")
sentiment = blob.sentiment  # Output: Sentiment(polarity=0.8, subjectivity=0.75)
# polarity ranges over [-1, 1]: > 0 is positive, < 0 is negative; subjectivity ranges over [0, 1]: > 0.5 is subjective
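Using those thresholds, a tiny helper (the cutoff of 0.1 is just an illustrative convention) can turn the polarity score into a discrete label:
def polarity_to_label(text, threshold=0.1):
    # Map TextBlob polarity to a coarse positive/negative/neutral label
    polarity = TextBlob(text).sentiment.polarity
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(polarity_to_label("I love Python. It's amazing!"))  # positive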
Topic modeling discovers hidden topics in a collection of texts. Use Gensim:
from gensim import corpora, models
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Prepare the corpus
texts = ["Python is great for data analysis.", "Data science requires Python skills.", "Machine learning uses Python libraries."]
tokens = [word_tokenize(text.lower()) for text in texts]
stop_words = set(stopwords.words('english'))
filtered_tokens = [[word for word in token if word.isalpha() and word not in stop_words] for token in tokens]
# Build the dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(filtered_tokens)
corpus = [dictionary.doc2bow(text) for text in filtered_tokens]
# Train an LDA model with 2 topics
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
topics = lda_model.print_topics()  # Returns the keywords and weights for each topic
for topic in topics:
    print(topic)
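The trained model can also estimate the topic mixture of a new, unseen document; a minimal sketch (the example sentence is arbitrary):
new_doc = "Python libraries make machine learning easier."
new_tokens = [w for w in word_tokenize(new_doc.lower()) if w.isalpha() and w not in stop_words]
new_bow = dictionary.doc2bow(new_tokens)
print(lda_model.get_document_topics(new_bow))  # e.g. [(0, 0.71), (1, 0.29)]; the proportions will vary from run to run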
4. Going Further: Using Pretrained Models (e.g., BERT)
Hugging Face's transformers library provides pretrained BERT-family models that can be used for tasks such as text classification and question answering:
from transformers import pipeline
# Load a pretrained sentiment-analysis model
classifier = pipeline("sentiment-analysis")
result = classifier("I love Ubuntu and Python!")  # Output: [{'label': 'POSITIVE', 'score': 0.9998}]
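The pipeline also accepts a list of sentences, and you can pin a specific checkpoint instead of relying on the default (the model named below is the checkpoint commonly used as the default for this pipeline; treat it as an example):
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
results = classifier(["I love Ubuntu and Python!", "This install was painful."])
for r in results:
    print(r["label"], round(r["score"], 4))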
Notes
It is recommended to install these libraries in an isolated environment (created with venv) to avoid dependency conflicts;