[Natural Language Processing] Sentiment Analysis (1): A Naive Bayes Implementation with NLTK

2023-11-17


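A Naive Bayes classifier can be used to estimate the probability that an input text belongs to each of a set of classes, for example to predict whether a review is positive or negative.

It is "naive" because it assumes that the words in a text are independent of one another given the class, whereas in real natural language word order carries contextual information. Despite this assumption, Naive Bayes often predicts classes with good accuracy even when only a small training set is available.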
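The independence assumption can be written out explicitly. As a sketch in standard notation (the symbols $c$ for a class and $w_1, \dots, w_n$ for the words of a text are introduced here for illustration and do not appear in the original post):

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Because the likelihood is a plain product over individual words, reordering the words of a review does not change the score, which is exactly the information the "naive" model throws away.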

Recommended reading: Baines, O., Naive Bayes: Machine Learning and Text Classification Application of Bayes' Theorem.

The code for this article has been uploaded to my GitHub and can be downloaded from there if needed.

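1. Dataset

We use the imdb_reviews dataset provided by tensorflow-datasets. It is a large movie review dataset for binary sentiment classification and contains substantially more data than earlier benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.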


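2. Environment setup

Install tensorflow and tensorflow-datasets, and pay attention to matching versions. I ran into trouble here: it is best not to use very recent versions, otherwise you are likely to hit a number of incompatibilities.

First, create a dedicated virtual environment (Python 3.6 was chosen for its good compatibility).

Install tensorflow: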

pip install tensorflow==2.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

Install tensorflow-datasets:

pip install tensorflow-datasets==2.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

Install nltk:

pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple/

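If importing nltk raises an error that suggests running nltk.download('omw-1.4'), follow the prompt to download it, or download the file manually from the NLTK Corpora site and place it in the corresponding directory.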
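The other NLTK resources can be fetched the same way in one go. The following is a minimal sketch, not part of the original project: the helper name download_nltk_resources and the list of resource names are assumptions, inferred from the NLTK features the preprocessing function in this article relies on (tokenization, stop words, WordNet lemmatization). Depending on the NLTK version, the exact set needed may differ.

```python
# Resource names are assumptions based on what preprocessing() uses:
# word_tokenize -> 'punkt', stopwords -> 'stopwords',
# WordNetLemmatizer -> 'wordnet' (plus 'omw-1.4' on some NLTK versions).
REQUIRED_NLTK_RESOURCES = ["punkt", "stopwords", "wordnet", "omw-1.4"]


def download_nltk_resources(resources=REQUIRED_NLTK_RESOURCES):
    """Download each resource; nltk.download skips packages already up to date."""
    import nltk  # imported here so the list above can be inspected without NLTK

    for name in resources:
        nltk.download(name)


# Run once inside the project's environment (requires network access):
# download_nltk_resources()
```

If a download fails behind a firewall, the manual route described above still applies.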

The other packages install without trouble.

Before writing code in Jupyter Notebook, make sure the notebook is running in the intended virtual environment. You can check this as follows.

import sys
sys.executable


The path that is printed should point into the virtual environment chosen for this project.
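Beyond reading the path by eye, the check can also be done programmatically. A minimal sketch using only the standard library: the helper name in_virtualenv is made up for this illustration, and the comparison relies on sys.base_prefix, which is how venv-style environments are marked. Conda environments are not detected this way, so for those the printed path remains the thing to look at.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```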

3. Imports

import nltk
from nltk.metrics.scores import precision, recall, f_measure
import pandas as pd
import collections

import sys
sys.path.append("..") # Adds higher directory to python modules path.
from NLPmoviereviews.data import load_data_sent
from NLPmoviereviews.utilities import preprocessing

Here, NLPmoviereviews.data wraps the dataset download using tensorflow-datasets. (NLPmoviereviews is a package I wrote myself.)

import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import text_to_word_sequence

def load_data(percentage_of_sentences=10):
    """
    Load the imdb_reviews dataset for given percentage of the dataset.
    Returns train-test sets
    X--> returned as list of words in lower case
    y--> returned as two classes 0 and 1 for bad and good reviews
    """
    train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], batch_size=-1, as_supervised=True)

    train_sentences, y_train = tfds.as_numpy(train_data)
    test_sentences, y_test = tfds.as_numpy(test_data)

    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)

        len_train = int(percentage_of_sentences/100*len(train_sentences))
        train_sentences, y_train = train_sentences[:len_train], y_train[:len_train]

        len_test = int(percentage_of_sentences/100*len(test_sentences))
        test_sentences, y_test = test_sentences[:len_test], y_test[:len_test]

    X_train = [text_to_word_sequence(_.decode("utf-8")) for _ in train_sentences]
    X_test = [text_to_word_sequence(_.decode("utf-8")) for _ in test_sentences]

    return X_train, y_train, X_test, y_test

def load_data_sent(percentage_of_sentences=10):
    """
    Load the imdb_reviews dataset for given percentage of the dataset.
    Returns train-test sets
    X--> returned as sentences in lower case
    y--> returned as two classes 0 and 1 for bad and good reviews
    """
    X_train, y_train, X_test, y_test = load_data(percentage_of_sentences)
    X_train = [' '.join(_) for _ in X_train]
    X_test = [' '.join(_) for _ in X_test]
    return X_train, y_train, X_test, y_test

NLPmoviereviews.utilities contains helper functions such as preprocessing and embed_sentence_with_TF.

import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocessing(sentence):
    """
    Use NLTK to clean text: remove numbers, stop words, and lemmatize verbs and nouns
    """
    # Basic cleaning
    sentence = sentence.strip()  # remove whitespaces
    sentence = sentence.lower()  # lowercasing
    sentence = ''.join(char for char in sentence if not char.isdigit())  # removing numbers
    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')  # removing punctuation
    tokenized_sentence = word_tokenize(sentence)  # tokenizing
    stop_words = set(stopwords.words('english'))  # defining stopwords
    tokenized_sentence_cleaned = [w for w in tokenized_sentence
                                  if w not in stop_words]  # remove stopwords
    lemmatizer = WordNetLemmatizer()  # create once and reuse for every word
    # 1 - Lemmatizing the verbs
    verb_lemmatized = [lemmatizer.lemmatize(word, pos="v")  # v --> verbs
                       for word in tokenized_sentence_cleaned]
    # 2 - Lemmatizing the nouns
    noun_lemmatized = [lemmatizer.lemmatize(word, pos="n")  # n --> nouns
                       for word in verb_lemmatized]
    cleaned_sentence = ' '.join(noun_lemmatized)
    return cleaned_sentence

4. Loading the data

# load data
X_train, y_train, X_test, y_test = load_data_sent(percentage_of_sentences=10)
X_train

X_train is a list in which each element is the text of one review, as shown below.

["this is a big step down after the surprisingly enjoyable original this sequel isn't nearly as fun as part one and it instead spends too much time on plot development tim thomerson is still the best thing about this series but his wisecracking is toned down in this entry the performances are all adequate but this time the script lets us down the action is merely routine and the plot is only mildly interesting so i need lots of silly laughs in order to stay entertained during a trancers movie unfortunately the laughs are few and far between and so this film is watchable at best",
 "perhaps because i was so young innocent and brainwashed when i saw it this movie was the cause of many sleepless nights for me i haven't seen it since i was in seventh grade at a presbyterian school so i am not sure what effect it would have on me now however i will say that it left an impression on me and most of my friends it did serve its purpose at least until we were old enough and knowledgeable enough to analyze and create our own opinions i was particularly terrified of what the newly converted post rapture christians had to endure when not receiving the mark of the beast i don't want to spoil the movie for those who haven't seen it so i will not mention details of the scenes but i can still picture them in my head and it's been 19 years",
 ...]

y_train stores the polarity of each text: 0 (negative) or 1 (positive).

y_train


5. Data preprocessing

The rm_custom_stops function removes custom stop words.

# remove custom stop-words
def rm_custom_stops(sentence):
    '''
    Custom stop word remover
    Parameters:
        sentence (str): a string of words
    Returns:
        list_of_words (list): cleaned sentence as a list of words
    '''
    words = sentence.split()
    stop_words = {'br', 'movie', 'film'}
    
    return [w for w in words if not w in stop_words]

The process_df function cleans the data and converts its format.

# perform preprocessing (cleaning) & transform to dataframe
def process_df(X, y):
    '''
    Transform texts and labels into dataframe of 
    cleaned texts (as list of words) and human readable target labels
    
    Parameters:
        X (list): list of strings (reviews)
        y (list): list of target labels (0/1)
    Returns:
        df (dataframe): dataframe of processed reviews (as list of words)
                        and corresponding sentiment label (positive/negative)
    '''
    # create dataframe from data
    d = {'text': X, 'sentiment': y}
    df = pd.DataFrame(d)
    
    # make sentiment human-readable
    df['sentiment'] = df.sentiment.map(lambda x: 'positive' if x==1 else 'negative')

    # clean and split text into list of words
    df['text'] = df.text.apply(preprocessing)
    df['text'] = df.text.apply(rm_custom_stops)

    # return the cleaned dataframe (feature sets are built in a later step)
    return df

Now process the data.

# process data
train_df = process_df(X_train, y_train)
test_df = process_df(X_test, y_test)

Inspect the converted training data, train_df:

# inspect dataframe
train_df.head()


6. Getting the most common words

Get the frequency distribution of the words in the corpus and select the 2,000 most common words.

# get frequency distribution of words in corpus & select 2000 most common words
def most_common(df, n=2000):
    '''
    Get n most common words from data frame of text reviews
    
    Parameters:
        df (dataframe): dataframe with column of processed text reviews
        n (int): number of most common words to get
    Returns:
        most_common_words (list): list of n most common words
    '''
    # create list of all words in the train data
    complete_corpus = df.text.sum()
    
    # Construct a frequency dict of all words in the overall corpus 
    all_words = nltk.FreqDist(w.lower() for w in complete_corpus)

    # select the n most frequent words (incl. frequency)
    most_common_words = all_words.most_common(n)
    
    return [item[0] for item in most_common_words]

# get 2000 most common words
most_common_2000 = most_common(train_df)

# inspect first 10 most common words
most_common_2000[0:10]
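nltk.FreqDist is a subclass of collections.Counter, so the frequency step can be tried out without NLTK. Below is a small sketch on made-up token lists (the data and the name toy_reviews are illustrative only). It flattens the lists with itertools.chain, which is one way to avoid the repeated list concatenation that df.text.sum() performs on a large corpus.

```python
from collections import Counter
from itertools import chain

# Made-up tokenized reviews standing in for df.text
toy_reviews = [
    ["great", "story", "great", "cast"],
    ["bad", "story"],
    ["great", "ending"],
]

# Flatten the per-review lists into one stream of tokens
all_tokens = chain.from_iterable(toy_reviews)

# Count occurrences and keep the 2 most frequent words
counts = Counter(w.lower() for w in all_tokens)
top_words = [word for word, _ in counts.most_common(2)]
print(top_words)  # ['great', 'story']
```

In the real pipeline the same idea would apply with n=2000 and the token lists from train_df.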


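7. Creating the NLTK feature set

For the NLTK Naive Bayes classifier, each review has to be tokenized, and we then check which words it shares with all_words / most_common_words. These presence flags make up the features of the review. (Note: this is essentially building features with a bag-of-words model.)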

# for a given text, create a featureset (dict of features - {'word': True/False})
def review_features(review, most_common_words):
    '''
    Feature extractor that checks whether each of the most
    common words is present in a given review
    
    Parameters:
        review (list): text reviews as list of words
        most_common_words (list): list of n most common words
    Returns:
        features (dict): dict of most common words & corresponding True/False
    '''
    review_words = set(review)
    features = {}
    for word in most_common_words:
        features['contains(%s)' % word] = (word in review_words)
    return features

# create featureset for each text in a given dataframe
def make_set(df, most_common_words):
    '''
    Generates nltk featuresets for each movie review in dataframe.
    Feature sets are composed of a dict describing whether each of the most 
    common words is present in the text review or not

    Parameters:
        df (dataframe): processed dataframe of text reviews
        most_common_words (list): list of most common words
    Returns:
        feature_set (list): list of dicts of most common words & corresponding True/False
    '''
    # pair each review with its label directly, independent of the dataframe index
    return [(review_features(text, most_common_words), sentiment)
            for text, sentiment in zip(df.text, df.sentiment)]

# make data into featuresets (for nltk naive bayes classifier)
train_set = make_set(train_df, most_common_2000)
test_set = make_set(test_df, most_common_2000)

# inspect first train featureset
train_set[0]

({'contains(one)': True,
  'contains(make)': False,
  'contains(like)': False,
  'contains(see)': False,
  'contains(get)': False,
  'contains(time)': True,
  'contains(good)': False,
  'contains(watch)': False,
  'contains(character)': False,
  'contains(story)': False,
  'contains(go)': False,
  'contains(even)': False,
  'contains(think)': False,
  'contains(really)': False,
  'contains(well)': False,
  'contains(show)': False,
  'contains(would)': False,
  'contains(scene)': False,
  'contains(end)': False,
  'contains(look)': False,
  'contains(much)': True,
  'contains(say)': False,
  'contains(know)': False,
  ...},
 'negative')

8. Training and evaluating the model

We use the Naive Bayes classifier provided by nltk (NaiveBayesClassifier).

# Train a naive bayes classifier with train set by nltk
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Get the accuracy of the naive bayes classifier with test set
accuracy = nltk.classify.accuracy(classifier, test_set)
accuracy
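To make concrete what training and classification involve, here is a simplified, pure-Python sketch of a Naive Bayes classifier over boolean features. It is not NLTK's implementation: the function names train_nb and classify_nb, the toy data, and the add-0.5 smoothing for two-valued features are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_featuresets, smoothing=0.5):
    """Count labels and (label, feature, value) occurrences from (features, label) pairs."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    for features, label in labeled_featuresets:
        for name, value in features.items():
            value_counts[label, name][value] += 1
    return label_counts, value_counts, smoothing


def classify_nb(model, features):
    """Return the label with the highest log posterior score."""
    label_counts, value_counts, smoothing = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior
        for name, value in features.items():
            # Smoothed estimate for a boolean feature (two possible values)
            count = value_counts[label, name][value]
            score += math.log((count + smoothing) / (n_label + 2 * smoothing))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data in the same (features, label) shape as train_set
toy_train = [
    ({"contains(great)": True, "contains(awful)": False}, "positive"),
    ({"contains(great)": True, "contains(awful)": False}, "positive"),
    ({"contains(great)": False, "contains(awful)": True}, "negative"),
    ({"contains(great)": False, "contains(awful)": True}, "negative"),
]
model = train_nb(toy_train)
print(classify_nb(model, {"contains(great)": True, "contains(awful)": False}))
```

Working in log space avoids numerical underflow when thousands of per-feature probabilities are multiplied, which matters with 2,000 features per review.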


# build reference and test set of observed values (for each label)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

# evaluate on the held-out test set, not on the data the model was trained on
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)  # indices grouped by true label
    observed = classifier.classify(feats)  # label predicted from the features
    testsets[observed].add(i)  # indices grouped by predicted label

# print precision, recall, and f-measure
print('pos precision:', precision(refsets['positive'], testsets['positive']))
print('pos recall:', recall(refsets['positive'], testsets['positive']))
print('pos F-measure:', f_measure(refsets['positive'], testsets['positive']))
print('neg precision:', precision(refsets['negative'], testsets['negative']))
print('neg recall:', recall(refsets['negative'], testsets['negative']))
print('neg F-measure:', f_measure(refsets['negative'], testsets['negative']))
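The set-based metrics used above compare, for one label, the set of example indices that truly carry the label with the set the classifier assigned to it. A minimal pure-Python sketch of the same idea follows. The function names set_precision, set_recall and set_f1 and the example ids are made up for illustration, and the F-measure shown is the balanced (F1) variant.

```python
def set_precision(reference, test):
    """Fraction of predicted items that truly belong to the class."""
    return len(reference & test) / len(test) if test else None


def set_recall(reference, test):
    """Fraction of true class members that were predicted."""
    return len(reference & test) / len(reference) if reference else None


def set_f1(reference, test):
    """Harmonic mean of precision and recall."""
    p, r = set_precision(reference, test), set_recall(reference, test)
    if p is None or r is None:
        return None
    if p == 0 or r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Made-up example: ids of reviews that are truly positive vs. predicted positive
true_positive_ids = {0, 1, 2, 3}
predicted_positive_ids = {2, 3, 4}

print(set_precision(true_positive_ids, predicted_positive_ids))  # 2 of 3 predictions are correct
print(set_recall(true_positive_ids, predicted_positive_ids))     # 2 of 4 positives are found
print(set_f1(true_positive_ids, predicted_positive_ids))
```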

Show the top n most informative features:

# show top n most informative features
classifier.show_most_informative_features(10)


9. Prediction

# predict on new review (from mubi.com)
new_review = "Surprisingly effective and moving, The Balcony Movie takes the Front Up \
            concept of talking to strangers, but here attaches it to a fixed perspective \
            in order to create a strong sense of the stream of life passing us by. \
            It's possible to not only witness the subtle changing of seasons\
            but also the gradual opening of trust and confidence in Lozinski's \
            repeating characters. A Pandemic movie, pre-pandemic. 3.5 stars"
# perform preprocessing (cleaning & featureset transformation)
processed_review = rm_custom_stops(preprocessing(new_review))
processed_review = review_features(processed_review, most_common_2000)
# predict label
classifier.classify(processed_review)

Get the probability of each word under each label:

# to get individual probability for each label and word, taken from:
# https://stackoverflow.com/questions/20773200/python-nltk-naive-bayes-probabilities
for label in classifier.labels():
    print(f'\n\n{label}:')
    for (fname, fval) in classifier.most_informative_features(50):
        print(f"   {fname}({fval}): ", end="")
        print("{0:.2f}%".format(100*classifier._feature_probdist[label, fname].prob(fval)))
negative:
   contains(delightful)(True): 0.12%
   contains(absurd)(True): 2.51%
   contains(beautifully)(True): 0.28%
   contains(noir)(True): 0.20%
   contains(unfunny)(True): 2.03%
   contains(magnificent)(True): 0.20%
   contains(poorly)(True): 4.49%
   contains(dreadful)(True): 1.71%
   contains(worst)(True): 15.63%
   contains(waste)(True): 12.29%
   contains(turkey)(True): 1.47%
   contains(vietnam)(True): 1.47%
   contains(restore)(True): 0.20%
   contains(lame)(True): 4.73%
   contains(brilliantly)(True): 0.28%
   contains(awful)(True): 8.15%
   contains(garbage)(True): 3.14%
   contains(worse)(True): 8.39%
   contains(intense)(True): 0.44%
   contains(wonderfully)(True): 0.36%
   contains(laughable)(True): 2.59%
   contains(unbelievable)(True): 2.90%
   contains(finest)(True): 0.36%
   contains(pointless)(True): 3.30%
   contains(crap)(True): 5.85%
   contains(trial)(True): 0.28%
   contains(disappointment)(True): 3.62%
   contains(warm)(True): 0.36%
   contains(unconvincing)(True): 1.47%
   contains(lincoln)(True): 0.12%
   contains(underrate)(True): 0.36%
   contains(pathetic)(True): 2.98%
   contains(unfold)(True): 0.36%
   contains(zero)(True): 2.11%
   contains(existent)(True): 1.71%
   contains(shallow)(True): 1.71%
   contains(dull)(True): 5.37%
   contains(cheap)(True): 4.18%
   contains(mess)(True): 4.89%
   contains(perfectly)(True): 0.91%
   contains(ridiculous)(True): 5.85%
   contains(excuse)(True): 3.70%
   contains(che)(True): 0.12%
   contains(gritty)(True): 0.36%
   contains(pleasant)(True): 0.36%
   contains(mediocre)(True): 2.59%
   contains(rubbish)(True): 1.55%
   contains(insult)(True): 2.90%
   contains(porn)(True): 1.87%
   contains(douglas)(True): 0.36%


positive:
   contains(delightful)(True): 1.97%
   contains(absurd)(True): 0.20%
   contains(beautifully)(True): 3.33%
   contains(noir)(True): 2.37%
   contains(unfunny)(True): 0.20%
   contains(magnificent)(True): 1.73%
   contains(poorly)(True): 0.52%
   contains(dreadful)(True): 0.20%
   contains(worst)(True): 1.89%
   contains(waste)(True): 1.65%
   contains(turkey)(True): 0.20%
   contains(vietnam)(True): 0.20%
   contains(restore)(True): 1.33%
   contains(lame)(True): 0.76%
   contains(brilliantly)(True): 1.73%
   contains(awful)(True): 1.33%
   contains(garbage)(True): 0.52%
   contains(worse)(True): 1.41%
   contains(intense)(True): 2.61%
   contains(wonderfully)(True): 2.13%
   contains(laughable)(True): 0.44%
   contains(unbelievable)(True): 0.52%
   contains(finest)(True): 1.97%
   contains(pointless)(True): 0.60%
   contains(crap)(True): 1.08%
   contains(trial)(True): 1.49%
   contains(disappointment)(True): 0.68%
   contains(warm)(True): 1.89%
   contains(unconvincing)(True): 0.28%
   contains(lincoln)(True): 0.60%
   contains(underrate)(True): 1.81%
   contains(pathetic)(True): 0.60%
   contains(unfold)(True): 1.73%
   contains(zero)(True): 0.44%
   contains(existent)(True): 0.36%
   contains(shallow)(True): 0.36%
   contains(dull)(True): 1.16%
   contains(cheap)(True): 0.92%
   contains(mess)(True): 1.08%
   contains(perfectly)(True): 4.06%
   contains(ridiculous)(True): 1.33%
   contains(excuse)(True): 0.84%
   contains(che)(True): 0.52%
   contains(gritty)(True): 1.57%
   contains(pleasant)(True): 1.57%
   contains(mediocre)(True): 0.60%
   contains(rubbish)(True): 0.36%
   contains(insult)(True): 0.68%
   contains(porn)(True): 0.44%
   contains(douglas)(True): 1.49%

For example, delightful has a probability of 0.12% under negative and 1.97% under positive, which gives a ratio of roughly 16.5 : 1.0 (1.97% : 0.12%).
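The ratio can be recomputed from the two printed percentages. A quick sketch (the variable names are chosen here for illustration): because the listing shows values rounded to two decimals, the result comes out at about 16.4 rather than exactly 16.5, and the small gap is presumably due to that rounding.

```python
# Rounded percentages copied from the listing above
p_positive = 1.97  # P(contains(delightful)=True | positive), in percent
p_negative = 0.12  # P(contains(delightful)=True | negative), in percent

# Likelihood ratio: how much more likely the word is in positive reviews
ratio = p_positive / p_negative
print(f"positive : negative = {ratio:.1f} : 1.0")
```

A word with a ratio far from 1 in either direction is what makes a feature informative for the classifier.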
