SGDClassifier gives a different accuracy on every run for text classification

2024-04-21

I am using an SVM classifier to classify text as either good text or gibberish. I am using Python's scikit-learn, as follows:

'''
Created on May 5, 2017
'''

import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

# Prepare data

def prepare_data(data):
    """
    data is expected to be a list of (category, text) tuples.
    Shuffles data in place, then returns a tuple of labels and a tuple of texts.
    """
    random.shuffle(data)
    return zip(*data)

# Format training data

training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner.  he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I"),
    ("good", "I have problems scheduling blocks they are never any available.  Can I do full time?  Can I get scheduled more than one day a month?"),
    ("good", "Suggestion: easier way to sign in due alleviate the tediousness of periodically having to sign back in to the app to check for blocks."),
    ("good", "I am so glad to hear from you. "),
    ("good", "loading on today's itinerary takes ages!!!!!! time consuming when you have 150+ packages to deliver!!!!!"),
    ("good", "due to the new update that makes hours available at 10 pm. if you worked 8 hours that day you can't see next day hours due to 8 hour limit. please fix this"),
    ("good", "omg, PLEASE make it so we don't have to sign in every time we need to go into the app. At least make it good for a week. Thanks."),
    ("good", "Constantly being logged out of app, if we could have a continuous login so we could receive notifications if blocks are available that would be ideal."),
    ("good", "I am having problems  with the App. Every time I exit the App and reopen it asks for my login info."),
    ("good", "15 minute service time due to 33rd floor and  20l lbs of cargo"),
    ("good", "I have been sceduled 1 block in 3 weeks. I check for new block availability multiple times a day and have not seen 1 available in three weeks. is there any way to get more blocks."),
    ("good", "When will delivery jobs be available? Everytime I open this app, it says nothing is available. Have deliveries in Cincinnati started yet?"),
    ("good", "During delivery had to call customer support and after 10 minutes support person couldn't find my pick up location Kirkland /Bellevue and told me to hang up and call different support team.  Support person were unprofessional and rude, which is not acceptable."),
    ("good", "can you please remove the pick up from my phone"),
    ("good", "Dear friends: I'm very very happy it's a big oportunitt"),
    ("good", "THANK YOU so much for the block you assigned me for next week.   If you have an additional 5 blocks please go ahead and assign them to me for next week.  My availability is updated and current.  You guys are awesome!!!"),
    ("good", "after update every time I open app I have too log in! I used to be able to stay logged in unless I logged out, can you return stay logged in option."),
    ("good", "It looks like my app is not installed properly on my android phone, Note 5. I cannot access or do not see the tab to swipe to start delivering and the map or help button that should be visible for me to work today 5/6 at rpm"),
    ("gibber", "AF0000"),
    ("good", "awesome app, awesome hiring process, awesome delivery warehouse , awesome team and help in the field! lets deliver I would like more more more delivers , looking forward to the future ! I just bought a new delivery vehical !"),
    ("good", "I will like to ask why I can't get more delivery's only one in two weeks"),
    ("good", "device too slow software crashing all day"),
    ("good", "it doesn't work sometimes."),
    ("good", "can you please remove the old sprouts pick up from my phone"),
    ("good", "They ability to zoom in on text screens would be very helpful. Am example would be customer notes when viewing in certain lighting conditions can be difficult."),
    ("good", "I missed out on a delivery day when I clicked check in and waited for my turn to get an order only to find out that not only did my check in not register but the gps showed me down the street. I encountered this issue again when one of the warehouse employees placed an order for that location and the app wanted me to drive in a big circle to get back to where I was standing."),
    ("good", "i am a little concerned that i didn't receive any blocks of time for this coming week, even though i had a perfect delivery score from this past pay period. Did the Cincinnati market over hire drivers where there are many people being shut completely out of any delivery blocks for an entire week? i really enjoy this type of work and the app makes it quite convenient."),
    ("good", "I've arrived at the pick up restaurant but the staff did not have the barecode for me to scan, however I pick up the package and deliver but my is still not let me move on"),
    ("good", "might want to check my assigned hours for next week.  5am to 1pm??"),
    ("good", "hi team--just want to give some positive feedback.  I have had nothing but positive feedback from customers. Great support when calling help line. Thank you for this opportunity and if there is ever a situation where you need drivers immediately I will drop what I'm doing and help. You guys are the best."),
    ("good", "Allow days or blocks throughout the day to be modified after General availability is set up for time off like doctors appointments."),
    ("gibber", "AL001234"),
    ("good", "Please, enlight me."),
    ("good", "it only shows my schedule starting in two weeks. when will we be able to start work"),
    ("good", "include more packages for one block, if the packages can be fitted into the car, so driver don't have to come back and pickup every two hours. 25% of the time is wasted coming back for pick up."),
    ("gibber", "BBB h"),
    ("gibber", "AG0003006033SDgCJ12344"),
    ("gibber", "How will I know if I"),
    ("good", "please bring back some sort of hours cap! or possibly stagger the hour drops from 1200 to 1203 so that people with slower internet/slower phone arent at a disadvantage!"),
    ("good", "when the hours released tonight all of the people who didn't have 40 hours could see them.    however the drivers that are capped at 40 were unable to see them due to a flawed system.  please fix the system so that we are not continually treated unfairly like all of the drivers that whined so much and got us in to this mess.  the cap system is unfair to people that want to work and it caused problems with a lack of drivers  to deliver today at the hub.  obviously this is not a good system and benefits no one."),
    ("good", "You have seriously messed up the whole scheduling process. Why can't I get any blocks at 10 even if I wait exactly until 10? Midnight was much better. So now that scheduling is a huge random pain in the ass, why would people want to keep doing this? I haven't been able to schedule work for three days now, it's quite frustrating when I don't get a chance to sign up, even when I'm diligent with timing."),
    ("good", "Seriously, that's all I'm going to get is one lousy day? Tell me again why you need drivers if all we get is one day. I'm not sure this is gonna work out for me. I waited forever to get my background check back and this is what I get? smh"),
    ("good", "doesn't save updated access codes"),
    ("good", "the scheduling of my route is nor done very accurately. it keeps me driving back and forth"),
    ("good", "can't understand how to pick up a block. my availability is wide open. when you guys send the alerts about blocks available I open it real quick and there is nothing there. I do it in a matter of seconds"),
    ("good", "My availability keeps disappearing from my calender.   I set my availability for three weeks in advance. The gray dots are visible  but disappear on Wednesday or Thursday.  This makes it impossible for me to see and choose available blocks for the upcoming week. How can I get it fix.   Mike"),
    ("good", "GPS blank screen"),
    ("gibber", "sea swq"),
    ("gibber", "hiw o"),
    ("gibber", "Dr a"),
    ("gibber", "quick to quick to u uhu wu just us"),
    ("gibber", "Awa what's"),
    ("gibber", "wxdfcs"),
    ("gibber", "7k9opu"),
    ("gibber", "o.m.day day"),
    ("gibber", "GGT part his h"),
    ("gibber", "aawfhg"),
    ("gibber", "seesaw 2s"),
    ("gibber", "wawaa"),
    ("gibber", "of ll"),
    ("gibber", "rewards"),
    ("gibber", "mmqqm5my"),
    ("gibber", ".in w"),
    ("gibber", "play r"),
    ("gibber", "was wwnw www www n"),
    ("gibber", "wqq2fwqq2fz22"),
    ("gibber", "not"),
    ("gibber", "I by yu I"),
    ("gibber", "Hi just wanted to let you know that it's bee"),
    ("gibber", "I erroneously v"),
    ("gibber", "I find it"),
    ("gibber", "bqyyx I a"),
    ("gibber", "are are"),
    ("gibber", "wawi waarnnnkwn"),
    ("gibber", "t Petey ueteu he"),
    ("gibber", "ews ri"),
    ("gibber", "bd xd"),
    ("gibber", "hatpa"),
    ("gibber", "se wests tasgt"),
    ("gibber", "wa vgcx azc Jo of"),
    ("gibber", "2w222"),
    ("gibber", "her u t b"),
    ("gibber", "ddddedc"),
    ("gibber", "just juju in hiking"),
    ("gibber", "wew2ww2wwwew2i2wkkk"),
    ("gibber", "meleeee"),
    ("gibber", "Aaq wqXD"),


]
training_labels, training_texts = prepare_data(training_data)


# Format test data

test_data = [

    ("gibber", "an quality"),
    ("good", "Can't check in.   Time was 4:06.  I didn't drive out here for no reason."),
    ("good", "can you do view all full address including postal code how it's in old app that helps do correctly delivery and not waist customer time"),
    ("good", "i am available again starting at 10am to 10pm. thanks"),
    ("gibber", "Hello, I encountered"),
    ("good", "I want to know how we are notified if there is a block I have been signed in and haven't been given a block yet"),
    ("gibber", "aawaaw"),
    ("gibber", "eeeeeeeeene"),
    ("good", "I am not getting enough shifts"),
    ("gibber", "hey e75k"),
    ("good", "my screen had went black or inverted"),
    ("good", "maps packed up again in sr20ls"),
    ("good", "how to clear my itinerary from old pickup address ?"),
    ("good", "keep signing me out."),
    ("good", "For alcohol delivery,  where does customer sign?"),
    ("gibber", "t Petey ueteu he"),
    ("good", "can't get blocks.  too many drivers ??"),
    ("good", "got a new phone how do i download to new phone")



]
test_labels, test_texts = prepare_data(test_data)


# Create feature vectors

"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels


# Train the classifier


clf = SGDClassifier()
clf.fit(X, y)


# Test performance

X_test = vectorizer.transform(test_texts)
y_test = test_labels

# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)

# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
print(annotated_test_data)

# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))

However, I get a different accuracy every time I run it. Why does this happen?

Update: I moved training_data into a text file and now read it in the code above like this:

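from ast import literal_eval

# Each line of file.txt holds one ("label", "text") tuple, optionally followed by a comma
training_data = []
with open("file.txt") as f:
    for line in f:
        result = line.strip().rstrip(',')
        if result:  # skip blank lines
            training_data.append(literal_eval(result))

training_labels, training_texts = prepare_data(training_data)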

I also changed this line in the code above:

clf = SGDClassifier(random_state=5000)

So random_state is no longer None. However, I still get a different accuracy every time!


This is because your prepare_data() function shuffles the data randomly. This is the line that does it:

random.shuffle(data)

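The shuffle changes the order in which the estimator sees the training samples, which affects training and therefore the results.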
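If you would rather keep shuffling, one option is to shuffle with a seeded generator so that the order is the same on every run. This is only a minimal sketch: the `seed` parameter is a hypothetical addition that is not in the original code, and the sample rows are a handful taken from the question's data.

```python
import random


def prepare_data(data, seed=0):
    """Shuffle a copy of (label, text) pairs reproducibly and split them.

    `seed` is a hypothetical parameter added for illustration only.
    """
    shuffled = list(data)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG: same order on every run
    labels, texts = zip(*shuffled)
    return list(labels), list(texts)


sample = [
    ("good", "it doesn't work sometimes."),
    ("gibber", "wxdfcs"),
    ("good", "GPS blank screen"),
    ("gibber", "hatpa"),
]

first = prepare_data(sample)
second = prepare_data(sample)
print(first == second)  # the two calls produce the same ordering
```

Using a private `random.Random(seed)` instance rather than `random.seed(...)` keeps the rest of the program's random state untouched.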

Try commenting out or removing that line while keeping random_state set in SGDClassifier. You will then get exactly the same results every time.
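As a rough check of this, the sketch below trains twice on the same data in the same order with a fixed random_state and compares the predictions. The helper name `train_and_predict` and the reduced data set are assumptions made for illustration; the rows themselves come from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# A few rows from the question's data, kept in a fixed order (no shuffle).
training_data = [
    ("good", "device too slow software crashing all day"),
    ("good", "it doesn't work sometimes."),
    ("good", "doesn't save updated access codes"),
    ("gibber", "wxdfcs"),
    ("gibber", "sea swq"),
    ("gibber", "hatpa"),
]
test_texts = ["keep signing me out.", "aawaaw"]


def train_and_predict():
    """Fit a fresh vectorizer and classifier, then predict the test texts."""
    labels, texts = zip(*training_data)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = SGDClassifier(random_state=5000)  # fixed seed, as in the question
    clf.fit(X, labels)
    return list(clf.predict(vectorizer.transform(test_texts)))


run_a = train_and_predict()
run_b = train_and_predict()
print(run_a == run_b)  # same data order and same seed give the same predictions
```

Both sources of randomness have to be pinned: the order of the training samples and the classifier's own random_state.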

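Suggestion: try different estimators to see which one performs best. If you are keen on SGDClassifier, I suggest looking into the n_iter parameter (renamed max_iter in later scikit-learn releases). Try increasing it, and you should see the differences in accuracy get smaller and smaller, even if you shuffle the data.

See this answer for more details:

  • https://datascience.stackexchange.com/a/9794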
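One way to see how much the iteration count matters on your own data is to measure the spread of accuracy over several shuffles. The following is a sketch under assumptions: it uses `max_iter` (the name of this parameter in newer scikit-learn releases) with `tol=None` so that early stopping does not cut training short, and `accuracy_spread` is a hypothetical helper run on a small subset of the question's data. It does not guarantee any particular numbers.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

train = [
    ("good", "device too slow software crashing all day"),
    ("good", "it doesn't work sometimes."),
    ("good", "doesn't save updated access codes"),
    ("good", "can you please remove the pick up from my phone"),
    ("good", "I am so glad to hear from you. "),
    ("gibber", "wxdfcs"),
    ("gibber", "sea swq"),
    ("gibber", "bd xd"),
    ("gibber", "hatpa"),
    ("gibber", "wawaa"),
]
test = [
    ("good", "keep signing me out."),
    ("good", "I am not getting enough shifts"),
    ("gibber", "aawaaw"),
    ("gibber", "hey e75k"),
]


def accuracy_spread(max_iter, n_shuffles=10):
    """Return (min, max) test accuracy over several shuffles of the training data."""
    test_labels, test_texts = zip(*test)
    scores = []
    for seed in range(n_shuffles):
        rows = list(train)
        random.Random(seed).shuffle(rows)  # a different, but repeatable, order
        labels, texts = zip(*rows)
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(texts)
        # tol=None disables early stopping, so all max_iter epochs are run
        clf = SGDClassifier(max_iter=max_iter, tol=None, random_state=5000)
        clf.fit(X, labels)
        predictions = clf.predict(vectorizer.transform(test_texts))
        scores.append(accuracy_score(test_labels, predictions))
    return min(scores), max(scores)


for max_iter in (1, 5, 1000):
    print(max_iter, accuracy_spread(max_iter))
```

If the gap between the minimum and maximum narrows as max_iter grows, the model is converging to roughly the same solution regardless of sample order.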