mllib NaiveBayes 中的类数量有限制吗？调用 model.save() 时出错

2023-12-03

我正在尝试训练一个模型来预测文本输入数据的类别。我使用以下方法遇到了似乎数值不稳定的问题pyspark.ml.classification.NaiveBayes当类别数量超过一定数量时，对词袋进行分类。

在我的现实世界项目中，我有大约 10 亿条记录和大约 50 个类。我能够训练我的模型并做出预测，但是当我尝试使用保存它时出现错误model.save()。从操作上来说，这很烦人，因为我每次都必须从头开始重新训练我的模型。

在尝试调试时，我将数据缩小到大约 10k 行，并在尝试保存时遇到了相同的问题。但是，如果我减少类标签的数量，保存效果很好。

这让我相信标签的数量是有限的。我无法重现我的确切问题，但下面的代码是相关的。如果我设置num_labels任何大于 31 的值，model.fit()抛出错误。

我的问题：

班级人数有限制吗mllib实施NaiveBayes?
如果我可以成功地使用模型进行预测，那么我无法保存模型的原因可能是什么？
如果确实存在限制，是否可以将我的数据分成更小的类别组，训练单独的模型，然后组合？

完整的工作示例

创建一些虚拟数据。

我要使用nltk.corpus.comparitive_sentences and nltk.corpus.sentence_polarity。请记住，这只是一个带有无意义数据的说明性示例 - 我不关心拟合模型的性能。

import pandas as pd
from pyspark.sql.types import StringType

# create some dummy data
from nltk.corpus import comparative_sentences, sentence_polarity
df = pd.DataFrame(
    {
        'sentence': [" ".join(s) for s in cs.sents() + sp.sents()]
    }
)

# assign a 'category' to each row
num_labels = 31  # seems to be the upper limit
df['category'] = (df.index%num_labels).astype(str)

# make it into a spark dataframe
spark_df = sqlCtx.createDataFrame(df)

数据准备管道

from pyspark.ml.feature import NGram, Tokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vector

indexer = StringIndexer(inputCol='category', outputCol='label')
tokenizer = Tokenizer(inputCol="sentence", outputCol="sentence_tokens")
remove_stop_words = StopWordsRemover(inputCol="sentence_tokens", outputCol="filtered")
unigrammer = NGram(n=1, inputCol="filtered", outputCol="tokens") 
hashingTF = HashingTF(inputCol="tokens", outputCol="hashed_tokens")
idf = IDF(inputCol="hashed_tokens", outputCol="tf_idf_tokens")

clean_up = VectorAssembler(inputCols=['tf_idf_tokens'], outputCol='features')

data_prep_pipe = Pipeline(
    stages=[indexer, tokenizer, remove_stop_words, unigrammer, hashingTF, idf, clean_up]
)
transformed = data_prep_pipe.fit(spark_df).transform(spark_df)
clean_data = transformed.select(['label','features'])

训练模型

from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=12345)
model = nb.fit(training)
test_results = model.transform(testing)

评估模型

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting label was: {}".format(acc))

在我的机器上，打印：

Accuracy of model at predicting label was: 0.0305764788269

错误信息

如果我改变num_labels到 32 或更高，这是我调用时收到的错误model.fit():

Py4JJavaError：调用 o1336.fit 时发生错误。 : org.apache.spark.SparkException：作业由于阶段失败而中止：阶段 86.0 中的任务 0 失败了 4 次，最近一次失败：丢失任务 0.3 阶段 86.0（TID 1984，someserver.somecompany.net，执行器 22）：org.apache.spark.SparkException：Kryo 序列化失败：缓冲区溢出。可用：7，必需：8 序列化跟踪：值（org.apache.spark.ml.linalg.DenseVector）。为了避免这种情况，增加 Spark.kryoserializer.buffer.最大值。 ... ... 等等等等更多永远持续下去的java东西

Notes

在此示例中，如果我添加二元组的功能，则在以下情况下会发生错误num_labels> 15. 我想知道这也是小于 2 的幂 1 是否是巧合。
在我的实际项目中，我在尝试调用时也会遇到错误model.theta。（我认为错误本身没有意义 - 它们只是从 java/scala 方法传回的异常。）

硬限制:

Number of features * Number of classes has to be lower Integer.MAX_VALUE (2³¹ - 1). You are nowhere near these value.

软限制:

Theta 矩阵（条件概率）的大小为特征数 * 类数。 Theta 既存储在驱动程序本地（作为模型的一部分），又序列化并发送给工作人员。这意味着所有机器至少需要足够的内存来序列化或反序列化并存储结果。

Since you use default settings for HashingTF.numFeatures (2²⁰) each additional class adds 262144 - it is not that much, but quickly adds up. Based on the partial traceback you've posted, it looks like the failing component is Kryo serializer. The same traceback also suggests the solution, which is increasing spark.kryoserializer.buffer.max.

您还可以通过设置尝试使用标准 Java 序列化：

 spark.serializer org.apache.spark.serializer.JavaSerializer

由于您使用 PySparkpyspark.ml and pyspark.sql在没有显着性能损失的情况下，这可能是可以接受的。

除了配置之外，我将重点关注功能工程组件。使用二进制CountVetorizer（请参阅关于HashingTF下）与ChiSqSelector可能提供一种既提高可解释性又有效减少特征数量的方法。您还可以考虑更复杂的方法（确定特征重要性并仅在数据子集上应用朴素贝叶斯、更高级的文本处理（例如词形还原/词干提取）或使用自动编码器的某些变体来获得更紧凑的向量表示）。

Notes:

请记住，跨国朴素贝叶斯仅考虑二元特征。NaiveBayes将在内部处理这个问题，但我仍然建议使用setBinary为了清楚起见。
可以说HashingTF这里是相当无用的。除了哈希冲突之外，高度稀疏的特征和本质上无意义的特征，使得它作为预处理步骤的糟糕选择NaiveBayes.

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)