I want to use NLTK via PySpark to do some NLP on Databricks.
I have installed NLTK from the Databricks library tab, so it should be accessible from all nodes.
My Python 3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk
nltk.download('punkt')
def get_keywords1(col):
    # split the input string into a list of sentences
    sentences = nltk.sent_tokenize(col)
    return str(sentences)

get_keywords_udf = F.udf(get_keywords1, StringType())
I run the code above and get:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
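Before going further, it can help to confirm where NLTK actually searches on the driver. A minimal check, assuming nothing beyond the nltk package itself:

```python
import nltk

# list the directories NLTK searches for data on this node
print(nltk.data.path)

# try to resolve the punkt tokenizer; nltk.data.find raises
# LookupError when the resource is missing from every path
try:
    print(nltk.data.find('tokenizers/punkt'))
except LookupError:
    print('punkt not found on this node')
```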
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
    ("year", "month", "u_id", "objects"))
t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show() # error here !
I get this error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have already downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have also updated the PATH in the Spark environment with this folder location.
Why can't it be found?
The solutions in NLTK. Punkt not found (https://stackoverflow.com/questions/55297145/nltk-punkt-not-found) and How to config nltk data directory from code? (https://stackoverflow.com/questions/3522372/how-to-config-nltk-data-directory-from-code/22987374#22987374) do not work for me.
I have tried adding
nltk.data.path.append('/root/nltk_data/tokenizers/')
but it does not work. It seems nltk cannot see the newly added path!
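One possible cause (an assumption, not something the traceback proves): nltk.data.path.append runs only in the driver process, while the UDF executes inside separate Python worker processes, each with its own copy of nltk.data.path. Note also that entries in nltk.data.path should point at the nltk_data root (the directory containing tokenizers/), not at tokenizers/ itself. A minimal sketch that moves the append inside the function the workers execute (get_keywords_worker and NLTK_DATA_DIR are illustrative names):

```python
import nltk

NLTK_DATA_DIR = '/root/nltk_data'  # the directory containing tokenizers/punkt

def get_keywords_worker(text):
    # executed on each Spark worker, so the path is added in the
    # worker's own interpreter rather than only on the driver
    if NLTK_DATA_DIR not in nltk.data.path:
        nltk.data.path.append(NLTK_DATA_DIR)
    return str(nltk.sent_tokenize(text))
```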
I also copied punkt into a path that nltk will search:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
However, nltk still cannot see it.
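There is a further wrinkle on a multi-node cluster: /root/nltk_data on the driver is a different filesystem from the workers', so copying files on the driver cannot help the workers. A sketch of one way to download punkt on every worker via a throwaway job (assumes a live SparkContext sc; ensure_punkt is an illustrative name):

```python
def ensure_punkt(_):
    # runs on a worker: download punkt into that worker's default
    # nltk_data directory (a no-op if it is already present)
    import nltk
    nltk.download('punkt', quiet=True)
    return True

# run once per default partition across the cluster (uncomment on Databricks):
# sc.parallelize(range(sc.defaultParallelism)).map(ensure_punkt).collect()
```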
Thanks.