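The original question is from 2013. In the meantime, in February 2015, a duplicate or similar question was answered:
[How to reconnect to the PCorpus in the R tm package?](https://stackoverflow.com/questions/28377646/how-to-reconnect-to-the-pcorpus-in-the-r-tm-package) The answer in that post is essential, although quite minimalist, so I'll try to augment it here.

These are some notes I made while working on a similar problem:

Note that the `dbInit()` function is not part of the `tm` package.

First you need to install the `filehash` package, which the `tm` documentation only "suggests" installing. This means it is not a hard dependency of `tm`.

Supposedly, you can also use the `filehashSQLite` package, with `library("filehashSQLite")` instead of `library("filehash")`. Thanks to their object-oriented design, both packages share the same interface and work together seamlessly. So install "filehashSQLite" as well (edit 2016: some functions, such as `tm::content_transformer()`, are not implemented for filehashSQLite).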
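As a minimal sketch of the `filehash` interface on its own, without `tm`: the snippet below creates a small key-value database, stores one value, and reads it back. It assumes `filehash` is installed and uses its default backend rather than SQLite; the database name `filehash_demo` and the key `greeting` are made up for illustration.

```r
# Assumes the filehash package is installed: install.packages("filehash")
library(filehash)

# Hypothetical database name, placed in a temporary directory for this sketch.
db_path <- file.path(tempdir(), "filehash_demo")

# Create the database on first use, then connect to it.
if (!file.exists(db_path)) {
  dbCreate(db_path)  # no type given, so the default backend is used
}
db <- dbInit(db_path)

# Store a value under a key, then check and retrieve it.
dbInsert(db, "greeting", "hi there")
dbExists(db, "greeting")
dbFetch(db, "greeting")
dbList(db)  # all keys currently stored in the database
```

The same four calls (`dbCreate()`, `dbInit()`, `dbInsert()`, `dbFetch()`) are what the `PCorpus` example relies on, only with `"SQLite"` passed as the database type.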
With that in place, this works:

    library(tm)
    suppressMessages(library(filehashSQLite))

    # This string becomes the file name; it must not contain dots.
    # Example: "mydata.sqlite" is not permitted.
    s <- "sqldb_pcorpus_mydata"  # replace "mydata" with something more descriptive

    if (!file.exists(s)) {
      # csv is a data frame of 900 documents, 18 cols/features
      pc <- PCorpus(DataframeSource(csv),
                    readerControl = list(language = "en"),
                    dbControl = list(dbName = s, dbType = "SQLite"))
      # PCorpus() should already have created the database named in dbControl,
      # so this dbCreate() call is probably redundant.
      dbCreate(s, "SQLite")
      db <- dbInit(s, "SQLite")
      set.seed(234)
      # Add another record, just to show we can.
      # key = "test", value = "hi there"
      dbInsert(db, "test", "hi there")
    } else {
      db <- dbInit(s, "SQLite")
      pc <- dbLoad(db)
    }

    show(pc)
    # <<PCorpus>>
    # Metadata: corpus specific: 0, document level (indexed): 0
    # Content: documents: 900

    dbFetch(db, "test")

    # Remove both objects from the session
    rm(db)
    rm(pc)

    # Reload
    db <- dbInit(s, "SQLite")
    pc <- dbLoad(db)

    # The corpus entries are now accessible, but not loaded into memory.
    # The 900 documents are bound via "active bindings", created by
    # makeActiveBinding() from the base package.
    show(pc)
    # [1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
    # ...
    # [883] "883" "884" "885" "886" "887" "888" "889" "890" "891" "892"
    # "893" "894" "895" "896" "897" "898" "899" "900"
    # [901] "test"

    dbFetch(db, "900")
    # <<PlainTextDocument>>
    # Metadata: 7
    # Content: chars: 33

    dbFetch(db, "test")
    # [1] "hi there"

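The active bindings mentioned in the comments can be seen in isolation with a tiny `filehash` database. This is only a sketch under assumptions: it uses the default backend instead of SQLite, the names `filehash_bindings_demo` and `doc1` are hypothetical, and it loads the keys into a separate environment rather than the global one.

```r
# Assumes the filehash package is installed.
library(filehash)

# Hypothetical database with a single entry standing in for a document.
db_path <- file.path(tempdir(), "filehash_bindings_demo")
if (!file.exists(db_path)) {
  dbCreate(db_path)
}
db <- dbInit(db_path)
dbInsert(db, "doc1", "first document")

# dbLoad() creates one active binding per key in the target environment;
# the keys it loaded are captured here in `keys`.
env <- new.env()
keys <- dbLoad(db, env = env)

# The binding is active: the value is read from the database on access,
# not copied into memory up front.
bindingIsActive("doc1", env)
get("doc1", envir = env)
```

This is why `show(pc)` after the reload prints a vector of keys rather than a corpus summary: `pc` holds what `dbLoad()` handed back, while the documents themselves are reachable through the bindings or through `dbFetch()`.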
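This is what the database backend looks like. You can see that the documents from the data frame have been encoded in some way inside the SQLite table.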
This is what my RStudio IDE shows me:
![enter image description here](https://i.stack.imgur.com/0siTr.png)