After updating the tokenizer, when I run this line:

dataset = LineByLineTextDataset(tokenizer=bert_tokenizer, file_path="./some_file.txt", block_size=128,)

it loads forever.
Here is the full code:
from transformers import BertTokenizer, BertForMaskedLM

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Collect every whitespace-separated word in the file as a candidate new token.
new_tokens = []
with open("parsed_data.txt", "r") as text:
    for line in text:
        for word in line.split():
            new_tokens.append(word)

print(len(bert_tokenizer))   # 30522
bert_tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(bert_tokenizer))
print(type(new_tokens))
print(len(new_tokens))       # 53966
print(len(bert_tokenizer))   # 36369

from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=bert_tokenizer,
    file_path="./parsed_data.txt",
    block_size=128,
)
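One plausible explanation (an assumption on my part, not confirmed anywhere in the thread): the pure-Python "slow" `BertTokenizer` handles user-added tokens by checking them against the text one at a time, so tokenization cost grows roughly linearly with the number of added tokens, and ~5,800 extra tokens makes every line very expensive. A toy sketch of that behaviour, independent of transformers, with a hypothetical `split_on_added_tokens` helper:

```python
import time

def split_on_added_tokens(text, added_tokens):
    # Naive linear scan, similar in spirit to how a slow tokenizer might
    # split input on user-added tokens: try each added token in turn.
    parts = [text]
    for tok in added_tokens:
        new_parts = []
        for p in parts:
            new_parts.extend(piece for piece in p.split(tok) if piece)
        parts = new_parts
    return parts

line = "example line of plain text " * 4

# Timing grows with the size of the added-token list.
for n in (100, 1000, 5000):
    added = [f"tok{i}" for i in range(n)]
    start = time.perf_counter()
    split_on_added_tokens(line, added)
    elapsed = time.perf_counter() - start
    print(f"{n} added tokens: {elapsed * 1e3:.2f} ms")
```

If this is the cause, the slowdown would scale with the 5,847 tokens actually added, which could explain why the dataset appears to hang rather than fail.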
The parsed_data.txt file contains plain text.
Someone posted the same question before.
Link: github.com/huggingface/transformers/issues/5944