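I'm trying to load a dataset into pandas and can't get past step 1. I'm new to this, so apologies if the answer is obvious; I searched earlier threads and didn't find one. The data is mostly Chinese characters, which may be the problem.
The .csv is very large and can be found here: http://weiboscope.jmsc.hku.hk/datazip/ (I'm working with week 1).
The code below shows the three encodings I tried, along with my attempt to detect which encoding the file uses.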
import pandas
import chardet
import os
# This is what I tried first
data = pandas.read_csv('week1.csv', encoding="utf-8")
# Raises: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 69: invalid start byte
# Check the encoding with chardet; this reports ascii
n_bytes = min(32, os.path.getsize('week1.csv'))
with open('week1.csv', 'rb') as f:
    raw = f.read(n_bytes)
chardet.detect(raw)
# So I tried ascii. It also fails, which isn't surprising, since ASCII can't represent Chinese characters anyway
data = pandas.read_csv('week1.csv', encoding="ascii")
# Raises: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
# For some reason this loads the data into pandas, but the encoding is clearly wrong: printing the first 5 rows gives gibberish instead of Chinese characters
data = pandas.read_csv('week1.csv', encoding="latin1")
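To see why the utf-8 read fails, it can help to locate the offending bytes before choosing an encoding. Below is a minimal sketch using only the standard library. The function name `find_invalid_utf8_lines` and its `limit` parameter are illustrative and not part of pandas or chardet, and the sketch assumes the file is mostly UTF-8 with a few stray bytes, which is a guess based on the error messages above.

```python
def find_invalid_utf8_lines(path, limit=5):
    """Return up to `limit` tuples of (line_number, byte_offset_in_line, bad_byte).

    Hypothetical helper: reads the file as raw bytes and tries to decode
    each line as UTF-8, recording where decoding first fails on that line.
    """
    problems = []
    with open(path, 'rb') as f:
        # Splitting on b'\n' is safe for UTF-8: 0x0A never occurs inside
        # a multi-byte sequence.
        for line_number, raw_line in enumerate(f, start=1):
            try:
                raw_line.decode('utf-8')
            except UnicodeDecodeError as err:
                # err.start is the offset of the first undecodable byte.
                problems.append((line_number, err.start, raw_line[err.start]))
                if len(problems) >= limit:
                    break
    return problems
```

Calling `find_invalid_utf8_lines('week1.csv')` should return a short list of positions. If only a handful of lines are affected, the file is probably UTF-8 with some corrupted bytes rather than a different encoding altogether.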
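Any help would be greatly appreciated!
EDIT: The answer provided by @Kristof does in fact work, as does the program my colleague put together yesterday: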
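import csv
import pandas as pd

def clean_weiboscope(file, nrows=0):
    """Load the CSV, skipping bytes that are not valid UTF-8. nrows=0 reads every row."""
    res = []
    # errors='ignore' drops undecodable bytes instead of raising UnicodeDecodeError
    with open(file, 'r', encoding='utf-8', errors='ignore', newline='') as f:
        reader = csv.reader(f)  # handles quoted fields that contain commas
        headers = next(reader, [])  # first row holds the column names
        for i, row in enumerate(reader):
            if nrows > 0 and i >= nrows:
                break
            res.append(tuple(row))
    df = pd.DataFrame(res)
    if len(headers) == df.shape[1]:
        df.columns = headers  # use the header row only when the widths match
    return df

my_df = clean_weiboscope('week1.csv', nrows=0)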
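The same idea can probably be expressed more compactly by letting pandas do the parsing: open the file in text mode with `errors='ignore'` and hand the file object to `pandas.read_csv`. This is a sketch under the same assumption (mostly valid UTF-8 with a few bad bytes). The function name `read_csv_ignoring_bad_bytes` is made up for illustration.

```python
import pandas as pd

def read_csv_ignoring_bad_bytes(path, nrows=None):
    """Hypothetical helper: parse a CSV with pandas, dropping invalid UTF-8 bytes.

    nrows=None reads the whole file; an integer limits the number of data rows.
    """
    # errors='ignore' silently discards undecodable bytes, so a few
    # characters may go missing; nothing is re-encoded or repaired.
    with open(path, 'r', encoding='utf-8', errors='ignore', newline='') as f:
        return pd.read_csv(f, nrows=nrows)
```

Something like `read_csv_ignoring_bad_bytes('week1.csv', nrows=1000)` would be a quick way to check that the Chinese text displays correctly before loading the full file. Recent pandas versions also accept an `encoding_errors` argument on `read_csv`, which may make the explicit `open` unnecessary.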
For future searchers, I'd also add that this is the Weiboscope open data from 2012.