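I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some examples of sentences are:
- "Mr Bond, 67, is an engineer in the UK"
- "Amanda B. Bynes, 34, is an actress"
- "Peter Parker (45) will be our next administrator"
- "Mr. Dylan is 46 years old."
- "Steve Jones, Age: 32,"
These are some of the patterns I have found in the dataset. I should add that there are other patterns, but I have not run into them yet and am not sure how to handle them. The code below works reasonably well, but it is inefficient, so running it on the whole dataset would take too long.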

    import re

    import pandas as pd

    # last_name, clean_sec_last_name, full_name, clean_sec_full_name, souptext,
    # filing_year and df_sec_parser are defined earlier in the script.

    # Create a search list of expressions that might come right before an age instance
    age_search_list = [
        " " + last_name.lower().strip() + ", age ",
        " " + clean_sec_last_name.lower().strip() + " age ",
        last_name.lower().strip() + " age ",
        full_name.lower().strip() + ", age ",
        full_name.lower().strip() + ", ",
        " " + last_name.lower() + ", ",
        " " + last_name.lower().strip() + r" \(",  # raw string: escaped literal "("
        " " + last_name.lower().strip() + " is ",
    ]

    # The characters allowed right after an age (comma, dot, closing parenthesis, " y" of "years")
    age_security_check_list = [", ", ". ", ") ", " y"]

    # For each element in our search list
    for element in age_search_list:
        print("Searching: ", element)

        # Retrieve all the instances where we might have an age
        for age_biography_instance in re.finditer(element, souptext.lower()):

            # Extract the next four characters. match.end() is used rather than
            # start + len(element), because the pattern " \(" is one character
            # longer than the text it matches.
            age_instance_start = age_biography_instance.end()
            age_instance_end = age_instance_start + 4
            age_string = souptext[age_instance_start:age_instance_end]

            # Extract what should be the age
            potential_age = age_string[:-2]

            # Extract the next two characters as a security check
            # (i.e. the age should be followed by a comma, a dot, etc.)
            age_security_check = age_string[-2:]

            if age_security_check in age_security_check_list:
                print("Potential age instance found for ", full_name, ": ", potential_age)

                # Check that what we extracted is an age, then convert it to a birth year
                try:
                    potential_age = int(potential_age)
                    print("Potential age detected: ", potential_age)
                    if 18 < potential_age < 100:
                        sec_birth_year = int(filing_year) - potential_age
                        print("Filing year was: ", filing_year)
                        print("Estimated birth year for ", clean_sec_full_name, ": ", sec_birth_year)

                        # Now, we save it in the main dataframe
                        new_sec_parser = pd.DataFrame(
                            [[clean_sec_full_name, "0", "0", sec_birth_year, ""]],
                            columns=['Name', 'Male', 'Female', 'Birth', 'Suffix'])
                        df_sec_parser = pd.concat([df_sec_parser, new_sec_parser])

                except ValueError:
                    print("Problem with extracted age ", potential_age)

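I have a few questions:
- Is there a more efficient way to extract this information?
- Should I use a regex instead?
- My text documents are very long and I have a lot of them. Can I search for all the items in one pass?
- What would be a strategy for detecting other patterns in the dataset?
Some sentences extracted from the dataset:
- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"
- "George F. Rubin (14)(15), 68, has been a trustee since 1997."
- "INDRA K. NOOYI, 56, has been Chief Executive Officer (CEO) of PepsiCo since 2006"
- "Mr. Lovallo, 47, was appointed Treasurer in 2011."
- "Mr. Charles Baker, 79, is a business advisor to biotechnology companies."
- "Mr. Botein, 43, has been a member of our board since our formation."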
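To make the "one search for everything" idea concrete, here is a rough sketch of what a single compiled regex might look like. It is only an illustration under several assumptions: `AGE_PATTERN` and `extract_ages` are names made up for this post, the alternatives cover only the patterns listed above, and unlike my current code it does not anchor on the person's name. The 18 to 100 range check is the same one my code already applies.

```python
import re

# Hypothetical combined pattern: every alternative captures a two-digit number.
# re.VERBOSE lets the pattern span several lines with comments.
AGE_PATTERN = re.compile(
    r"""
      \baged?:?\s*(?P<kw>\d{2})\b               # age 68 / Age: 32 / aged 40
    | ,\s*(?P<comma>\d{2})\s*[,.](?!\d)         # a number set off by commas: , 67,
    | \((?P<paren>\d{2})\)                      # a number in parentheses: (45)
    | \bis\s+(?P<is>\d{2})\s+years?\s+old\b     # is 46 years old
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_ages(text, min_age=18, max_age=100):
    """Return the plausible ages found in text, in order of appearance."""
    ages = []
    for match in AGE_PATTERN.finditer(text):
        # Exactly one named group takes part in each match; lastgroup names it.
        age = int(match.group(match.lastgroup))
        if min_age < age < max_age:
            ages.append(age)
    return ages


# Sample sentences taken from this post.
samples = [
    "Mr Bond, 67, is an engineer in the UK",
    "Peter Parker (45) will be our next administrator",
    "Mr. Dylan is 46 years old.",
    "Steve Jones, Age: 32,",
    "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
    "George F. Rubin (14)(15), 68, has been a trustee since 1997.",
]

for sentence in samples:
    print(extract_ages(sentence), "<-", sentence)
```

One weakness is already visible: footnote markers such as "(14)" are rejected only by the range check, so a marker like "(25)" would be reported as an age. Anchoring the match on the name, as my current code does, might be one way to reduce that.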