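I'm trying to extract email addresses from plain-text transcripts of emails.
I've cobbled together some code that finds the addresses themselves, but I don't know how to make it discriminate between them; right now it just outputs every email address in the file. I'd like it to spit out only the addresses that are preceded by "From:" and a few wildcard characters, and followed by ">" (because the emails are laid out as "From: [name] <[address]>").
Here's the code so far: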
import re  # lets the program use regular expressions

foundemail = []  # empty list that will collect the matches

# I don't know the exact meaning of this expression yet, but I assume
# it means something like "[stuff]@[stuff][1-4 letters]"
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# "line" is set to a single line read from the file ("text.txt")
with open("text.txt") as f:
    for line in f:
        # extend the list with every match mailsrch finds on this line
        foundemail.extend(mailsrch.findall(line))

print(foundemail)
Try this:
>>> from email.utils import parseaddr
>>> parseaddr('From: [email protected]')
('', '[email protected]')
>>> parseaddr('From: Van Gale <[email protected]>')
('Van Gale', '[email protected]')
>>> parseaddr(' From: Van Gale <[email protected]> ')
('Van Gale', '[email protected]')
>>> parseaddr('blah abdf From: Van Gale <[email protected]> and this')
('Van Gale', '[email protected]')
Unfortunately it only finds the first email address on each line, because it expects a header line, but maybe that's OK?
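If you'd rather keep a regex, here is a rough sketch of the filtering the question asks for: keep only lines that start with "From:" and capture whatever sits between "<" and ">". The names `FROM_ADDR`, `sender_addresses` and `all_addresses`, and the example.com addresses, are made up for illustration. The second helper uses `email.utils.getaddresses`, which should cover the case where one header value lists several addresses. Treat this as a starting point under those assumptions, not a full address parser.

```python
import re
from email.utils import getaddresses

# Hypothetical pattern: optional leading whitespace, "From:", any text,
# then an address wrapped in angle brackets.
FROM_ADDR = re.compile(r'^\s*From:.*?<([^<>\s]+@[^<>\s]+)>')


def sender_addresses(lines):
    """Return the angle-bracketed address of every line starting with 'From:'."""
    found = []
    for line in lines:
        match = FROM_ADDR.match(line)
        if match:
            found.append(match.group(1))
    return found


def all_addresses(header_value):
    """Return every address in a comma-separated header value."""
    return [addr for _name, addr in getaddresses([header_value]) if addr]


# Made-up sample lines; in practice these would come from open("text.txt").
sample = [
    "From: Van Gale <van@example.com>",
    "To: Someone <someone@example.com>",
    "Please write to other@example.com",
    "  From: A B <ab@example.org> ",
]
print(sender_addresses(sample))
print(all_addresses("Van Gale <van@example.com>, Other <other@example.com>"))
```

With the sample above, only the two "From:" lines contribute an address; the "To:" line and the bare address in the body text are skipped.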