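You:

- really need to read Section 3 of RFC 3696 (tl;dr: the "@" can appear in more than one place)
- don't seem to have considered that an email address can also look like "[email protected]" or "[email protected]" (i.e., naively assuming there is only ever a bare domain may come back to bite you at some point in your analysis)
- should be aware that if you really are after the email "domain name", you also have to consider what actually constitutes a domain name and a proper suffix.

So, unless you are sure that you have, and always will have, simple email addresses, may I suggest: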
library(stringi)
library(urltools)
library(dplyr)
library(purrr)
emails <- c("[email protected]", "[email protected]",
"[email protected]",
"[email protected]",
"[email protected]")
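# Find the position of the last "@" in each address, then hand everything
# after it (the host) to suffix_extract() and row-bind the results
stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_df(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract()
  })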
##                         host   subdomain     domain suffix
## 1                  gmail.com        <NA>      gmail    com
## 2                hotmail.com        <NA>    hotmail    com
## 3     department.example.com  department    example    com
## 4 yet.another.department.com yet.another department    com
## 5            froodyorg.co.uk        <NA>  froodyorg  co.uk
Note the proper splitting of subdomain, domain and suffix, especially for the last one.
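To see why the suffix matters, here is a minimal base R sketch (not part of the approach above) of what goes wrong if you naively treat the last dot-separated label as the suffix. `naive_suffix` is a hypothetical helper name, and `example.co.uk` is a made-up host used only for illustration.

```r
# Hypothetical helper: keep only the text after the last dot.
# The greedy ".*" runs up to the final "." in each host name.
naive_suffix <- function(host) sub("^.*\\.", "", host)

hosts <- c("gmail.com", "example.co.uk")
naive_suffix(hosts)
## [1] "com" "uk"
# "com" is fine, but "uk" is wrong: the public suffix here is "co.uk",
# which is why a suffix-list-aware function is needed.
```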
Knowing this, we can then change the code to:
# Same split as before, but return "subdomain.domain" (or just the domain
# when there is no subdomain) as a character vector
stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_chr(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract() %>%
      mutate(full_domain=ifelse(is.na(subdomain), domain, sprintf("%s.%s", subdomain, domain))) %>%
      select(full_domain) %>%
      flatten_chr()
  })
## [1] "gmail" "hotmail"
## [3] "department.example" "yet.another.department"
## [5] "froodyorg"
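If pulling in stringi is not an option, the "split at the last @" step on its own can be sketched in base R. This is only a sketch under assumptions: `host_after_last_at` is a hypothetical helper name, the addresses are made up for illustration, and it does none of the suffix handling that suffix_extract() provides.

```r
# Hypothetical helper: return everything after the last "@" in each address.
# The greedy ".*" consumes up to the final "@", so an "@" inside a quoted
# local part (allowed per RFC 3696, Section 3) is not mistaken for the
# separator. Strings without any "@" are returned unchanged.
host_after_last_at <- function(addresses) {
  sub("^.*@", "", addresses)
}

# Made-up addresses for illustration only
example_addresses <- c("user@example.com", "\"odd@name\"@mail.example.org")
host_after_last_at(example_addresses)
## [1] "example.com"      "mail.example.org"
```

The resulting hosts could then be passed to suffix_extract() exactly as in the pipeline above.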