1. The grep function
1) Syntax
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
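The meaning of each argument:
(1) pattern: a character string containing a regular expression that specifies what to search for. When fixed = TRUE, it is treated as a literal string instead.
(2) x: the character vector to search in.
(3) ignore.case: whether to ignore case. FALSE (the default) is case-sensitive; TRUE ignores case.
(4) perl: logical; whether to use Perl-compatible regular expressions.
(5) value: logical. If FALSE, grep returns the indices of the matching elements; if TRUE, it returns the matching elements themselves.
(6) fixed: logical. If TRUE, pattern is matched as a literal string, and any conflicting arguments are overridden.
(7) useBytes: logical. If TRUE, matching is done byte by byte rather than character by character.
(8) invert: logical. If TRUE, grep returns the indices or values of the elements that do not match, i.e. an inverted search.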
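As a rough illustration of how these arguments change the result, here is a minimal sketch. The vector x is made up for this example and is not part of the case study that follows.

```r
# Hypothetical input vector, chosen only to illustrate the arguments
x <- c("Apple", "banana", "a.b", "axb")

idx        <- grep("a", x)                               # 2 3 4 (indices of matches)
idx_nocase <- grep("a", x, ignore.case = TRUE)           # 1 2 3 4 ("Apple" matches too)
vals       <- grep("a", x, value = TRUE)                 # "banana" "a.b" "axb"
regex_hit  <- grep("a.b", x, value = TRUE)               # "a.b" "axb" ("." is a wildcard)
fixed_hit  <- grep("a.b", x, value = TRUE, fixed = TRUE) # "a.b" ("." is a literal period)
non_hit    <- grep("a", x, invert = TRUE, value = TRUE)  # "Apple" (elements that do not match)
```

Comparing regex_hit with fixed_hit shows why fixed = TRUE is convenient when the search string contains regex metacharacters.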
2) Worked example
(1) From gene1 to gene40, extract the genes that end in 3, the genes that do not end in 3, and the genes that end in 3 but are not gene3.
geen = paste0("gene",1:40)
# or: str_c("gene",1:40) # requires library(stringr)
1. Genes containing 3
geen[grep("3",geen)] # or: grep("3", geen, value = TRUE)
# [1] "gene3" "gene13" "gene23" "gene30" "gene31"
# [6] "gene32" "gene33" "gene34" "gene35" "gene36"
# [11] "gene37" "gene38" "gene39"
2. Genes ending in 3
geen[grep("3$",geen)] # or: grep("3$", geen, value = TRUE)
# [1] "gene3" "gene13" "gene23" "gene33"
3. Genes not ending in 3
geen[-grep("3$",geen)] # or: grep("3$", geen, invert = TRUE, value = TRUE)
# [1] "gene1" "gene2" "gene4" "gene5" "gene6"
# [6] "gene7" "gene8" "gene9" "gene10" "gene11"
# [11] "gene12" "gene14" "gene15" "gene16" "gene17"
# [16] "gene18" "gene19" "gene20" "gene21" "gene22"
# [21] "gene24" "gene25" "gene26" "gene27" "gene28"
# [26] "gene29" "gene30" "gene31" "gene32" "gene34"
# [31] "gene35" "gene36" "gene37" "gene38" "gene39"
# [36] "gene40"
4. Genes ending in 3 other than gene3
grep("[0-9]3$",geen,value = TRUE) # or: setdiff(grep("3$", geen, value = TRUE), "gene3")
# [1] "gene13" "gene23" "gene33"
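One caveat about the negative-index form geen[-grep(...)] used above: when the pattern matches nothing, grep returns integer(0), and indexing with -integer(0) selects nothing rather than everything. The sketch below uses a hypothetical pattern "x$" that matches no gene name to show the difference.

```r
geen <- paste0("gene", 1:40)

no_hit <- grep("x$", geen)          # integer(0): nothing ends in "x"

# Negative indexing with an empty index returns an empty vector
wrong <- geen[-no_hit]              # character(0), not all 40 genes

# invert = TRUE and !grepl() both handle the no-match case correctly
right      <- grep("x$", geen, invert = TRUE, value = TRUE)
also_right <- geen[!grepl("x$", geen)]
```

For this reason invert = TRUE (or logical indexing with !grepl) is usually the safer way to select non-matching elements.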
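3) The difference between grep and grepl
1. Syntax
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
2. Return value
grep: searches the vector x for elements that match pattern and returns their indices in x.
grepl: returns a logical vector (TRUE/FALSE) with one entry per element of x, indicating whether that element matches pattern.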
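The two return types can be compared side by side. This is a minimal sketch on a short, made-up vector of gene names:

```r
# Hypothetical three-element vector for illustration
genes <- c("gene3", "gene12", "gene13")

idx <- grep("3$", genes)    # 1 3 (positions of the matches)
hit <- grepl("3$", genes)   # TRUE FALSE TRUE (one flag per element)

n_match  <- sum(hit)        # 2 (logical vectors are easy to count)
same_idx <- which(hit)      # 1 3 (equivalent to grep() here)
subset_a <- genes[idx]      # "gene3" "gene13"
subset_b <- genes[hit]      # "gene3" "gene13"
```

Because grepl always returns a vector as long as x, it fits naturally into filtering conditions such as genes[hit & other_condition], whereas grep is handy when the positions themselves are needed.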
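2. The gsub() function
gsub() replaces every match of a pattern, which makes it useful for deleting, inserting, replacing, and trimming parts of a string. It works on a single string as well as on a character vector.
1. Usage: gsub(pattern, replacement, x)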
text1 <- "ABcdEfgh . ljkl MNNM"
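gsub("Efg","RRR",text1) # replace Efg with RRR; matching is case-sensitive
# [1] "ABcdRRRh . ljkl MNNM"
# Any character can be matched, including spaces, tabs, and newlines
gsub(" l","q",text1) # the space is part of the pattern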
# [1] "ABcdEfgh .qjkl MNNM"
# Every occurrence is replaced, not just the first one
gsub("M","O",text1)
# [1] "ABcdEfgh . ljkl ONNO"
# gsub also supports capture groups and backreferences
gsub("^.*l(j).*$","\\1",text1) # match the whole string and keep only the captured j
# [1] "j"
gsub("^.* ", "a", text1) # replace everything from the start up to the last space with a (note the space before the closing quote of "^.* ")
# [1] "aMNNM"
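gsub(" .*","a",text1) # replace everything from the first space to the end with a
# [1] "ABcdEfgha"
gsub("\\..*","\\+",text1) # the period is a regex metacharacter and must be escaped as \\. in the pattern; in the replacement, "+" alone would also work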
# [1] "ABcdEfgh +"
gsub("\\ ..*","",text1) # delete everything from the first space to the end; the backslash before the space is not needed
# [1] "ABcdEfgh"
gsub("\\.","\\+",text1)
# [1] "ABcdEfgh + ljkl MNNM"
gsub("\\s","a",text1)
# [1] "ABcdEfgha.aljklaMNNM"
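A few variations on the calls above, as a minimal sketch. It reuses text1 from this section; the "hello world" string is a made-up input for the backreference example.

```r
text1 <- "ABcdEfgh . ljkl MNNM"

# fixed = TRUE treats the pattern as a literal string, so "." needs no escaping
plus <- gsub(".", "+", text1, fixed = TRUE)
# "ABcdEfgh + ljkl MNNM"

# Replacing with an empty string deletes every match
no_space <- gsub(" ", "", text1)
# "ABcdEfgh.ljklMNNM"

# Several capture groups can be reordered with \\1, \\2, ...
swapped <- gsub("^(\\w+) (\\w+)$", "\\2 \\1", "hello world")
# "world hello"
```

Using fixed = TRUE avoids escaping mistakes whenever the search text is a plain string rather than a pattern.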
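2. Special characters
Backslash sequences are shown as they are typed inside an R string, with the backslash doubled. POSIX classes such as [:digit:] only work inside a bracket expression, e.g. [[:digit:]].
Syntax Description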
\\d Digit, 0,1,2 ... 9
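\\D Not a digit
\\s Whitespace character (space, tab, newline)
\\S Not a whitespace character
\\w Word character (letter, digit, or underscore)
\\W Not a word character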
\\t Tab
\\n New line
^ Beginning of the string
$ End of the string
\ Escape special characters, e.g. \\ is "\", \+ is "+"
| Alternation, e.g. (e|d)n matches "en" and "dn"
. Any single character (with perl = TRUE, any character except a newline)
[ab] a or b
[^ab] Any character except a and b
[0-9] All Digit
[A-Z] All uppercase A to Z letters
[a-z] All lowercase a to z letters
[A-Za-z] All uppercase and lowercase letters A to Z (avoid [A-z]: in ASCII it also matches [ \ ] ^ _ `)
i+ i at least one time
i* i zero or more times
i? i zero or 1 time
i{n} i occurs n times in sequence
i{n1,n2} i occurs n1 - n2 times in sequence
i{n1,n2}? Non-greedy match: as few repetitions as possible
i{n,} i occurs >= n times
[:alnum:] Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:] Alphabetic characters: [:lower:] and [:upper:]
[:blank:] Blank characters: e.g. space, tab
[:cntrl:] Control characters
[:digit:] Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:] Graphical characters: [:alnum:] and [:punct:]
[:lower:] Lower-case letters in the current locale
[:print:] Printable characters: [:alnum:], [:punct:] and space
[:punct:] Punctuation character: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:] Upper-case letters in the current locale
[:xdigit:] Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
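Some of the entries in the table can be tried out as follows. This is a minimal sketch; the strings are made up for illustration, and perl = TRUE is used for the greedy versus non-greedy comparison as a choice for this example.

```r
# Hypothetical sample string
s <- "Sample_01: pH 7.4, temp 25C"

# POSIX classes must sit inside a bracket expression: [[:digit:]], not [:digit:]
no_digits <- gsub("[[:digit:]]", "", s)
# "Sample_: pH ., temp C"
no_punct <- gsub("[[:punct:]]", "", s)
# "Sample01 pH 74 temp 25C"

# {n} quantifier: exactly two uppercase letters followed by exactly three digits
ids_ok <- grepl("^[A-Z]{2}[0-9]{3}$", c("AB123", "A1234"))
# TRUE FALSE

# Greedy vs non-greedy: .+ takes as much as possible, .+? as little as possible
greedy <- sub("<.+>",  "", "<a><b>", perl = TRUE)   # ""
lazy   <- sub("<.+?>", "", "<a><b>", perl = TRUE)   # "<b>"
```

The last two lines show the effect of the trailing ? on a quantifier: the greedy pattern consumes both tags in one match, while the non-greedy one stops at the first closing bracket.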
3. What is the difference between sub() and gsub()?
text <- c("we are the world", "we are the children")
sub("w", "W", text) # the first sentence contains two w's, but sub() replaces only the first match in each element
# [1] "We are the world" "We are the children"
sub("W","w",text)
# [1] "we are the world" "we are the children"
gsub("W","w",text) # gsub() replaces every match; there is no uppercase W here, so nothing changes
# [1] "we are the world" "we are the children"
gsub("w","W",text)
# [1] "We are the World" "We are the children"
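1. sub() replaces only the first match in each element, whereas gsub() replaces every match.
2. gsub() searches each element of the vector and, if an element matches the pattern at several positions, replaces all of them. grep() also searches each element, but it only reports whether the element matches (returning its index in the vector); it cannot tell you how many times the pattern matched within that element.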
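When the number of matches per element is needed, base R's gregexpr() can supply it. The following is a minimal sketch, not part of the original notes; the third element "hello" is added here only to show the no-match case.

```r
text <- c("we are the world", "we are the children", "hello")

# gregexpr() returns, for each element, the start positions of all matches
# (or -1 when there is no match)
m <- gregexpr("w", text, fixed = TRUE)

# Count the positive positions to get the number of matches per element
counts <- vapply(m, function(pos) sum(pos > 0), integer(1))
# 2 1 0

# regmatches() extracts the matched substrings themselves
found <- regmatches(text, m)
```

Here counts gives 2 for the first sentence, matching the two w's that gsub("w", "W", text) replaced above.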