Using the grep() and gsub() functions in R

2023-11-09

1. The grep function

1) Syntax

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
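The arguments are:
(1) pattern: a character string containing a regular expression that specifies what to search for. When fixed = TRUE, it is treated as a literal string instead.
(2) x: the character vector to search.
(3) ignore.case: whether to ignore case. FALSE (the default) makes matching case-sensitive; TRUE makes it case-insensitive.
(4) perl: logical; whether to use Perl-compatible regular expressions.
(5) value: logical. When FALSE, grep returns the indices of the matching elements; when TRUE, it returns the matching elements themselves.
(6) fixed: logical. When TRUE, pattern is matched as a literal string, and this setting overrides any conflicting arguments.
(7) useBytes: logical. If TRUE, matching is done byte by byte rather than character by character.
(8) invert: logical. If TRUE, grep returns the indices or values of the elements that do not match, i.e. an inverted search.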
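
The effect of the most commonly used arguments can be seen in a minimal sketch. The vector x below is a made-up sample used only for illustration; it does not come from the examples in this post.

```r
# Hypothetical sample vector, used only for illustration
x <- c("Apple", "banana", "a.b", "axb")

idx_default <- grep("a", x)                       # indices of matches (case-sensitive)
val_default <- grep("a", x, value = TRUE)         # the matching elements themselves
val_nocase  <- grep("a", x, ignore.case = TRUE, value = TRUE)  # "Apple" now matches too
val_regex   <- grep("a.b", x, value = TRUE)       # "." matches any single character
val_fixed   <- grep("a.b", x, fixed = TRUE, value = TRUE)      # "." is a literal dot
val_invert  <- grep("a", x, invert = TRUE, value = TRUE)       # elements that do NOT match

print(idx_default)
print(val_fixed)
print(val_invert)
```

Note how fixed = TRUE turns "a.b" from a three-character wildcard pattern into a plain string, so "axb" no longer matches.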

2) Worked example

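(1) From gene1 to gene40, extract the genes that contain a 3, the genes that end in 3, the genes that do not end in 3, and the genes that end in 3 other than gene3.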

geen <- paste0("gene",1:40)
# or str_c("gene",1:40)  # note: requires library(stringr)
# 1. Genes containing a 3
geen[grep("3",geen)]  # or grep("3",geen,value = TRUE)
# [1] "gene3"  "gene13" "gene23" "gene30" "gene31"
# [6] "gene32" "gene33" "gene34" "gene35" "gene36"
# [11] "gene37" "gene38" "gene39"
# 2. Genes ending in 3
geen[grep("3$",geen)] # or grep("3$",geen,value = TRUE)
# [1] "gene3"  "gene13" "gene23" "gene33"
# 3. Genes not ending in 3
geen[-grep("3$",geen)] # or grep("3$",geen,invert = TRUE,value = TRUE), which is safer when nothing matches
# [1] "gene1"  "gene2"  "gene4"  "gene5"  "gene6" 
# [6] "gene7"  "gene8"  "gene9"  "gene10" "gene11"
# [11] "gene12" "gene14" "gene15" "gene16" "gene17"
# [16] "gene18" "gene19" "gene20" "gene21" "gene22"
# [21] "gene24" "gene25" "gene26" "gene27" "gene28"
# [26] "gene29" "gene30" "gene31" "gene32" "gene34"
# [31] "gene35" "gene36" "gene37" "gene38" "gene39"
# [36] "gene40"
# 4. Genes ending in 3 other than gene3
grep("[0-9]3$",geen,value = TRUE)  # or setdiff(grep("3$",geen,value = TRUE),"gene3")
# [1] "gene13" "gene23" "gene33"
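
One caveat about the negative-index form used in step 3: when the pattern matches nothing, grep returns an empty index vector, and negating it selects nothing instead of everything. The sketch below illustrates this with a pattern ("x$") chosen only because it cannot match any of the gene names.

```r
geen <- paste0("gene", 1:40)

no_match <- grep("x$", geen)           # integer(0): nothing ends in "x"
dropped  <- geen[-no_match]            # character(0): every element is lost

# Two forms that behave as intended even when nothing matches
kept_inv <- grep("x$", geen, invert = TRUE, value = TRUE)
kept_lgl <- geen[!grepl("x$", geen)]

print(length(dropped))
print(length(kept_inv))
```

For this reason invert = TRUE or a logical index built with grepl is usually the more robust choice in scripts.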

3) The difference between grep and grepl

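1. Syntax
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

2. Return value
grep: searches the vector x for elements that contain the given pattern and returns their indices in x.
grepl: returns a logical vector (TRUE or FALSE) with one entry per element of x, indicating whether that element contains the pattern.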
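
A small side-by-side sketch of the two return types, using a short made-up vector:

```r
# Hypothetical sample vector, used only for illustration
x <- c("gene3", "gene10", "gene13")

pos <- grep("3$", x)    # integer indices of the matching elements
hit <- grepl("3$", x)   # one TRUE/FALSE per element of x

print(pos)
print(hit)

# The two results carry the same information in different shapes
same_positions <- identical(which(hit), pos)
n_matching     <- sum(hit)   # number of matching elements
matched        <- x[hit]     # logical indexing works directly
```

Because grepl always returns a vector of the same length as x, it tends to be the more convenient choice for filtering rows of a data frame or combining several conditions with & and |.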

2. The gsub() function

gsub() can delete, insert, replace, and trim parts of a string. It works on a single string as well as on a vector of strings.

1. Usage: gsub("pattern", "replacement", x)

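text1 <- "ABcdEfgh . ljkl MNNM"
gsub("Efg","RRR",text1)  # replace Efg with RRR; matching is case-sensitive
# [1] "ABcdRRRh . ljkl MNNM"

# Any character can be matched, including spaces, tabs and newlines
gsub(" l","q",text1)   # the pattern includes a space
# [1] "ABcdEfgh .qjkl MNNM"

# Every occurrence of the pattern is replaced, not just the first
gsub("M","O",text1)
# [1] "ABcdEfgh . ljkl ONNO"

# gsub also accepts full regular expressions, including capture groups
gsub("^.*l(j).*$","\\1",text1)  # keep only the captured j
# [1] "j"

gsub("^.* ", "a", text1)  # replace everything from the start through the last space with a (note the space before the closing quote in "^.* ")
# [1] "aMNNM"

gsub(" .*","a",text1)  # replace everything from the first space to the end with a
# [1] "ABcdEfgha"

gsub("\\..*","\\+",text1)  # . is special in a pattern and must be escaped as \\.; in the replacement + is not special, so "+" works as well as "\\+"
# [1] "ABcdEfgh +"

gsub("\\ ..*","",text1)  # delete everything from the first space onward
# [1] "ABcdEfgh"

gsub("\\.","\\+",text1)  # replace the literal dot with +
# [1] "ABcdEfgh + ljkl MNNM"
gsub("\\s","a",text1)  # replace every whitespace character with a
# [1] "ABcdEfgha.aljklaMNNM"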
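
The examples above all operate on a single string. As a rough sketch of the vector case and of an alternative to escaping, the block below applies gsub to a made-up vector of file names (ids is hypothetical) and uses fixed = TRUE to treat the dot literally.

```r
text1 <- "ABcdEfgh . ljkl MNNM"

# Hypothetical file names, used only for illustration
ids <- c("sample_01.txt", "sample_02.txt")

# gsub is vectorised: each element is processed independently
no_ext  <- gsub("\\.txt$", "", ids)
# Two capture groups, swapped in the replacement via \\2 and \\1
swapped <- gsub("^([a-z]+)_([0-9]+)$", "\\2_\\1", no_ext)

# An unescaped "." is a wildcard and matches every character
regex_dot <- gsub(".", "+", "a.b")
# fixed = TRUE matches the pattern literally, so no escaping is needed
fixed_dot <- gsub(".", "+", text1, fixed = TRUE)

print(no_ext)
print(swapped)
print(regex_dot)
print(fixed_dot)
```

When the pattern is a plain string rather than a regular expression, fixed = TRUE avoids the escaping questions altogether.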

2. Special characters

Syntax	Description
\\d	Digit, 0,1,2 ... 9
\\D	Not Digit
\\s	Whitespace character
\\S	Not a whitespace character
\\w	Word character (letter, digit or underscore)
\\W	Not a word character
\\t	Tab
\\n	New line
^	Beginning of the string
$	End of the string
\	Escape special characters, e.g. \\ is "\", \+ is "+"
|	Alternation match. e.g. /(e|d)n/ matches "en" and "dn"
.	Any character, except \n or line terminator
[ab]	a or b
[^ab]	Any character except a and b
[0-9]	All Digit
[A-Z]	All uppercase A to Z letters
[a-z]	All lowercase a to z letters
[A-Za-z]	All uppercase and lowercase letters A to Z (the range [A-z] would also match [ \ ] ^ _ `)
i+	i at least one time
i*	i zero or more times
i?	i zero or 1 time
i{n}	i occurs n times in sequence
i{n1,n2}	i occurs n1 - n2 times in sequence
i{n1,n2}?	non-greedy match: matches as few repetitions as possible
i{n,}	i occurs >= n times
[:alnum:]	Alphanumeric characters: [:alpha:] and [:digit:] (this and the classes below must be used inside a bracket expression, e.g. [[:alnum:]])
[:alpha:]	Alphabetic characters: [:lower:] and [:upper:]
[:blank:]	Blank characters: e.g. space, tab
[:cntrl:]	Control characters
[:digit:]	Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:]	Graphical characters: [:alnum:] and [:punct:]
[:lower:]	Lower-case letters in the current locale
[:print:]	Printable characters: [:alnum:], [:punct:] and space
[:punct:]	Punctuation character: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:]	Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:]	Upper-case letters in the current locale
[:xdigit:]	Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
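
A few entries from the table in action, as a minimal sketch. The sample strings are made up, and perl = TRUE is used for the range and greedy/non-greedy examples only as an assumption to keep the results independent of locale and regex engine.

```r
# Hypothetical sample strings, used only for illustration
x <- c("abc123", "A_b", "2023-11-09")

digits_only <- gsub("\\D", "", x)            # drop every non-digit
has_digit   <- grepl("[[:digit:]]", x)       # POSIX class inside a bracket expression
has_space   <- grepl("[[:space:]]", c("a b", "ab"))

# [A-z] spans the ASCII characters between Z and a, so it matches "_"
underscore_in_range   <- grepl("[A-z]", "_", perl = TRUE)
underscore_in_letters <- grepl("[A-Za-z]", "_", perl = TRUE)

# Greedy vs non-greedy quantifier
greedy <- sub("a.*c", "", "xabcabc", perl = TRUE)    # matches "abcabc"
lazy   <- sub("a.*?c", "", "xabcabc", perl = TRUE)   # matches only the first "abc"

print(digits_only)
print(has_digit)
print(c(greedy = greedy, lazy = lazy))
```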

3. What is the difference between sub() and gsub()?

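text <- c("we are the world", "we are the children")
sub("w", "W", text)  # the first sentence contains two w's, but sub() replaces only the first one
# [1] "We are the world"    "We are the children"
sub("W","w",text)  # text contains no uppercase W, so nothing changes
# [1] "we are the world"    "we are the children"
gsub("W","w",text) # gsub() replaces every match, but again there is no W to replace
# [1] "we are the world"    "we are the children"
gsub("w","W",text)  # both w's in the first sentence are replaced
# [1] "We are the World"    "We are the children"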

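1. sub() and gsub() differ in that the former replaces only the first match in each string, while the latter replaces every match.
2. gsub() searches each element of a vector and, if an element matches the pattern in several places, replaces all of them. grep() also searches each element of the vector, but it only reports whether an element matches (returning that element's index in the vector); it cannot tell how many times the pattern occurs within the element.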
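
When the number of matches per element is needed, one possible approach (not covered in the original post) is to combine the base R functions gregexpr and regmatches. A minimal sketch:

```r
text <- c("we are the world", "we are the children")

# gregexpr finds every match position per element;
# regmatches extracts the matched substrings as a list
m   <- gregexpr("w", text)
n_w <- lengths(regmatches(text, m))            # matches per element

# Elements without any match yield a count of zero
n_z <- lengths(regmatches(text, gregexpr("z", text)))

print(n_w)
print(n_z)
```

Here n_w shows that the first sentence contains two w's and the second only one, which is exactly the information grep alone cannot provide.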
