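This need comes up a lot. When processing data we cannot anticipate every kind of value a data frame may hold: it may contain NA, NaN or Inf, or even stray strings. Such values cause unexpected errors in both statistical analysis and plotting. To avoid this, it is worth removing the rows that contain these special values before analysing the data.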
    # If the data contain only NA/Inf/NaN, the columns are still read in as numeric
    d <- data.frame(x=c(NA,2,3,Inf,-Inf,NaN),y=c(1,Inf,6,NA,4,NaN))
    > d
         x   y
    1   NA   1
    2    2 Inf
    3    3   6
    4  Inf  NA
    5 -Inf   4
    6  NaN NaN
    > str(d)
    'data.frame':   6 obs. of  2 variables:
     $ x: num  NA 2 3 Inf -Inf ...
     $ y: num  1 Inf 6 NA 4 ...
Removing them one at a time:
    > d[!is.na(d$x),] # drops NA and NaN
         x   y
    2    2 Inf
    3    3   6
    4  Inf  NA
    5 -Inf   4
    > d[!is.nan(d$x),] # drops NaN
         x   y
    1   NA   1
    2    2 Inf
    3    3   6
    4  Inf  NA
    5 -Inf   4
    > d[!is.infinite(d$x),] # drops Inf and -Inf
        x   y
    1  NA   1
    2   2 Inf
    3   3   6
    6 NaN NaN
Removing them all at once:
    > d[is.finite(d$x),] # drops Inf, NA and NaN; recommended
      x   y
    2 2 Inf
    3 3   6
    > d[!is.na(d$x)&!is.nan(d$x)&!is.infinite(d$x),]
      x   y
    2 2 Inf
    3 3   6
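The single-column filter can be wrapped in a small function. The following is only a sketch: `keep_finite` is a hypothetical name, not something from base R or a package. It uses `drop = FALSE` so that the result stays a data frame even when the input has only one column.

```r
# Hypothetical helper: keep the rows where column `col` holds a finite number,
# i.e. drop rows where it is NA, NaN, Inf or -Inf.
keep_finite <- function(df, col) {
  df[is.finite(df[[col]]), , drop = FALSE]
}

d <- data.frame(x = c(NA, 2, 3, Inf, -Inf, NaN),
                y = c(1, Inf, 6, NA, 4, NaN))

res <- keep_finite(d, "x")
print(res)
```

Only column `x` is checked here, so a row can still carry `Inf` in `y`.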
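What if, besides these three, the column also contains other stray characters? In that case the column is no longer numeric. In older versions of R it is read in as a factor by default; since R 4.0.0 strings stay character unless stringsAsFactors=TRUE is set, so it is set explicitly here to reproduce the output. For example:
    d <- data.frame(x=c(NA,2.0,3.3,0.2,4,Inf,NaN,"*","$","#"),y=c(1,NA,4,"*",'&',2,3,4,2,1),stringsAsFactors=TRUE)
    > d
          x    y
    1  <NA>    1
    2     2 <NA>
    3   3.3    4
    4   0.2    *
    5     4    &
    6   Inf    2
    7   NaN    3
    8     *    4
    9     $    2
    10    #    1
    > str(d)
    'data.frame':   10 obs. of  2 variables:
     $ x: Factor w/ 9 levels "#","$","*","0.2",..: NA 5 6 4 7 8 9 3 2 1
     $ y: Factor w/ 6 levels "&","*","1","2",..: 3 NA 6 2 1 4 5 6 4 3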
Removing NA still works as before:
    > d[!is.na(d$x),]
         x    y
    2    2 <NA>
    3  3.3    4
    4  0.2    *
    5    4    &
    6  Inf    2
    7  NaN    3
    8    *    4
    9    $    2
    10   #    1
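NaN and Inf, however, are no longer caught, because is.nan and is.infinite only recognise numeric values. Here "NaN" and "Inf" are just strings stored as factor labels, so no row is dropped for them (is.finite only drops the real NA in row 1):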
    > d[!is.nan(d$x),]
          x    y
    1  <NA>    1
    2     2 <NA>
    3   3.3    4
    4   0.2    *
    5     4    &
    6   Inf    2
    7   NaN    3
    8     *    4
    9     $    2
    10    #    1
    > d[!is.infinite(d$x),]
          x    y
    1  <NA>    1
    2     2 <NA>
    3   3.3    4
    4   0.2    *
    5     4    &
    6   Inf    2
    7   NaN    3
    8     *    4
    9     $    2
    10    #    1
    > d[is.finite(d$x),]
         x    y
    2    2 <NA>
    3  3.3    4
    4  0.2    *
    5    4    &
    6  Inf    2
    7  NaN    3
    8    *    4
    9    $    2
    10   #    1
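If you insist on doing it this way, convert the type first. Note that a factor has to go through character on its way to numeric; calling as.numeric directly on a factor returns its internal integer codes instead of the values.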
    > d[!is.nan(as.numeric(as.character(d$x))),]
          x    y
    1  <NA>    1
    2     2 <NA>
    3   3.3    4
    4   0.2    *
    5     4    &
    6   Inf    2
    8     *    4
    9     $    2
    10    #    1
    Warning message:
    In `[.data.frame`(d, !is.nan(as.numeric(as.character(d$x))), ) :
      NAs introduced by coercion
    > d[!is.infinite(as.numeric(as.character(d$x))),]
          x    y
    1  <NA>    1
    2     2 <NA>
    3   3.3    4
    4   0.2    *
    5     4    &
    7   NaN    3
    8     *    4
    9     $    2
    10    #    1
    Warning message:
    In `[.data.frame`(d, !is.infinite(as.numeric(as.character(d$x))), :
      NAs introduced by coercion
    > d[is.finite(as.numeric(as.character(d$x))),]
        x    y
    2   2 <NA>
    3 3.3    4
    4 0.2    *
    5   4    &
    Warning message:
    In `[.data.frame`(d, is.finite(as.numeric(as.character(d$x))), ) :
      NAs introduced by coercion
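When the coercion to NA is intended, the warning can be silenced and the conversion done once. The sketch below assumes that every unparseable entry should be treated as missing. `to_num` is a hypothetical helper name, and `stringsAsFactors = TRUE` is set only to get the factor column discussed above.

```r
# stringsAsFactors = TRUE gives the factor column discussed in the text
d <- data.frame(x = c(NA, 2.0, 3.3, 0.2, 4, Inf, NaN, "*", "$", "#"),
                y = c(1, NA, 4, "*", "&", 2, 3, 4, 2, 1),
                stringsAsFactors = TRUE)

# Hypothetical helper: factor/character -> numeric via character.
# Entries that cannot be parsed become NA; the coercion warning is
# suppressed because that is exactly the behaviour we want here.
to_num <- function(v) suppressWarnings(as.numeric(as.character(v)))

x_num <- to_num(d$x)
res <- d[is.finite(x_num), , drop = FALSE]
print(res)
```

Suppressing the warning hides real parsing problems as well, so this only makes sense when dropping those rows is the goal.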
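As the warning shows, the stray characters become NA when coerced to numeric. With a large data set we cannot know what other oddities are hiding in the data, so another option is to keep only the entries made up purely of digits. This excludes NA, Inf and NaN, but note that the pattern used below only matches whole non-negative numbers, so decimals such as 3.3 and 0.2 are dropped as well: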
    t <- grep("^\\d+$",as.character(d$x))
    # as.numeric(as.character(d$x[t]))
    d[t,] # still a factor here; convert to numeric as needed
    > t
    [1] 2 5
    > d[t,]
      x    y
    2 2 <NA>
    5 4    &
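If decimals and negative numbers should be kept too, the pattern can be widened. This is a sketch under the assumption that values are written in plain decimal notation; it does not match scientific notation such as 1e5 or a leading plus sign. `is_plain_number` is a hypothetical name.

```r
d <- data.frame(x = c(NA, 2.0, 3.3, 0.2, 4, Inf, NaN, "*", "$", "#"),
                y = c(1, NA, 4, "*", "&", 2, 3, 4, 2, 1),
                stringsAsFactors = TRUE)

# Hypothetical helper: TRUE for strings such as "4", "-4", "0.2", "3.3".
# NA, "Inf", "NaN" and stray symbols give FALSE.
is_plain_number <- function(v) {
  grepl("^-?\\d+(\\.\\d+)?$", as.character(v))
}

keep <- is_plain_number(d$x)
res <- d[keep, , drop = FALSE]

# The kept column is still a factor; convert it when numbers are needed
x_num <- as.numeric(as.character(res$x))
print(res)
```

Because every kept string is a valid number, the final conversion produces no coercion warning.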
What if we want to remove the rows containing non-numeric values across the whole data frame?
    d <- data.frame(x=c(NA,2,3,Inf,-Inf,NaN),y=c(1,Inf,6,NA,4,NaN))
    > na.omit(d) # drops NA and NaN only; Inf and -Inf remain
         x   y
    2    2 Inf
    3    3   6
    5 -Inf   4
    > d[!is.nan(rowSums(d)),]
         x   y
    1   NA   1
    2    2 Inf
    3    3   6
    4  Inf  NA
    5 -Inf   4
    > d[!is.infinite(rowSums(d)),] # why is a row with Inf still there? In row 4, Inf + NA is NA, which is not infinite
        x   y
    1  NA   1
    3   3   6
    4 Inf  NA
    6 NaN NaN
    > d[is.finite(rowSums(d)),] # drops Inf, NA and NaN; recommended
      x y
    3 3 6
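The rowSums approach relies on arithmetic propagating the special values. An alternative that tests each column directly might look like the sketch below, which assumes every column is numeric. `ok` is just an illustrative variable name.

```r
d <- data.frame(x = c(NA, 2, 3, Inf, -Inf, NaN),
                y = c(1, Inf, 6, NA, 4, NaN))

# One logical vector per column, combined with a row-wise AND:
# a row is kept only if every column holds a finite value.
ok <- Reduce(`&`, lapply(d, is.finite))

res <- d[ok, , drop = FALSE]
print(res)
```

This gives the same row as `d[is.finite(rowSums(d)),]` for this data, without computing any sums.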
Alternatively, use NaRV.omit from the R package IDPmisc:
    > require(IDPmisc)
    > NaRV.omit(d)
      x y
    3 3 6
A quick version I wrote myself:
    > index <- apply(d,1,function(x){grepl("^\\d+$",as.character(x))})
    > index
          [,1]  [,2] [,3]  [,4]  [,5]  [,6]
    [1,] FALSE  TRUE TRUE FALSE FALSE FALSE
    [2,]  TRUE FALSE TRUE FALSE  TRUE FALSE
    > d[apply(index,2,function(x)all(x)),]
      x y
    3 3 6
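For a data frame whose columns are factors or strings rather than numbers, the type conversion and the finiteness check can be combined. The following is a rough sketch, not a tested general-purpose tool: `clean_rows` is a hypothetical name, and it assumes that any entry which cannot be parsed as a number should cause its row to be dropped.

```r
# Hypothetical helper: keep the rows in which every column parses to a
# finite number. Works on numeric, character and factor columns alike.
clean_rows <- function(df) {
  nums <- lapply(df, function(v) suppressWarnings(as.numeric(as.character(v))))
  keep <- Reduce(`&`, lapply(nums, is.finite))
  df[keep, , drop = FALSE]
}

d <- data.frame(x = c(NA, 2.0, 3.3, 0.2, 4, Inf, NaN, "*", "$", "#"),
                y = c(1, NA, 4, "*", "&", 2, 3, 4, 2, 1),
                stringsAsFactors = TRUE)

res <- clean_rows(d)
print(res)
```

The returned columns keep their original type, so they still need converting before numeric use. Unlike the digits-only pattern, this version keeps decimal values.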
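    # one numeric column
    d[is.finite(d$x),]
    # one factor or character column
    d[is.finite(as.numeric(as.character(d$x))),]
    # whole data frame, all columns numeric
    d[is.finite(rowSums(d)),]
    IDPmisc::NaRV.omit(d)
    # whole data frame, keeping only digits-only entries
    index <- apply(d,1,function(x){grepl("^\\d+$",as.character(x))})
    d[apply(index,2,function(x)all(x)),]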
Ref: https://stackoverflow.com/questions/15773189/remove-na-nan-inf-in-a-matrix https://www.thinbug.com/q/25276155
Author: Bioinfarmer