[R]无监督聚类-ConsensusCluster & NMF

Beccaliii

沉迷潜水各种意义上的:)

针对基因表达矩阵或者芯片数据，可以基于表达水平对数据进行无监督聚类分析。完成这个功能的算法有很多种，如非负矩阵分解(nonnegative matrix factorization)，一致性聚类(consensus clustering), 自组织映射(SOM clustering)等，文章 Metagenes and molecular pattern discovery using matrix factorization 介绍和比较了几种算法，文章总结NMF算法更优，但其他几种算法针对不同特征的数据也各有偏好，可以多了解一下算法再自行选择。

之前看一篇文章内容中涉及NMF算法而代码用的是ConsensusClusterPlus包，未确认 算法文章 就认为是用了同一个算法，混淆了算法和对应R包，谢谢评论指正。以此为鉴，算法千千万，还是要再仔细谨慎一点。

输入数据格式

- nonnegative matrix factorization

选择cophenetic下降最快的前一点做较优rank（第二个参数）。

library(NMF)
library(doMPI) #rank设定4以上需要用这个包
#选定rank
# compute NMF for each method， rank设定为2:7 （2，3，4，5，6，7都试一遍一起比较）
res <- nmf(dat,2:7,nrun=10)
plot(res)

#选用较优rank再运行一次NMF
res <- nmf(dat,5,nrun=10)
coefmap(res)
consensusmap(res)

这里的选择用默认设置的热图来展示结果，详细的还有很多参数可以优化结果输出形式，这一步还没有太理解，仅放图做参考哈。

NMF包中包含不同的算法method（第三个参数），默认使用brunet，可以比较算法择优录用

res.multi.method <- nmf(dat, 3, list('brunet', 'lee', 'ns'), seed=222)
compare(res.multi.method)

- consensus clustering

根据consensus10的图，选择转向平缓的拐点对应的x作为最优k（分组数）。所有直接输出到 output/ （step4）文件夹下。要注意调整step 2 中运用的基因数目，如果运用基因过多可能效果适得其反。

###################################################
### step 1: ConsensusClusterPlus.
###################################################
info <- as.data.frame(read.table("data/exprSet_symbol.csv",header = TRUE,sep = ",", dec = ".",na.strings = "NA",stringsAsFactors=FALSE,check.names = FALSE))
#row.names(info) <- info[,1]
#info<-as.data.frame(info[,-1])
#remove outlier if necessary
info<-info[,!colnames(info) %in% c("sample20")]
dat=info
###################################################
### step 2: ConsensusClusterPlus.
###################################################
d=dat
gene_e=8500 # 要提取的基因数，一般是70%左右
#按需选出决定分组的基因集
mads=apply(d,1,mad)
d=d[rev(order(mads))[1:gene_e],]
###################################################
### step 3: ConsensusClusterPlus.
###################################################
d = sweep(d,1, apply(d,1,median,na.rm=T))
###################################################
### step 4: ConsensusClusterPlus.
###################################################
library(ConsensusClusterPlus)
#title=tempdir()
#最大设置6个分组，重复1000次，结果输出到文件夹output中
results = ConsensusClusterPlus(as.matrix(d),maxK=6,reps=1000,pItem=0.8,pFeature=1,
title="output",
clusterAlg="hc",distance="pearson",seed=12621,plot="png",writeTable=F)
###################################################
### step 5: ConsensusClusterPlus.
###################################################
#cat(sprintf("\\graphicspath{{%s}}", paste(gsub("[\\]","/",title),"/",sep="")))
#cat("\n")
###################################################
### step 6: ConsensusClusterPlus.
###################################################
#consensusMatrix - the consensus matrix.  
#For .example, the top five rows and columns of results for k=2:
results[[3]][["consensusMatrix"]][1:5,1:5]
#consensusTree - hclust object 
results[[3]][["consensusTree"]]
#consensusClass - the sample classifications
clu3=as.data.frame(results[[3]][["consensusClass"]])
colnames(clu3)="cluster3"