相比version2,version3数据库拓了很多,还有3支持自建数据库了。
Github: https://github.com/biobakery/MetaPhlAn
Manual: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0
Wiki: https://github.com/biobakery/biobakery/wiki/metaphlan3MetaPhlAn依赖~1.1M的clade-specific的marker genes,
mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
这些marker genes由~100,000 reference genomes (~99,500 bacterial and archaeal and ~500 eukaryotic)确定。该数据可用于物种分类,株水平分析,定量,快速计算。安装 - manual 给的方法
conda create --name mpa -c bioconda python=3.7 metaphlan conda activate mpa
不出意外的话,会报错
undefined symbol: _ZN3tbb10interface78internal15task_arena_base19internal_initializeEv
bowtie2-build-s: symbol lookup error, undefined symbol
安装 - 针对tbb版本问题
conda create -n metaphlan python=3.7 conda activate metaphlan conda install tbb=2020.2 conda install bowtie2 conda install -c bioconda metaphlan metaphlan --version # MetaPhlAn version 3.0.13 (27 Jul 2021) bowtie2 --version # bowtie2-align-s version 2.4.4 # Compiler: gcc version 9.3.0 Zenodo
Google Driver 文件列表:
bcftools https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACPUjNhnsIIJn16-ww6-MWOa/bcftools?dl=1 metaphlan2_homebrew_counter.txt https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAD6ImeA91we2nBWBhpfaEOqa/metaphlan2_homebrew_counter.txt?dl=1 mpa_latest https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAyoJpOgcjop41VIHAGWIVLa/mpa_latest?dl=1 mpa_v20_m200_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABKU_RAK5yOzyhV27NpOduDa/mpa_v20_m200_marker_info.txt.bz2?dl=1 mpa_v20_m200.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADS8nukx3dSoiR82OHw6dOka/mpa_v20_m200.md5?dl=1 mpa_v20_m200.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAASBOj-2gAbA53cV1bXBULYa/mpa_v20_m200.tar?dl=1 mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADhE2Ur7JqirifOdgi4fGQEa/mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1 mpa_v29_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADdxAWsjLPLjy10VICgSAEPa/mpa_v29_CHOCOPhlAn_201901.md5?dl=1 mpa_v29_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAACeczBU6P9lIBD4ZYtxwKva/mpa_v29_CHOCOPhlAn_201901.tar?dl=1 mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADoxBNynVopWt2shYYKQ2Mba/mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1 mpa_v292_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAApZLRLD0Bkb86bvH-Y3-tUa/mpa_v292_CHOCOPhlAn_201901.md5?dl=1 mpa_v292_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABK4Zns2PYUh_R-sLGEx_Bza/mpa_v292_CHOCOPhlAn_201901.tar?dl=1 mpa_v293_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAsg2PN5Ng6uwHnEemlqo3-a/mpa_v293_CHOCOPhlAn_201901.md5?dl=1 mpa_v293_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABFg8C3gMyNpYNTAY5PcSONa/mpa_v293_CHOCOPhlAn_201901.tar?dl=1 mpa_v294_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA6KBc-nd8_C4bJOPVjzWW7a/mpa_v294_CHOCOPhlAn_201901.md5?dl=1 mpa_v294_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAJSiG2qkfNMcYzhvz4Rxsga/mpa_v294_CHOCOPhlAn_201901.tar?dl=1 mpa_v295_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACIWLuO_0ixZsxiSZZG-4M1a/mpa_v295_CHOCOPhlAn_201901.md5?dl=1 mpa_v295_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA0y5qvPjJdxVw86Ovjinipa/mpa_v295_CHOCOPhlAn_201901.tar?dl=1 mpa_v296_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAbRMEnsewv_pHFsLv2Be0va/mpa_v296_CHOCOPhlAn_201901.md5?dl=1 mpa_v296_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABAD51gd3wr0___2MWO0xD-a/mpa_v296_CHOCOPhlAn_201901.tar?dl=1 mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAv_ShZiz7pNTaT_YONJTF7a/mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1 mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAlyQITZuUCtBUJxpxhIroIa/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1 mpa_v30_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADlxibskzbPHPoDl6S-FyKka/mpa_v30_CHOCOPhlAn_201901.tar?dl=1 mpa_v30_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACTzoUYDqZps8u2JqWCNCODa/mpa_v30_CHOCOPhlAn_201901.md5?dl=1 SRS019033.fastq https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACh1NQExDk39RXzZOyTPmQwa/SRS019033.fastq?dl=1 strainphlan_homebrew_counter.txt https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAZO5uydQ2QQkOSkYu5DxRha/strainphlan_homebrew_counter.txt?dl=1
Win下载:marker genes数据
# 解压 15M -> 336M bzip2 -d mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2 less -S mpa_v30_CHOCOPhlAn_201marker_info.txt | wc -l # 1132198 # 113万行
一个物种有多个marker genes,因此总基因大于物种数
# 剖开看看 39444__GeneID:11747987 {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le 39444__GeneID:11747985 {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le 39444__GeneID:11747982 {'ext': ['PRJNA15142', 'PRJNA14595', 'PRJNA20981'], 'sco 39444__GeneID:11747981 {'ext': [], 'score': 0.0, 'clade': 's__Cotia_virus', 'le
bzip2压缩了20倍,牛皮呀呀呀呀呀呀
Win下载:基因组数据
tar -xvf mpa_v30_CHOCOPhlAn_201901.tar rm mpa_v30_CHOCOPhlAn_201901.tar bzip2 -d mpa_v30_CHOCOPhlAn_201901.fna.bz2
# 敲开kankan>1000373__GeneID:11569613 less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | head # 是基因的序列信息 >1000373__GeneID:11604940 >1000373__GeneID:11604942 >1000373__GeneID:11604944 >100053__V6HSA9__LEP1GSC062_4244 UniRef90_V6HSA9;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815 >100053__V6HSV5__LEP1GSC062_0144 UniRef90_V6HSV5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815 >100053__V6HSX2__LEP1GSC062_1624 UniRef90_V6HSX2;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815 >100053__V6HT46__LEP1GSC062_1702 UniRef90_V6HT46;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815 >100053__V6HT56__LEP1GSC062_1778 UniRef90_V6HT56;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815 >100053__V6HTZ5__LEP1GSC062_0133 UniRef90_V6HTZ5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | awk -F" " '{print $2}' | sed -e '/^$/d' | awk -F";" '{print $2}' | uniq | wc -l # 10164 行(物种)
bowtie2建数据库索引
bowtie2-build --threads 8 \ -f mpa_v30_CHOCOPhlAn_201901.fna \ mpa_v30_CHOCOPhlAn_201901 source /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/software/miniconda3/etc/profile.d/conda.sh conda activate metaphlan route="/hwfssz5/ST_META/TMP1T.2/soil" metaphlan3db="/hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/metaphlan_v30/" mkdir $route/Result/metaphlan3/$infile cat $route/Result/kneaddata/$infile/${infile}_1_kneaddata.repeats.removed.*.fastq > $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq metaphlan $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq \ --input_type fastq \ --bowtie2db $metaphlan3db \ -x mpa_v30_CHOCOPhlAn_201901 \ --nproc 8 \ -o $route/Result/metaphlan3/$infile/${infile}_metaphlan3.out rm $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq
--input_type {fastq,fasta,bowtie2out,sam}
--bowtie2db METAPHLAN_BOWTIE2_DB
-x INDEX Specify the id of the database version to use
--bt2_ps BowTie2 presets [default very-sensitive]
--add_viruses Allow the profiling of viral organisms
--ignore_eukaryotes Do not profile eukaryotic organisms
--ignore_bacteria Do not profile bacterial organisms
--ignore_archaea Do not profile archeal organisms
--nproc N The number of CPUs to use for parallelizing the mapping [default 4]qsub提交任务:
route="/hwfssz5/ST_META/TMP1T.2/soil" qsub -cwd -l vf=80G,p=8 -q st_supermem.q -P st_supermem -binding linear:8 $route/script/q_metaphlan3.sh -i 49
conda activate metaphlan python /route/metaphlan/bin/merge_metaphlan_tables.py \ -o ./abundance/abundance_merged.txt \ ./abundance/*.out
unifrac样本距离
下载metaphlan nwk文件:
https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/mpa_v30_CHOCOPhlAn_201901_species_tree.nwkR安装依赖:
conda activate r411 conda install r-vegan r-ape r-rbiom
获取计算R脚本:
https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/calculate_unifrac.R计算距离:
Rscript $unifrac_route/calculate_unifrac.R \ abundance_merged.txt \ $nwk_route/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk \ abundance_merged_unifrac_profiles.txt
警告
nwk信息不全WARNING: 4 species not present in the tree were removed from the input profile. Nitrolancea_hollandica Singulisphaera_acidiphila Leptothrix_ochracea Sutterella_parvirubra
hclust2.py画热图
conda activate -c bioconda hclust2
hclust2.py \ -i abundance_merged.txt \