MetaPhlAn3宏基因组物种注释

相比version2,version3数据库拓了很多,还有3支持自建数据库了。

Github: https://github.com/biobakery/MetaPhlAn
Manual: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0
Wiki: https://github.com/biobakery/biobakery/wiki/metaphlan3

MetaPhlAn依赖~1.1M的clade-specific的marker genes,
mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
这些marker genes由~100,000 reference genomes (~99,500 bacterial and archaeal and ~500 eukaryotic)确定。该数据可用于物种分类,株水平分析,定量,快速计算。

安装 - manual 给的方法

conda create --name mpa -c bioconda python=3.7 metaphlan
conda activate mpa

不出意外的话,会报错

undefined symbol: _ZN3tbb10interface78internal15task_arena_base19internal_initializeEv

bowtie2-build-s: symbol lookup error, undefined symbol

安装 - 针对tbb版本问题

conda create -n metaphlan python=3.7
conda activate metaphlan
conda install tbb=2020.2
conda install bowtie2
conda install -c bioconda metaphlan
metaphlan --version
# MetaPhlAn version 3.0.13 (27 Jul 2021)
bowtie2 --version
# bowtie2-align-s version 2.4.4
# Compiler: gcc version 9.3.0
Zenodo

Google Driver 文件列表:

bcftools    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACPUjNhnsIIJn16-ww6-MWOa/bcftools?dl=1
metaphlan2_homebrew_counter.txt https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAD6ImeA91we2nBWBhpfaEOqa/metaphlan2_homebrew_counter.txt?dl=1
mpa_latest  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAyoJpOgcjop41VIHAGWIVLa/mpa_latest?dl=1
mpa_v20_m200_marker_info.txt.bz2    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABKU_RAK5yOzyhV27NpOduDa/mpa_v20_m200_marker_info.txt.bz2?dl=1
mpa_v20_m200.md5    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADS8nukx3dSoiR82OHw6dOka/mpa_v20_m200.md5?dl=1
mpa_v20_m200.tar    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAASBOj-2gAbA53cV1bXBULYa/mpa_v20_m200.tar?dl=1
mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADhE2Ur7JqirifOdgi4fGQEa/mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v29_CHOCOPhlAn_201901.md5   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADdxAWsjLPLjy10VICgSAEPa/mpa_v29_CHOCOPhlAn_201901.md5?dl=1
mpa_v29_CHOCOPhlAn_201901.tar   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAACeczBU6P9lIBD4ZYtxwKva/mpa_v29_CHOCOPhlAn_201901.tar?dl=1
mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADoxBNynVopWt2shYYKQ2Mba/mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v292_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAApZLRLD0Bkb86bvH-Y3-tUa/mpa_v292_CHOCOPhlAn_201901.md5?dl=1
mpa_v292_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABK4Zns2PYUh_R-sLGEx_Bza/mpa_v292_CHOCOPhlAn_201901.tar?dl=1
mpa_v293_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAsg2PN5Ng6uwHnEemlqo3-a/mpa_v293_CHOCOPhlAn_201901.md5?dl=1
mpa_v293_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABFg8C3gMyNpYNTAY5PcSONa/mpa_v293_CHOCOPhlAn_201901.tar?dl=1
mpa_v294_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA6KBc-nd8_C4bJOPVjzWW7a/mpa_v294_CHOCOPhlAn_201901.md5?dl=1
mpa_v294_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAJSiG2qkfNMcYzhvz4Rxsga/mpa_v294_CHOCOPhlAn_201901.tar?dl=1
mpa_v295_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACIWLuO_0ixZsxiSZZG-4M1a/mpa_v295_CHOCOPhlAn_201901.md5?dl=1
mpa_v295_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA0y5qvPjJdxVw86Ovjinipa/mpa_v295_CHOCOPhlAn_201901.tar?dl=1
mpa_v296_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAbRMEnsewv_pHFsLv2Be0va/mpa_v296_CHOCOPhlAn_201901.md5?dl=1
mpa_v296_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABAD51gd3wr0___2MWO0xD-a/mpa_v296_CHOCOPhlAn_201901.tar?dl=1
mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAv_ShZiz7pNTaT_YONJTF7a/mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAlyQITZuUCtBUJxpxhIroIa/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v30_CHOCOPhlAn_201901.tar   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADlxibskzbPHPoDl6S-FyKka/mpa_v30_CHOCOPhlAn_201901.tar?dl=1
mpa_v30_CHOCOPhlAn_201901.md5   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACTzoUYDqZps8u2JqWCNCODa/mpa_v30_CHOCOPhlAn_201901.md5?dl=1
SRS019033.fastq https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACh1NQExDk39RXzZOyTPmQwa/SRS019033.fastq?dl=1
strainphlan_homebrew_counter.txt    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAZO5uydQ2QQkOSkYu5DxRha/strainphlan_homebrew_counter.txt?dl=1

Win下载:marker genes数据

# 解压 15M -> 336M
bzip2 -d mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
less -S mpa_v30_CHOCOPhlAn_201marker_info.txt | wc -l
# 1132198 # 113万行

一个物种有多个marker genes,因此总基因大于物种数

# 剖开看看
39444__GeneID:11747987  {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le
39444__GeneID:11747985  {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le
39444__GeneID:11747982  {'ext': ['PRJNA15142', 'PRJNA14595', 'PRJNA20981'], 'sco
39444__GeneID:11747981  {'ext': [], 'score': 0.0, 'clade': 's__Cotia_virus', 'le

bzip2压缩了20倍,牛皮呀呀呀呀呀呀

Win下载:基因组数据

tar -xvf mpa_v30_CHOCOPhlAn_201901.tar
rm mpa_v30_CHOCOPhlAn_201901.tar
bzip2 -d mpa_v30_CHOCOPhlAn_201901.fna.bz2
# 敲开kankan>1000373__GeneID:11569613
less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | head
# 是基因的序列信息
>1000373__GeneID:11604940
>1000373__GeneID:11604942
>1000373__GeneID:11604944
>100053__V6HSA9__LEP1GSC062_4244 UniRef90_V6HSA9;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HSV5__LEP1GSC062_0144 UniRef90_V6HSV5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HSX2__LEP1GSC062_1624 UniRef90_V6HSX2;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HT46__LEP1GSC062_1702 UniRef90_V6HT46;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HT56__LEP1GSC062_1778 UniRef90_V6HT56;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HTZ5__LEP1GSC062_0133 UniRef90_V6HTZ5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | awk -F" " '{print $2}' | sed -e '/^$/d' | awk -F";" '{print $2}' | uniq  | wc -l
# 10164 行(物种)

bowtie2建数据库索引

bowtie2-build --threads 8 \
-f mpa_v30_CHOCOPhlAn_201901.fna \
mpa_v30_CHOCOPhlAn_201901
source /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
conda activate metaphlan
route="/hwfssz5/ST_META/TMP1T.2/soil"
metaphlan3db="/hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/metaphlan_v30/"
mkdir $route/Result/metaphlan3/$infile
cat $route/Result/kneaddata/$infile/${infile}_1_kneaddata.repeats.removed.*.fastq > $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq
metaphlan $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq \
--input_type fastq \
--bowtie2db $metaphlan3db \
-x mpa_v30_CHOCOPhlAn_201901 \
--nproc 8 \
-o $route/Result/metaphlan3/$infile/${infile}_metaphlan3.out
rm $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq

--input_type {fastq,fasta,bowtie2out,sam}
--bowtie2db METAPHLAN_BOWTIE2_DB
-x INDEX Specify the id of the database version to use
--bt2_ps BowTie2 presets [default very-sensitive]
--add_viruses Allow the profiling of viral organisms
--ignore_eukaryotes Do not profile eukaryotic organisms
--ignore_bacteria Do not profile bacterial organisms
--ignore_archaea Do not profile archeal organisms
--nproc N The number of CPUs to use for parallelizing the mapping [default 4]

qsub提交任务:

route="/hwfssz5/ST_META/TMP1T.2/soil"
qsub -cwd -l vf=80G,p=8 -q st_supermem.q -P st_supermem -binding linear:8 $route/script/q_metaphlan3.sh -i 49
conda activate metaphlan
python /route/metaphlan/bin/merge_metaphlan_tables.py \
-o ./abundance/abundance_merged.txt \
./abundance/*.out

unifrac样本距离

下载metaphlan nwk文件:
https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk

R安装依赖:

conda activate r411
conda install r-vegan r-ape r-rbiom

获取计算R脚本:
https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/calculate_unifrac.R

计算距离:

Rscript $unifrac_route/calculate_unifrac.R \
abundance_merged.txt \
$nwk_route/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk \
abundance_merged_unifrac_profiles.txt

警告nwk信息不全

WARNING: 4 species not present in the tree were removed from the input profile.
Nitrolancea_hollandica
Singulisphaera_acidiphila
Leptothrix_ochracea
Sutterella_parvirubra

hclust2.py画热图

conda activate -c bioconda hclust2
hclust2.py \
  -i abundance_merged.txt \