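1. Motivation

While working on a research project, my supervisor asked me to download the reference genomes of many species from NCBI in bulk and to collect assembly information for each of them. There were so many genomes that collecting everything by hand, one page at a time, would have been slow and tedious, so I decided to write a Python crawler for the job.

2. Crawling approach

2.1 Find the pages to crawl and compare their URLs

Take the pig, horse, cattle, and sheep reference genomes as examples: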

# Sus scrofa (pig)
https://www.ncbi.nlm.nih.gov/assembly/GCA_000003025.6
# Equus caballus (horse)
https://www.ncbi.nlm.nih.gov/assembly/GCF_002863925.1
# Bos taurus (cattle)
https://www.ncbi.nlm.nih.gov/assembly/GCF_002263795.1
# Ovis aries (sheep)
https://www.ncbi.nlm.nih.gov/assembly/GCF_016772045.1
......
urls = "https://www.ncbi.nlm.nih.gov/assembly/{assembly_ID}"

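Most reference genomes on NCBI are addressed by their assembly accession (GenBank accessions start with GCA_, RefSeq accessions with GCF_). Once we have the accession for each species of interest, we can go straight to the page that holds the assembly information for that reference genome.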
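As a minimal sketch of the URL pattern above, the snippet below builds one page URL per accession. The helper name build_assembly_urls and the accession pattern (GCA_ or GCF_, nine digits, a dot, and a version number) are assumptions inferred from the examples in this article, not an official NCBI specification.

```python
import re

BASE_URL = "https://www.ncbi.nlm.nih.gov/assembly/"
# Assumed accession shape, inferred from the examples: GCA_/GCF_ + 9 digits + ".version"
ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")

def build_assembly_urls(accessions):
    """Return one assembly-page URL per well-formed accession; skip everything else."""
    urls = []
    for raw in accessions:
        accession = raw.strip()  # remove surrounding whitespace and newlines
        if ACCESSION.fullmatch(accession):
            urls.append(BASE_URL + accession)
    return urls

for url in build_assembly_urls(["GCA_000003025.6", "GCF_002863925.1\n", "not an accession"]):
    print(url)
```

Skipping malformed lines early keeps a typo in the input list from turning into a confusing parsing error later on.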

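2.2 Decide what to collect and whether a second request is needed

The information to collect falls into three parts (the regions outlined in red in the original screenshot of the assembly page):

Part 1 is the basic information for each assembly. Pick whichever fields you need, such as Assembly name, Organism name, and Genome coverage.
Part 2 is the assembly statistics, which mainly reflect the quality of the assembly. I recommend collecting all of them.
Part 3 is the FTP directory used for regular downloads. It holds the reference genome, CDS sequences, and annotation files such as GFF and GTF. Because this directory has its own URL, it needs a second request.
This article downloads only the reference genome, i.e. the .fna file. Protein (.faa) or annotation (.gff, .gtf) files can be downloaded in the same way if needed.

2.3 Locate the required information in the page source

Right-click the page or press Ctrl+U to open the page source, then use Ctrl+F to find where each piece of information sits:

Part 1: basic assembly information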

<div><div><div id="summary_cont"><div class="col margin_r0 nine_col"><div id="summary"><h1 xmlns:math="http://exslt.org/math" class="marginb0 margin_t0">Sscrofa11.1</h1><input type="hidden" value="true" id="ftp-genbank-refseq-exist" /><dl xmlns:math="http://exslt.org/math" class="assembly_summary_new margin_t0"><dt>Description: </dt><dd>Sscrofa11 with Y sequences from WTSI_X_Y_pig V2</dd><dt>Organism name: </dt><dd><a href="/Taxonomy/Browser/wwwtax.cgi?mode=Info&amp;id=9823&amp;lvl=3&amp;lin=f&amp;keep=1&amp;srchmode=1&amp;unlock"><span class="highlight" style="background-color:">Sus scrofa</span> (pig)</a></dd><dt>Infraspecific name: </dt><dd>Breed: Duroc</dd><dt>Isolate: </dt><dd>TJ Tabasco</dd><dt>Sex: </dt><dd>female</dd><dt>BioSample: </dt><dd><a href="/biosample/SAMN02953785/">SAMN02953785</a></dd><dt>BioProject: </dt><dd><a href="/bioproject/PRJNA13421/">PRJNA13421</a></dd><dt>Submitter: </dt><dd>The Swine Genome Sequencing Consortium (SGSC)</dd><dt>Date: </dt><dd>2017/02/07</dd><dt>Synonyms: </dt><dd>susScr11</dd><dt>Assembly level: </dt><dd>Chromosome</dd><dt>Genome representation: </dt><dd>full</dd><dt>RefSeq category: </dt><dd>representative genome</dd><dt>GenBank assembly accession: </dt><dd>GCA_000003025.6 (<span class="highlight" style="background-color:">latest</span>)</dd><dt>RefSeq assembly accession: </dt><dd>GCF_000003025.6 (<span class="highlight" style="background-color:">latest</span>)</dd><dt>RefSeq assembly and GenBank assembly identical: </dt><dd>no (<a href="#assembly-diff" id="assembly-diff-trigger">hide details</a>)</dd><dd id="assembly-diff"><ul><li>Only in RefSeq: chromosome MT (in non-nuclear assembly-unit)</li></ul></dd><dd class="displayed-from-refseq"><ul style="margin-left:0;"><li>Data displayed for RefSeq version</li></ul></dd><dt>WGS Project: </dt><dd><a href="/nuccore/AEMK00000000.2/">AEMK02</a></dd><dt>Assembly method: </dt><dd>Falcon v. 
OCT-2015</dd><dt>Expected final version: </dt><dd>yes</dd><dt>Genome coverage: </dt><dd>65.0x</dd><dt>Sequencing technology: </dt><dd>PacBio</dd></dl><div xmlns:math="http://exslt.org/math" style="clear:both"></div><p style="color:grey;"><span>IDs: </span><span>1004191 [UID] </span><span>4121818 [GenBank] </span><span>4192498 [RefSeq] </span></p></div></div><div class="more_genome_data-cont"><div class="more_genome_data shadow margin_r1"><h3>See <a href="/genome/?term=txid9823[orgn]">Genome</a> Information for
                    <em><span class="highlight" style="background-color:">Sus scrofa</span></em></h3></div><div class="more_genome_data shadow margin_r1 links_to_isolate" data-accession="GCA_000003025.6"><h3>Pathogen Detection Resources</h3><ul><li><a href="#" id="link-to-isolate">Isolate Browser </a></li><li><a href="#" id="link-to-snp-tree">SNP Tree Viewer</a></li></ul></div><div class="more_genome_data genome_nav shadow margin_r1"><h3>There are 26 assemblies for this organism</h3><a href="/assembly/organism/9823/latest/">See more</a></div></div><div id="asm_history_cont"><div id="asm_history_cont"><h2 class="sec_header margin_b0 rev_history_tg" id="revision-history">History <a href="#" class="jig-ncbitoggler" data-ncbitoggler-toggles="asb_history">(Show
                        revision history)</a></h2><div class="jig-ncbigrid asb_history" id="asb_history" style="display: none;"><table class="margin_t0 "><thead><tr><th>GenBank Assembly<br />Accession</th><th></th><th>RefSeq Assembly<br />Accession</th><th>Assembly<br /> Name</th><th>Assembly<br />Level</th><th>Status</th></tr></thead><tbody><tr class="current_asm"><td><a href="https://www.ncbi.nlm.nih.gov/assembly/1004191/" target="_blank">GCA_000003025.6</a></td><td>≠</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/1004191/" target="_blank">GCF_000003025.6</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.6/">Sscrofa11.1</a></td><td>Chromosome</td><td><span class="highlight" style="background-color:">Latest</span> GenBank, <span class="highlight" style="background-color:">Latest</span> RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/905331/" target="_blank">GCA_000003025.5</a></td><td><span class="note_gray">n/a</span></td><td><span class="note_gray">n/a</span></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_000003025.5/">Sscrofa11</a></td><td>Chromosome</td><td>Replaced GenBank</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/304498/" target="_blank">GCA_000003025.4</a></td><td>≠</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/304498/" target="_blank">GCF_000003025.5</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.5/">Sscrofa10.2</a></td><td>Chromosome</td><td>Replaced GenBank, Replaced RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/284398/" target="_blank">GCA_000003025.3</a></td><td>≠</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/284398/" target="_blank">GCF_000003025.4</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.4/">Sscrofa10</a></td><td>Chromosome</td><td>Replaced GenBank, Replaced RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/111518/" 
target="_blank">GCA_000003025.2</a></td><td>=</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/111518/" target="_blank">GCF_000003025.3</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.3/">Sscrofa9.2</a></td><td>Chromosome</td><td>Replaced GenBank, Replaced RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/5178/" target="_blank">GCA_000003025.1</a></td><td><span class="note_gray">n/a</span></td><td><span class="note_gray">n/a</span></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_000003025.1/">Sscrofa9</a></td><td>Chromosome</td><td>Replaced GenBank</td></tr><tr><td><span class="note_gray">n/a</span></td><td><span class="note_gray">n/a</span></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/4418/" target="_blank">GCF_000003025.2</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.2/">Sscrofa5</a></td><td>Chromosome</td><td>Replaced RefSeq</td></tr></tbody></table></div></div></div><div id="asm_comment_cont"><div id="asm_comment_cont"><h2 class="sec_header margin_b0">Comment</h2><pre class="asm_comment_text"><span class="asm_comment_visible">This pig genome sequence (Sscrofa11) has been released by the International Swine Genome Sequencing Consortium under the terms of the Toronto Statement (Nature 2009, 461: 168). The Consortium is coordinating genome-wide analysis, annotation and publication. 
Part 2: assembly statistics
The sequence data from </span><span class="asm_comment_dot">... </span><span class="asm_comment_more">which this assembly was constructed largely comprise 65x genome coverage in whole genome shotgun (WGS) Pacific Biosciences long reads (Pacific Biosciences RSII, with P6/C4 chemistry). Illumina HiSeq2500 WGS paired-end and mate pair reads were used for final error correction using PILON. Sanger and Oxford Nanopore sequence data from a few CHORI-242 BAC clones were used to fill gaps. <span class="highlight" style="background-color:">All</span> the WGS data were generated from a single Duroc female (TJ Tabasco, also known as Duroc 2-14) which was also the source of DNA for the CHORI-BAC library. 
Sscrofa11 replaces the previous assembly, Sscrofa10.2, which was largely established from the same Duroc 2-14 DNA source. </span> <a href="#" class="asm_comment_more">more</a></pre><div></div></div></div><div id="global-stats"><div id="global-stats"><h2 class="margin_b0 sec_header">Global statistics</h2><table summary="Global statistics" class="margin_t0 jig-ncbigrid"><tbody><tr><td>Number of regions with alternate loci or patches</td><td class="align_r">2</td></tr><tr><td>Total sequence length</td><td class="align_r">2,501,912,388</td></tr><tr><td>Total ungapped length</td><td class="align_r">2,472,047,747</td></tr><tr><td>Gaps between scaffolds</td><td class="align_r">93</td></tr><tr><td>Number of scaffolds</td><td class="align_r">706</td></tr><tr><td>Scaffold N50</td><td class="align_r">88,231,837</td></tr><tr><td>Scaffold L50</td><td class="align_r">9</td></tr><tr><td>Number of contigs</td><td class="align_r">1,118</td></tr><tr><td>Contig N50</td><td class="align_r">48,231,277</td></tr><tr><td>Contig L50</td><td class="align_r">15</td></tr><tr><td>Total number of chromosomes and plasmids</td><td class="align_r">21</td></tr><tr><td>Number of component sequences (WGS or clone)</td><td class="align_r">1,308</td></tr></tbody></table></div></div></div></div><input id="asm-has-egap-annot" type="hidden" value="true" /><script type="text/javascript" src="/projects/genome/uud/js/uud.js"></script><script src="/projects/genome/trackmgr/0.7/js/tms.js"></script></div>
            <div id="messagearea_bottom"> 
Part 3: the FTP directory, which has its own URL
      <a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1">FTP directory for RefSeq assembly</a>
      <a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/003/025/GCA_000003025.6_Sscrofa11.1">FTP directory for GenBank assembly</a>
Part 3, continued: the FTP directory page and the download link
# Page title of the FTP directory ("Index of ..."):
/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1
# Page URL (pig RefSeq assembly as the example):
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/
# Download link for the reference genome:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_genomic.fna.gz
# The general pattern for the download link is therefore:
https://ftp.ncbi.nlm.nih.gov + /genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1 + /GCF_000003025.6_Sscrofa11.1 + _genomic.fna.gz
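The pattern above can be sketched as a small function that assembles the link from an accession and an assembly name. This generalizes from the single pig example: the function name genome_fasta_url is hypothetical, and assembly names containing spaces or special characters may be written differently in the real directory name, so the link crawled from the page remains the safer source.

```python
FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"

def genome_fasta_url(accession, assembly_name):
    """Build the genomic FASTA link following the pattern of the pig example."""
    prefix, rest = accession.split("_", 1)   # "GCF", "000003025.6"
    digits = rest.split(".", 1)[0]           # "000003025"
    # Split the digits into groups of three: "000", "003", "025"
    groups = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    folder = f"{accession}_{assembly_name}"  # "GCF_000003025.6_Sscrofa11.1"
    directory = "/".join([FTP_ROOT, prefix, *groups, folder])
    return f"{directory}/{folder}_genomic.fna.gz"

print(genome_fasta_url("GCF_000003025.6", "Sscrofa11.1"))
```

For the pig RefSeq assembly this reproduces the download link listed above.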
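3. Implementation
3.1 Provide the input file assemble_list.txt
        Each line holds the assembly accession of one species of interest. Look these up in bulk according to your own needs.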
GCA_000003025.6
GCF_002863925.1
GCF_002263795.1
GCF_016772045.1
GCF_003369695.1
GCF_000247795.1
# Page URL for each accession
url = "https://www.ncbi.nlm.nih.gov/assembly/" + str(sample)
3.2 Request step: send the request and fetch the page source
import requests

def information_collect(url):
    # Send a browser-like User-Agent with the request
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    response = requests.get(url, headers=headers)
    page_content = response.text
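Long crawls can hit slow or failed responses, so one possible hardening of the request step is a timeout plus a few retries. The sketch below is not the code used in this article: it swaps requests for the standard library's urllib so that it runs without extra packages, and the names fetch_text and fetch_with_retry, the shortened User-Agent, and the retry settings are all illustrative assumptions.

```python
import time
import urllib.request

HEADERS = {"user-agent": "Mozilla/5.0"}  # shortened placeholder User-Agent

def fetch_text(url, timeout=30):
    """Download one page with the standard library and return it as text."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def fetch_with_retry(url, fetch=fetch_text, retries=3, delay=2.0):
    """Call fetch(url); after a failure wait `delay` seconds and try again."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # timeouts, connection errors, HTTP errors
            last_error = error
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Passing the download function in as the fetch argument makes the retry logic easy to try out with a fake function before pointing it at the real site.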
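3.3 Processing step: match the fields with regular expressions
3.3.1 Part 1: matching the release information of the reference genome
    # Some assemblies have no sequencing-technology field, so it is matched separately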
    jiexi_1_1=re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>',re.S)
    result_1_1=jiexi_1_1.findall(page_content)
    jiexi_1_2=re.compile(r'Sequencing technology.*?<dd>(.*?)</dd>',re.S)
    result_1_2=jiexi_1_2.findall(page_content)
    if result_1_2 == []:
        result_1_2 = ["NA"]  # placeholder when the field is missing
    result_1 = list_change(result_1_1) + result_1_2[0] + "\t"
    # list_change is a helper of my own that turns the findall result into a tab-separated string
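A different way to cope with missing fields is to read every <dt>/<dd> pair of the summary into a dictionary and look fields up by name. This is only a sketch under the assumption that the summary keeps the <dt>Label: </dt><dd>value</dd> layout shown in section 2.3; the names parse_summary and pick are hypothetical, and the sample HTML is a shortened piece of the pig page.

```python
import re

# One label/value pair of the summary list, e.g. <dt>Date: </dt><dd>2017/02/07</dd>
DT_DD = re.compile(r"<dt>([^<:]+):\s*</dt>\s*<dd>(.*?)</dd>", re.S)
TAG = re.compile(r"<[^>]+>")  # any HTML tag, used to strip links and spans

def parse_summary(page_content):
    """Map each summary label to its text value with inner tags removed."""
    fields = {}
    for label, value in DT_DD.findall(page_content):
        fields[label.strip()] = TAG.sub("", value).strip()
    return fields

def pick(fields, labels, missing="NA"):
    """Return the requested fields in order, using a placeholder for absent ones."""
    return [fields.get(label, missing) for label in labels]

sample = ('<dt>Organism name: </dt><dd><a href="/x"><span class="highlight">Sus scrofa</span> (pig)</a></dd>'
          '<dt>Submitter: </dt><dd>The Swine Genome Sequencing Consortium (SGSC)</dd>'
          '<dt>Date: </dt><dd>2017/02/07</dd>')
fields = parse_summary(sample)
print("\t".join(pick(fields, ["Organism name", "Date", "Sequencing technology"])))
```

Because each field is looked up on its own, a page without Sequencing technology yields the placeholder instead of breaking the whole match.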
3.3.2 Part 2: matching the assembly statistics
    # Assembly statistics of the reference genome
    jiexi_2=re.compile(r'Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>',re.S)
    result_2=jiexi_2.findall(page_content)
    result=str(result_1)+list_change(result_2)
    # Alternative: match every field with one pattern (this finds nothing if any field is missing)
    #jiexi=re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>.*?Sequencing technology.*?<dd>(.*?)</dd>.*?Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>',re.S)
    #result=jiexi.findall(page_content) 
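The statistics table can also be read row by row, which avoids one very long pattern that fails as soon as a single row is absent. This is a sketch assuming the row layout shown in section 2.3 (<tr><td>label</td><td class="align_r">value</td></tr>); parse_global_stats is a hypothetical name and the sample rows are copied from the pig page.

```python
import re

# One row of the "Global statistics" table
STAT_ROW = re.compile(r'<tr><td>([^<]+)</td><td class="align_r">([^<]*)</td></tr>')

def parse_global_stats(page_content):
    """Map each statistic label to an int where possible, otherwise to the raw text."""
    stats = {}
    for label, value in STAT_ROW.findall(page_content):
        text = value.replace(",", "").strip()  # "2,501,912,388" -> "2501912388"
        stats[label.strip()] = int(text) if text.isdigit() else text
    return stats

sample = ('<tr><td>Total sequence length</td><td class="align_r">2,501,912,388</td></tr>'
          '<tr><td>Scaffold N50</td><td class="align_r">88,231,837</td></tr>'
          '<tr><td>Scaffold L50</td><td class="align_r">9</td></tr>')
stats = parse_global_stats(sample)
print(stats.get("Scaffold N50"), stats.get("Contig N50", "NA"))
```

Storing the numbers as integers makes later sorting and filtering by N50 straightforward.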
3.3.3 Part 3: matching the download link
    # Follow the FTP directory link with a second request
    #xiazai=re.compile(r'Statistics report</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>',re.S) 
    xiazai=re.compile(r'FTP directory for RefSeq assembly</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>',re.S)
    xiazai_url=xiazai.findall(page_content)
    #print(xiazai_url)
    xiazai_response=requests.get(xiazai_url[0],headers=headers)
    xiazai_page_content=xiazai_response.text
    # print(xiazai_page_content)
    xiazai_url_jiexi_1=re.compile(r'<title>Index of (.*?)</title>',re.S)
    xiazai_url_1=xiazai_url_jiexi_1.findall(xiazai_page_content)
    #print(xiazai_url_1)
    xiazai_url_jiexi_2=re.compile(r'fna.gz">(.*?)</a>',re.S)
    xiazai_url_2=xiazai_url_jiexi_2.findall(xiazai_page_content)
    #print(xiazai_url_2)
    # Assemble the download link
    final_url=str("https://ftp.ncbi.nlm.nih.gov"+str(xiazai_url_1[0]+'/'+xiazai_url_2[0]))
    information=str(result+final_url) 
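An FTP directory can list more than one file ending in .fna.gz (for example CDS or RNA sequences next to the genome), so taking the first match is not always the genome itself. The sketch below picks the genomic FASTA explicitly and returns None when nothing matches, instead of failing on an empty list. The function name find_genome_fasta and the sample listing are illustrative; the listing is assumed to contain plain <a href="file">file</a> links.

```python
import re

HREF = re.compile(r'<a href="([^"]+)">')  # every link target in the listing

def find_genome_fasta(listing_html):
    """Return the file name of the genomic FASTA, or None if the listing has none."""
    for name in HREF.findall(listing_html):
        # Keep "..._genomic.fna.gz" but skip "..._cds_from_genomic.fna.gz" and similar
        if name.endswith("_genomic.fna.gz") and "_from_genomic" not in name:
            return name
    return None

listing = ('<a href="GCF_000003025.6_Sscrofa11.1_cds_from_genomic.fna.gz">cds</a>\n'
           '<a href="GCF_000003025.6_Sscrofa11.1_genomic.fna.gz">genome</a>\n'
           '<a href="GCF_000003025.6_Sscrofa11.1_genomic.gff.gz">annotation</a>\n')
print(find_genome_fasta(listing))
```

A caller can then skip or log an assembly whose listing yields None and carry on with the next accession.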
3.4 Main program
# Main program
if __name__ == '__main__':
    all_sample_lists = sample_list("assemble_list.txt")  # read the accession file into a Python list
    for sample in all_sample_lists:  # loop over the accessions
        url = url_get(sample)  # build the page URL
        save_information = information_collect(url)  # fetch and parse the page
        #print(save_information)
        information_save(save_information)  # append the record to the output file
    print("over")
4. The complete program
# Import modules
import re
import requests

def sample_list(list_path):  # path to the accession list file
    with open(list_path, 'r') as slist:
        sample_lists = []
        for sample in slist:
            sample = sample.strip()  # drop the trailing newline so it does not end up in the URL
            if sample:  # skip blank lines
                sample_lists.append(sample)
        return sample_lists  # list of accessions

def url_get(sample):  # build the assembly page URL
    url = "https://www.ncbi.nlm.nih.gov/assembly/" + str(sample)
    return url

def information_collect(url):  # fetch one page and parse it
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    response = requests.get(url, headers=headers)
    page_content = response.text
    # all_information=re.compile(r'<div id="summary_cont">.*?<div id="messagearea_bottom">',re.S).findall(page_content)
    # Release information of the reference genome
#     jiexi_1=re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>.*?Sequencing technology.*?<dd>(.*?)</dd>',re.S)
#     result_1=jiexi_1.findall(page_content)
    # Match the sequencing technology separately, since some assemblies lack it
    jiexi_1_1=re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>',re.S)
    result_1_1=jiexi_1_1.findall(page_content)
    #print(result_1_1)
    #print(list_change(result_1_1))
    jiexi_1_2=re.compile(r'Sequencing technology.*?<dd>(.*?)</dd>',re.S)
    result_1_2=jiexi_1_2.findall(page_content)
    #print(result_1_2)
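    if result_1_2 == []:
        result_1_2 = ["NA"]  # placeholder when the page has no sequencing-technology field
    result_1 = list_change(result_1_1) + result_1_2[0] + "\t"
    #print(result_1)
    # Assembly statistics of the reference genome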
    jiexi_2=re.compile(r'Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>',re.S)
    result_2=jiexi_2.findall(page_content)
    result=str(result_1)+list_change(result_2)
    #jiexi=re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>.*?Sequencing technology.*?<dd>(.*?)</dd>.*?Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>',re.S)
    #result=jiexi.findall(page_content)
    # Follow the FTP directory link with a second request
    #xiazai=re.compile(r'Statistics report</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>',re.S)
    xiazai=re.compile(r'FTP directory for RefSeq assembly</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>',re.S)
    xiazai_url=xiazai.findall(page_content)
    #print(xiazai_url)
    xiazai_response=requests.get(xiazai_url[0],headers=headers)
    xiazai_page_content=xiazai_response.text
    # print(xiazai_page_content)
    xiazai_url_jiexi_1=re.compile(r'<title>Index of (.*?)</title>',re.S)
    xiazai_url_1=xiazai_url_jiexi_1.findall(xiazai_page_content)
    #print(xiazai_url_1)
    xiazai_url_jiexi_2=re.compile(r'fna.gz">(.*?)</a>',re.S)
    xiazai_url_2=xiazai_url_jiexi_2.findall(xiazai_page_content)
    #print(xiazai_url_2)
    # Assemble the download link
    final_url=str("https://ftp.ncbi.nlm.nih.gov"+str(xiazai_url_1[0]+'/'+xiazai_url_2[0]))
    information=str(result+final_url)
    return(information)

def information_save(save_information):  # append one record to the output file
    with open('./information_collection.txt', mode='a') as file:
        file.write(save_information + '\n')

def list_change(save_list):  # flatten the findall result (a list of tuples) into a tab-separated string
    inf = ''
    for o in save_list:
        for i in o:
            inf = inf + i + '\t'
    #print(inf)
    return inf

# Main program
if __name__ == '__main__':
    all_sample_lists = sample_list("assemble_list.txt")  # read the accession file into a Python list
    for sample in all_sample_lists:  # loop over the accessions
        url = url_get(sample)  # build the page URL
        save_information = information_collect(url)  # fetch and parse the page
        #print(save_information)
        information_save(save_information)  # append the record to the output file
    print("over")
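5. Results
Each output line holds, in order: assembly name, organism, submitter, date, GenBank assembly accession, sequencing technology, the eleven assembly statistics in the order used by the regular expression above, and the download link.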
$/实验记录本/爬虫实战/NCBI信息获取/NCBI_collection.py
ARS-UCD1.2 - bosTau9	Bos taurus (cattle)	USDA ARS	2018/04/11	GCA_002263795.2 (latest)	PacBio; Illumina NextSeq 500; Illumina HiSeq; Illumina GAII	2,715,853,792	2,715,825,630	0	2,211	103,308,737	12	2,597	25,896,116	32	31	2,211	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/263/795/GCA_002263795.2_ARS-UCD1.2/GCA_002263795.2_ARS-UCD1.2_genomic.fna.gz
UOA_Brahman_1	Bos indicus x Bos taurus (hybrid cattle)	University of Adelaide	2018/11/30	GCA_003369695.2 (latest)	PacBio Sequel; PacBio RSII; Illumina NextSeq	2,680,953,056	2,679,316,559	0	1,250	104,466,507	11	1,552	26,764,281	32	30	1,250	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/369/695/GCA_003369695.2_UOA_Brahman_1/GCA_003369695.2_UOA_Brahman_1_genomic.fna.gz
Bos_indicus_1.0	Bos indicus (zebu cattle)	Genoa Biotecnologia SA	2014/11/25	GCA_000247795.2 (latest)	SOLiD	2,673,965,444	2,475,828,999	0	32	106,310,653	11	253,770	28,375	25,227	32	253,770	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/247/795/GCA_000247795.2_Bos_indicus_1.0/GCA_000247795.2_Bos_indicus_1.0_genomic.fna.gz
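The output can be imported straight into Excel for further processing, and the collected download links can be fetched in bulk with wget on Linux, which saves a lot of time. The code still has plenty of room for improvement, though. For example, missing fields are not handled well in the regex matching: if an assembly page lacks some of the fields, the program raises an error. Matching each field separately and adding conditional checks would fix this. I taught myself Python crawling purely out of interest. I started knowing nothing, learned by studying and practising on other people's projects, and finally built something of my own, which gave me a real sense of accomplishment. There is always more to learn. Keep going!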
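To make the Excel and wget steps mentioned above a little smoother, the results could be written as a tab-separated table with a header row, plus a plain list of links. This is a sketch rather than part of the original program: save_results, the three-column layout, and the file names are illustrative assumptions, and the sample record reuses values from the output shown earlier.

```python
import csv

COLUMNS = ["Assembly name", "Organism name", "Download URL"]  # illustrative subset of fields

def save_results(records, table_path, url_path):
    """Write a TSV table with a header row and a one-link-per-line file for wget."""
    with open(table_path, "w", newline="", encoding="utf-8") as table:
        writer = csv.writer(table, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        for record in records:
            writer.writerow([record.get(column, "NA") for column in COLUMNS])
    with open(url_path, "w", encoding="utf-8") as urls:
        for record in records:
            if record.get("Download URL"):  # skip records without a link
                urls.write(record["Download URL"] + "\n")

records = [{
    "Assembly name": "Bos_indicus_1.0",
    "Organism name": "Bos indicus (zebu cattle)",
    "Download URL": "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/247/795/"
                    "GCA_000247795.2_Bos_indicus_1.0/GCA_000247795.2_Bos_indicus_1.0_genomic.fna.gz",
}]
save_results(records, "information_collection.tsv", "genome_urls.txt")
```

The link file can then be passed to wget with its -i option (wget -i genome_urls.txt) to download every genome in one run.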