
How to scrape the main body of text from multiple pages using URLs from a txt file

0 followers

I'm trying to write a script that requests several URLs and saves all of the scraped text into a single txt file, but I don't know where to put a loop without breaking everything.

This is what the code looks like right now:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
def getReadMe():
    with open('urls.txt','r') as file:
        return file.read()
def getHtml(readMe):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status() 
    return response.content
readMe = getReadMe()
print(readMe)
html = getHtml(readMe)
soup = BeautifulSoup(html, 'html.parser')
text = soup.find_all(text=True)
output =''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    'style',
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
with open("copy.txt", "w") as file:
    file.write(str(output))
    
python
python-3.x
web-scraping
beautifulsoup
Reijarmo
Posted on 2020-11-23
2 Answers
alphaBetaGamma
Posted on 2020-11-24
Accepted
0 upvotes
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
def getReadMe():
    with open('urls.txt','r') as file:
        return file.read()
def getHtml(readMe):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status() 
    return response.content
readMe = getReadMe()
print(readMe)
for line in readMe:
    html = getHtml(line)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.find_all(text=True)
    output =''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
        'style',
    ]
    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)
    print(output)
    # opening the file in mode "a" appends each page's text instead of overwriting the file
    with open("copy.txt", "a") as file:
        file.write(str(output))

Try this and see if it works.
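One side note on this approach (my own suggestion, not part of the answer): because the file is opened in append mode inside the loop, copy.txt also keeps growing across separate runs of the script. A minimal sketch of truncating it once before the loop starts:

# clear copy.txt once at the start so text from earlier runs doesn't pile up
open("copy.txt", "w").close()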

Hi, thank you very much for your help, but I still get some errors, for example: File "C:/Users/x/AppData/Local/Programs/Python/Python39/Web Srapping Loop V2.py", line 14, in getHtml response = requests.get(readMe,headers=header,timeout=3) File "C:\Users\x\AppData\Local\Programs\Python\Python39\lib\site-packages\requests-2.25.0-py3.9.egg\requests\models.py", line 390, in prepare_url raise MissingSchema(error) requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
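For reference, the MissingSchema error for the URL 'h' happens because getReadMe() returns the whole file as one string, so "for line in readMe:" iterates over it character by character and passes single characters to requests.get. A minimal sketch of splitting the contents into one URL per loop iteration (assuming urls.txt holds one URL per line):

urls = getReadMe().splitlines()   # one URL per line -> list of strings
for line in urls:
    url = line.strip()            # drop stray whitespace/newlines
    if url:                       # skip empty lines
        html = getHtml(url)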
Reijarmo
Posted on 2020-11-24
0 upvotes

A friend of a friend provided me with an answer that works, at least for English URLs. Once I figure out how to fix it for German URLs (right now they crash pretty hard ^^), I'll post that as well.

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
def getHtml(url):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(url, headers=header, timeout=10)
    response.raise_for_status() 
    return response.content
with open('urls.txt','r') as fd:
    for i, line in enumerate(fd.readlines()):
        url = line.strip()
        print("scraping " + url + "...")
        html = getHtml(url)
        soup = BeautifulSoup(html, 'html.parser')
        text = soup.find_all(text=True)
        output =''
        blacklist = [
            '[document]',
            'noscript',
            'header',
            'html',
            'meta',
            'head', 
            'input',
            'script',
            'style',
        ]
        for t in text:
            if t.parent.name not in blacklist:
                output += '{} '.format(t)
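The crashes on German pages are most likely an encoding issue: page text containing umlauts can raise UnicodeEncodeError when written with Windows' default encoding. A minimal sketch (an assumption on my part, not part of the posted answer) of writing the collected text with an explicit UTF-8 encoding inside the same loop:

        # write with an explicit encoding so non-ASCII characters survive
        with open("copy.txt", "a", encoding="utf-8") as out:
            out.write(output)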