
How to scrape the main body of text from multiple pages using URLs from a txt file

0 followers

I'm trying to write a script that requests several URLs and saves all of the scraped text into a single txt file, but I don't know where to put a loop without breaking everything.

This is what the code looks like right now:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
def getReadMe():
    with open('urls.txt','r') as file:
        return file.read()
def getHtml(readMe):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status() 
    return response.content
readMe = getReadMe()
print(readMe)
html = getHtml(readMe)
soup = BeautifulSoup(html, 'html.parser')
text = soup.find_all(text=True)
output =''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    'style',
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
with open("copy.txt", "w") as file:
    file.write(str(output))
    
python
python-3.x
web-scraping
beautifulsoup
Reijarmo
Posted on 2020-11-23
2 Answers
alphaBetaGamma
Posted on 2020-11-24
Accepted
0 upvotes
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
def getReadMe():
    with open('urls.txt','r') as file:
        return file.read()
def getHtml(readMe):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status() 
    return response.content
readMe = getReadMe()
print(readMe)
for line in readMe:
    html = getHtml(line)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.find_all(text=True)
    output =''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
        'style',
    ]
    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)
    print(output)
    # opening the file in mode "a" appends each page's text instead of overwriting the file
    with open("copy.txt", "a") as file:
        file.write(str(output))

Try this and see if it works.
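One side note on this approach (my own suggestion, not part of the answer): because the file is opened in append mode inside the loop, copy.txt also keeps growing across separate runs of the script. A minimal sketch of truncating it once before the loop starts:

# clear copy.txt once at the start so text from earlier runs doesn't pile up
open("copy.txt", "w").close()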

Hi, thank you very much for your help, but I still get some errors, for example: File "C:/Users/x/AppData/Local/Programs/Python/Python39/Web Srapping Loop V2.py", line 14, in getHtml response = requests.get(readMe,headers=header,timeout=3) File "C:\Users\x\AppData\Local\Programs\Python\Python39\lib\site-packages\requests-2.25.0-py3.9.egg\requests\models.py", line 390, in prepare_url raise MissingSchema(error) requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
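For reference, the MissingSchema error for the URL 'h' happens because getReadMe() returns the whole file as one string, so "for line in readMe:" iterates over it character by character and passes single characters to requests.get. A minimal sketch of splitting the contents into one URL per loop iteration (assuming urls.txt holds one URL per line):

urls = getReadMe().splitlines()   # one URL per line -> list of strings
for line in urls:
    url = line.strip()            # drop stray whitespace/newlines
    if url:                       # skip empty lines
        html = getHtml(url)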
Reijarmo
Posted on 2020-11-24
0 upvotes

A friend of a friend provided me with an answer that works, at least for English URLs. Once I figure out how to fix it for German URLs (right now they crash pretty hard ^^), I'll post that as well.

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
def getHtml(url):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(url, headers=header, timeout=10)
    response.raise_for_status() 
    return response.content
with open('urls.txt','r') as fd:
    for i, line in enumerate(fd.readlines()):
        url = line.strip()
        print("scraping " + url + "...")
        html = getHtml(url)
        soup = BeautifulSoup(html, 'html.parser')
        text = soup.find_all(text=True)
        output =''
        blacklist = [
            '[document]',
            'noscript',
            'header',
            'html',
            'meta',
            'head', 
            'input',
            'script',
            'style',
        ]
        for t in text:
            if t.parent.name not in blacklist:
                output += '{} '.format(t)
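The crashes on German pages are most likely an encoding issue: page text containing umlauts can raise UnicodeEncodeError when written with Windows' default encoding. A minimal sketch (an assumption on my part, not part of the posted answer) of writing the collected text with an explicit UTF-8 encoding inside the same loop:

        # write with an explicit encoding so non-ASCII characters survive
        with open("copy.txt", "a", encoding="utf-8") as out:
            out.write(output)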