
Scrape content that follows a specific element using BeautifulSoup and CSS selectors instead of lxml and XPath

1 downvote

I want to scrape the "Services/Products" section from this page: https://www.yellowpages.com/deland-fl/mip/ryan-wells-pumps-20533306?lid=1001782175490

The text is inside a dd element that always comes right after the dt labeled

Services/Products
I created the code to scrape this text, using lxml and XPath:
import requests
from lxml import html
url = ""
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)
t = html.fromstring(r.content)
path = '//dd[preceding-sibling::dt[contains(.,"Services/Products")]]'
matches = t.xpath(path + '/text()[1]')
products = matches[0] if matches else ''

Is there any way to get the same text with BeautifulSoup (and CSS selectors, if possible) instead of lxml and XPath?
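For context, the XPath preceding-sibling check above can be mirrored in CSS with soupsieve's non-standard `:-soup-contains()` pseudo-class plus the adjacent-sibling combinator `+`. A minimal sketch on an inline HTML sample (the markup is illustrative, not the live page; soupsieve ships with modern BeautifulSoup installs):

```python
from bs4 import BeautifulSoup

html = """<dl>
<dt>General Info</dt><dd>Family owned</dd>
<dt>Services/Products</dt><dd>Well Drilling, Pump Repair</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")
# :-soup-contains() matches the dt whose text contains the label;
# "+" then selects the dd element immediately following it
dd = soup.select_one('dt:-soup-contains("Services/Products") + dd')
print(dd.get_text(strip=True))  # Well Drilling, Pump Repair
```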

python
web-scraping
beautifulsoup
lxml
max scender
Posted on 2020-08-24
2 Answers
euphoria
Posted on 2020-08-24
0 upvotes

Try working with BeautifulSoup and Requests; it's much easier. Here is some code:

# BeautifulSoup is an HTML parser. You can find specific elements in a BeautifulSoup object
from bs4 import BeautifulSoup
from requests import get
url = "https://www.yellowpages.com/deland-fl/mip/ryan-wells-pumps-20533306?lid=1001782175490"
obj = BeautifulSoup(get(url).content, "html.parser")
# Gets the section with the Services
business_info = obj.find("section", {"id":"business-info"})
# Getting all <dd> elements (since you can pick the one you need from the list)
all_dd = business_info.find_all("dd")
# Finds the specific tag with the text you need
services_and_products = all_dd[2]
# Gets the text
text = services_and_products.text
# All Done
print(text)
    
I don't want to get the element by position; all_dd[2] won't work on other pages, since the position varies from page to page.
Jack Fleeting
Posted on 2020-08-24
0 upvotes

Try something like this on your page:

# assumes `soup` was built from the page HTML as in the answer above
inf = soup.select_one('section#business-info dl')
target = inf.find("dt", text='Services/Products').nextSibling
for t in target.stripped_strings:
    print(t)
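One caveat about the snippet above: in pretty-printed HTML, `.nextSibling` of the dt is often a whitespace text node rather than the dd itself; `find_next_sibling("dd")` skips straight to the next dd element. A minimal sketch on an inline HTML sample (the markup is illustrative, not the live page):

```python
from bs4 import BeautifulSoup

html = """<section id="business-info">
<dl>
<dt>Services/Products</dt>
<dd>Well Drilling &amp; Pump Repair</dd>
</dl>
</section>"""
soup = BeautifulSoup(html, "html.parser")
dt = soup.select_one("section#business-info dl").find("dt", string="Services/Products")
# .nextSibling would return the "\n" text node here;
# find_next_sibling("dd") jumps to the element we want
dd = dt.find_next_sibling("dd")
print(list(dd.stripped_strings))
```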