I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, python2.7 told me it didn't recognize the encoding, despite no special characters. Don't know if that's useful info.
My code looks like this:
from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener): #spoofs a real browser on Window
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")
print "Folder Name?"
foldername = raw_input("8::>")
if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
    while page[start]!= '"':
        start += 1
    close = start
    while page[close]!='"':
        close += 1
    return page[start:close]

myopener = MyOpener()
response = myopener.open(webaddress)
site = response.read()
nexturl = ''
counter = 0
while(nexturl!=webaddress):
    counter += 1
    start = 0
    for i in range(len(site)-35):
        if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
            start = i + 40
            break
    else:
        print "Something's broken, chief. Error = 1"
    next = 0
    for i in range(start, 8, -1):
        if site[i:i+8] == u'<a href=':
            next = i
            break
    else:
        print "Something's broken, chief. Error = 2"
    nexturl = urlpuller(next, site)
    myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')
print("Retrieval of "+foldername+" completed.")
When I try to run it using the site I'm using, it returns the error:
Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data
When pointed at http://google.com, it worked just fine. The page I'm scraping declares its encoding as UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
but when I try to decode using utf-8, as you can see, it does not work.
Any suggestions?
–
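You cannot slice the bytes you've received at arbitrary offsets and then ask UTF-8 to decode each slice. UTF-8 is a multibyte encoding: one character takes anywhere from 1 to 4 bytes (the original specification allowed up to 6). If a slice ends partway through a character, Python throws the unexpected end of data error. That is what happened here: 0xc3 is the first byte of a two-byte sequence, and it sits at position 34, the last byte of your 35-byte slice, so the byte that completes the character was cut off. Decode the whole response once, then search the resulting text.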
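To make the failure concrete, here is a minimal sketch that reproduces it on made-up data and then searches the decoded text instead. The sample page, `window_decodes`, and the variable names are assumptions for illustration, not part of the original script; only the marker string comes from the question. It is written to run on both Python 2.7 and Python 3.

```python
# The marker string from the question.
MARKER = u'<img id="imgSized" class="slideImg"'

# Hypothetical page content: 34 ASCII characters, then a two-byte character
# (e-acute, encoded as 0xc3 0xa9), then the marker we are looking for.
sample_text = (u'<p>' + u'a' * 31 + u'\u00e9</p>' + MARKER +
               u' src="http://example.com/1.jpg">')
sample_bytes = sample_text.encode('utf-8')


def window_decodes(raw_bytes, start, width=35):
    """Return True if the byte window [start, start + width) is valid UTF-8."""
    try:
        raw_bytes[start:start + width].decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# The window starting at 0 ends on the lone lead byte 0xc3, so decoding fails.
print("window at 0 decodes: {0}".format(window_decodes(sample_bytes, 0)))

# Decoding the whole buffer once and searching the text avoids the problem.
page_text = sample_bytes.decode('utf-8')
marker_index = page_text.find(MARKER)
print("marker found at character {0}".format(marker_index))
```

Because the search runs on characters rather than bytes, no window can ever end in the middle of a character.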
Better still, use an HTML parser that handles decoding and tag matching for you. BeautifulSoup and lxml are two options.
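As a rough illustration of the parser approach, the sketch below uses the standard library's `HTMLParser` as a dependency-free stand-in for BeautifulSoup or lxml, so it is not the API of either library. The class name `SlideImageFinder`, the sample markup, and the example.com URLs are hypothetical. It assumes the image of interest carries `id="imgSized"`, as in the question.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class SlideImageFinder(HTMLParser):
    """Records the src of <img id="imgSized"> and every <a href> it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.image_src = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'img' and attributes.get('id') == 'imgSized':
            self.image_src = attributes.get('src')
        elif tag == 'a' and 'href' in attributes:
            self.links.append(attributes['href'])


# Hypothetical markup standing in for the bytes from response.read().
sample_bytes = (
    u'<a href="http://example.com/next">caf\u00e9</a>'
    u'<img id="imgSized" class="slideImg" src="http://example.com/1.jpg">'
).encode('utf-8')

finder = SlideImageFinder()
finder.feed(sample_bytes.decode('utf-8'))  # decode the whole page once
finder.close()

print(finder.image_src)
print(finder.links)
```

The parser works on attributes rather than fixed offsets, so the hand-counted `start = i + 40` and the quote-scanning helper are no longer needed.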