UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me it didn't recognize the encoding, even though the string contains no special characters. I don't know if that's useful information.

My code looks like this:

from urllib import FancyURLopener
import os
class MyOpener(FancyURLopener): #spoofs a real browser on Windows
   version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
print "What is the webaddress?"
webaddress = raw_input("8::>")
print "Folder Name?"
foldername = raw_input("8::>")
if not os.path.exists(foldername):
    os.makedirs(foldername)
def urlpuller(start, page):
   while page[start]!= '"':
      start += 1
   close = start
   while page[close]!='"':
      close += 1
   return page[start:close]
myopener = MyOpener()
response = myopener.open(webaddress)
site = response.read()
nexturl = ''
counter = 0
while(nexturl!=webaddress):
   counter += 1
   start = 0
   for i in range(len(site)-35):
       if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
         start = i + 40
         break
   else:
      print "Something's broken, chief. Error = 1"
   next = 0
   for i in range(start, 8, -1):
      if site[i:i+8] == u'<a href=':
         next = i
         break
   else:
      print "Something's broken, chief. Error = 2"
   nexturl = urlpuller(next, site)
   myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')
print("Retrieval of "+foldername+" completed.")
When I run it against the site I'm scraping, it returns this error:
Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data
When pointed at http://google.com, it worked just fine. The site I'm scraping declares UTF-8 in its header:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
but when I try to decode using UTF-8, as you can see, it does not work.
Any suggestions?
                @Daniel I read the documentation, but I'm unclear as to how to decode the site once I've opened it.
– user3701032
                Jun 2, 2014 at 23:44
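You cannot partition the bytes you've received at arbitrary offsets and then ask UTF-8 to decode each piece. UTF-8 is a multibyte encoding: a single character takes anywhere from 1 to 4 bytes. If a slice ends partway through a character, Python raises the "unexpected end of data" error. That is what happened here: 0xc3 is the first byte of a two-byte sequence, and position 34 is the last byte of your 35-byte slice, so its continuation byte was cut off.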
Use a tool that handles this for you. BeautifulSoup and lxml are two options.
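To make the failure concrete, here is a minimal sketch. The sample bytes are made up for illustration and are not taken from the asker's site, and `slice_decodes` is a hypothetical helper. It shows a slice that ends on the lead byte 0xc3 failing to decode, then the simpler approach of decoding the whole page once and searching the resulting text.

```python
# Hypothetical sample page: U+00E9 (e with acute accent) is encoded in
# UTF-8 as the two bytes 0xc3 0xa9.
page = u'<img id="imgSized" class="slideImg" alt="caf\u00e9">'.encode('utf-8')

def slice_decodes(data, start, end):
    """Return True if data[start:end] is valid UTF-8 on its own."""
    try:
        data[start:end].decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# End the slice right after the lead byte, dropping its continuation byte.
cut = page.index(b'\xc3') + 1
broken = slice_decodes(page, 0, cut)

# Decode once, then search the text instead of fixed-width byte windows.
text = page.decode('utf-8')
marker = u'<img id="imgSized" class="slideImg"'
found_at = text.find(marker)  # -1 if the marker is absent
```

Searching the decoded text with `find` also removes the need for the 35-byte window loop in the question.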
                You would need some kind of streaming UTF-8 decoder so that you know where you can safely break off your string. Alternatively, you can decode the whole page at once (don't split up your string).
– Martin Konecny
                Jun 2, 2014 at 23:42
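The streaming decoder mentioned in this comment could look roughly like the sketch below, which uses the standard library's `codecs.getincrementaldecoder`. The function name `decode_chunks` and the sample chunks are assumptions made for illustration; in a real scraper the chunks would come from repeated `response.read(n)` calls.

```python
import codecs

def decode_chunks(chunks):
    """Decode an iterable of byte chunks, buffering split characters."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    pieces = []
    for chunk in chunks:
        # An incomplete trailing sequence is held back until the next chunk.
        pieces.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if bytes are still pending.
    pieces.append(decoder.decode(b'', final=True))
    return u''.join(pieces)

# Hypothetical data: the two bytes of U+00E9 land in different chunks.
chunks = [b'caf\xc3', b'\xa9 au lait']
text = decode_chunks(chunks)
```

For a page that fits in memory, decoding the full result of `response.read()` in one call is simpler and avoids the problem entirely.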
                I'm trying to use BeautifulSoup now. What would I do to find the img with the ID imgSized?
– user3701032
                Jun 3, 2014 at 0:00
                I'm able to search img, but I'm not sure why it's having problems with the tags. I was able to isolate the image I need, but ideally I'd like to be able to search for the link associated with the mouse over text as well.
– user3701032
                Jun 3, 2014 at 0:52
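In BeautifulSoup, the lookup asked about here is usually written as `soup.find('img', id='imgSized')`. As a dependency-free sketch of the same idea, the example below uses the standard library's `HTMLParser` instead, which is a swapped-in technique and not what the answer recommends. The class name `SlideFinder`, the sample markup, and the assumption that the image sits inside the `<a>` tag and carries its mouse-over text in a `title` attribute are all hypothetical; the real page may be structured differently.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7

class SlideFinder(HTMLParser):
    """Record the img with id="imgSized" and the <a> that encloses it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_href = None   # href of the <a> we are currently inside
        self.result = None      # filled in once the image is found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a':
            self.open_href = attrs.get('href')
        elif tag == 'img' and attrs.get('id') == 'imgSized':
            self.result = {
                'src': attrs.get('src'),
                'title': attrs.get('title'),   # assumed mouse-over text
                'link': self.open_href,
            }

    def handle_endtag(self, tag):
        if tag == 'a':
            self.open_href = None

# Hypothetical markup; feed the parser decoded text, not raw bytes.
html = (u'<a href="/slides/2"><img id="imgSized" class="slideImg" '
        u'src="/img/1.jpg" title="caf\u00e9"></a>')
finder = SlideFinder()
finder.feed(html)
finder.close()
```

After parsing, `finder.result` holds the image source, its `title` text, and the enclosing link, or stays `None` if no matching image was seen.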