Python: Ignore 'Incorrect padding' error when base64 decoding

I have some data that is base64 encoded that I want to convert back to binary even if there is a padding error in it. If I use
base64.decodestring(b64_string)
it raises an 'Incorrect padding' error. Is there another way?
UPDATE: Thanks for all the feedback. To be honest, all the methods mentioned sounded a bit hit and miss, so I decided to try openssl. The following command worked a treat:
openssl enc -d -base64 -in b64string -out binary_data
                Did you actually TRY using base64.b64decode(strg, '-_')? That is a priori, without you bothering to supply any sample data, the most likely Python solution to your problem. The "methods" proposed were DEBUG suggestions, NECESSARILY "hit and miss" given the paucity of the information supplied.
– John Machin
                May 31, 2010 at 22:03
                @John Machin: Yes, I did TRY your method but it didn't work. The data is company confidential.
– FunLovinCoder
                Jun 1, 2010 at 10:56
                Could you provide the output of this: sorted(list(set(b64_string))) please?  Without revealing anything company-confidential, that should reveal which characters were used to encode the original data, which in turn may supply enough information to provide a non-hit-or-miss solution.
– Brian Carcich
                Feb 2, 2019 at 17:19
                Yes, I know it's already solved, but, to be honest, the openssl solution also sounds hit-or-miss to me.
– Brian Carcich
                Feb 2, 2019 at 17:25
It seems you just need to add padding to your bytes before decoding. There are many other answers on this question, but I want to point out that (at least in Python 3.x) base64.b64decode will truncate any extra padding, provided there is enough in the first place.
So, something like: b'abc=' works just as well as b'abc==' (as does b'abc=====').
What this means is that you can just add the maximum number of padding characters that you would ever need—which is two (b'==')—and base64 will truncate any unnecessary ones.
This lets you write:
base64.b64decode(s + b'==')
which is simpler than:
base64.b64decode(s + b'=' * (-len(s) % 4))
Note that if the string s already has some padding (e.g. b"aGVsbG8="), this approach will only work if the validate keyword argument is set to False (which is the default). If validate is True this will result in a binascii.Error being raised if the total padding is longer than two characters.
From the docs:
If validate is False (the default), characters that are neither in the normal base-64 alphabet nor the alternative alphabet are discarded prior to the padding check.  If validate is True, these non-alphabet characters in the input result in a binascii.Error.
However, if validate is False (or left blank to be the default) you can blindly add two padding characters without any problem. Thanks to eel ghEEz for pointing this out in the comments.
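A minimal sketch of the two variants discussed above, assuming Python 3. The helper names decode_padded and decode_padded_strict are made up for illustration; only base64.b64decode and its validate argument come from the standard library.

```python
import base64
import binascii


def decode_padded(s: bytes) -> bytes:
    """Decode base64 whose trailing '=' padding may be missing.

    Hypothetical helper: it relies on b64decode ignoring surplus
    padding when validate is False (the default).
    """
    return base64.b64decode(s + b"==")


def decode_padded_strict(s: bytes) -> bytes:
    """Same idea with validate=True, where the padding must be exact."""
    return base64.b64decode(s + b"=" * (-len(s) % 4), validate=True)


# Unpadded input decodes with either helper.
print(decode_padded(b"aGVsbG8"))
print(decode_padded_strict(b"aGVsbG8"))

# Already-padded input plus '==' is rejected once validation is on.
try:
    base64.b64decode(b"aGVsbG8=" + b"==", validate=True)
except binascii.Error as exc:
    print("validate=True rejected surplus padding:", exc)
```

The strict variant computes the exact padding with -len(s) % 4, so it only suits input that carries no padding of its own.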
                Okay that's not too "ugly" thanks :) By the way I think you never need more than 2 padding chars. Base64 algorithm works on groups of 3 chars at a time and only needs padding when your last group of chars is only 1 or 2 chars in length.
– Otto
                Nov 13, 2018 at 14:02
                @Otto the padding here is for decoding, which works on groups of 4 chars. Base64 encoding works on groups of 3 chars :)
– Henry Woody
                Dec 25, 2018 at 7:21
                but if you know that during encoding maximally 2 will ever be added, which may become "lost" later, forcing you to re-add them before decoding, then you know you will only need to add maximally 2 during decoding too. #ChristmasTimeArgumentForTheFunOfIt
– Otto
                Dec 27, 2018 at 10:08
                @Otto I believe you are right. While a base64 encoded string with length, for example, 5 would require 3 padding characters, a string of length 5 is not even a valid length for a base64 encoded string. You'd get the error: binascii.Error: Invalid base64-encoded string: number of data characters (5) cannot be 1 more than a multiple of 4. Thanks for pointing this out!
– Henry Woody
                Jan 1, 2019 at 19:15
                @HenryWoody, grep -A23 "def b64decode" /usr/lib/python3.10/base64.py github.com/python/cpython/blob/v3.10.9/Lib/base64.py#L85 shows a regex b'[A-Za-z0-9+/]*={0,2}' followed by a raise. A newer version may have a similar strict behaviour, github.com/python/cpython/blob/a87c46e/Modules/binascii.c#L427
– eel ghEEz
                Jan 6 at 19:02
As said in other responses, there are various ways in which base64 data could be corrupted.
However, as Wikipedia says, removing the padding (the '=' characters at the end of base64 encoded data) is "lossless":
  From a theoretical point of view, the padding character is not needed,
  since the number of missing bytes can be calculated from the number
  of Base64 digits.
So if this is really the only thing "wrong" with your base64 data, the padding can just be added back. I came up with this to be able to parse "data" URLs in WeasyPrint, some of which were base64 without padding:
import base64
import re

def decode_base64(data, altchars=b'+/'):
    """Decode base64, padding being optional.

    :param data: Base64 data as an ASCII byte string
    :returns: The decoded byte string.
    """
    # Normalize: drop every byte outside the base64 alphabet (including '=').
    data = re.sub(rb'[^a-zA-Z0-9%s]+' % altchars, b'', data)
    missing_padding = len(data) % 4
    if missing_padding:
        data += b'=' * (4 - missing_padding)
    return base64.b64decode(data, altchars)
Tests for this function: weasyprint/tests/test_css.py#L68
                To clarify on @ariddell comment base64.decodestring has been deprecated for base64.decodebytes in Py3 but for version compatibility better to use base64.b64decode.
– Cas
                Jul 3, 2017 at 9:30
                Because the base64 module does ignore invalid non-base64 characters in the input, you first have to normalise  the data. Remove anything that's not a letter, digit / or +, and then add the padding.
– Martijn Pieters
                Nov 20, 2018 at 8:25
                @bp: In base64 encoding each 24 bits (3 bytes) binary input is encoded as 4 bytes output. output_len % 3 makes no sense.
– John Machin
                May 31, 2010 at 10:10
                Just appending === always works. Any extra = chars are seemingly safely discarded by Python.
– Asclepius
                Nov 24, 2019 at 21:30
                Dankje for the one-liner! Why the inner modulo though? ((4 - len(b64_string)) % 4) seems to return the same results for all values, including edge cases like len=0.
– Luc
                Nov 10, 2021 at 12:03
"Incorrect padding" can mean not only "missing padding" but also (believe it or not) "incorrect padding".
If suggested "adding padding" methods don't work, try removing some trailing bytes:
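import base64, binascii
lens = len(strg)
# Drop the trailing partial group, or the whole last group if the length is already a multiple of 4.
lenx = lens - (lens % 4 if lens % 4 else 4)
try:
    result = base64.b64decode(strg[:lenx])
except binascii.Error:
    result = None  # still not decodable; inspect the data further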
Update: Any fiddling around adding padding or removing possibly bad bytes from the end should be done AFTER removing any whitespace, otherwise length calculations will be upset.
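The ordering described in the update can be sketched like this, assuming Python 3. The function name decode_after_cleanup is hypothetical, and the fallback of dropping the trailing partial group is only one possible reading of "removing possibly bad bytes from the end".

```python
import base64
import binascii


def decode_after_cleanup(text: str) -> bytes:
    """Strip whitespace first, then repair the length, then decode."""
    # Remove whitespace (spaces, tabs, newlines) anywhere in the data,
    # so that the length arithmetic below is not thrown off.
    compact = "".join(text.split())
    try:
        # First attempt: restore padding that may have been lost.
        return base64.b64decode(compact + "=" * (-len(compact) % 4))
    except binascii.Error:
        # Fallback: drop the trailing incomplete group of characters.
        usable = len(compact) - len(compact) % 4
        return base64.b64decode(compact[:usable])


print(decode_after_cleanup("aGVs\nbG8 "))
```

The fallback discards data, so it only makes sense when a partial result is acceptable.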
It would be a good idea if you showed us a (short) sample of the data that you need to recover. Edit your question and copy/paste the result of print(repr(sample)).
Update 2: It is possible that the encoding has been done in an url-safe manner. If this is the case, you will be able to see minus and underscore characters in your data, and you should be able to decode it by using base64.b64decode(strg, '-_')
If you can't see minus and underscore characters in your data, but can see plus and slash characters, then you have some other problem, and may need the add-padding or remove-cruft tricks.
If you can see none of minus, underscore, plus and slash in your data, then you need to determine the two alternate characters; they'll be the ones that aren't in [A-Za-z0-9]. Then you'll need to experiment to see which order they need to be used in the 2nd arg of base64.b64decode()
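As a rough sketch of that inspection, assuming Python 3: look at which non-alphanumeric characters occur and pick the alphabet accordingly. Both function names are hypothetical, and anything other than the standard or URL-safe alphabet is reported as an error here instead of being guessed.

```python
import base64
import string

_ALNUM = set(string.ascii_letters + string.digits)


def extra_characters(data: str) -> set:
    """Return the characters that are not letters, digits, '=' or whitespace."""
    return {c for c in data if c not in _ALNUM and c != "=" and not c.isspace()}


def decode_guessing_alphabet(data: str) -> bytes:
    """Choose the standard or URL-safe alphabet from the characters seen."""
    data = "".join(data.split())  # whitespace would skew the length check
    extras = extra_characters(data)
    padded = data + "=" * (-len(data) % 4)
    if extras and extras <= {"-", "_"}:
        return base64.b64decode(padded, altchars=b"-_")
    if extras <= {"+", "/"}:
        return base64.b64decode(padded)
    raise ValueError("unexpected characters in data: %r" % sorted(extras))
```

Data that mixes both alphabets, or uses some third pair of characters, ends up in the ValueError branch and needs the manual experimentation described above.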
Update 3: If your data is "company confidential":

(a) you should say so up front

(b) we can explore other avenues in understanding the problem, which is highly likely to be related to what characters are used instead of + and / in the encoding alphabet, or by other formatting or extraneous characters.
One such avenue would be to examine what non-"standard" characters are in your data, e.g.
from collections import defaultdict
import string

# Count every character that is not an ASCII letter or digit.
d = defaultdict(int)
s = set(string.ascii_letters + string.digits)
for c in your_data:
    if c not in s:
        d[c] += 1
print(d)
                The data is comprised from the standard base64 character set. I'm pretty sure the problem is because 1 or more characters are missing - hence the padding error. Unless, there is a robust solution in Python, I'll go with my solution of calling openssl.
– FunLovinCoder
                Jun 2, 2010 at 13:13
                A "solution" that silently ignores errors is scarcely deserving of the term "robust". As I mentioned earlier, the various Python suggestions were methods of DEBUGGING to find out what the problem is, preparatory to a PRINCIPLED solution ... aren't you interested in such a thing?
– John Machin
                Jun 2, 2010 at 13:32
                My requirement is NOT to solve the problem of why the base64 is corrupt - it comes from a source I have no control over. My requirement is to provide information about the data received even if it is corrupt. One way to do this is to get the binary data out of the corrupt base64 so I can glean information from the underlying ASN.1. stream. I asked the original question because I wanted an answer to that question not the answer to another question - such as how to debug corrupt base64.
– FunLovinCoder
                Jun 2, 2010 at 14:01
                Just normalize the string, remove anything that is not a Base64 character. Anywhere, not just start or end.
– Martijn Pieters
                Nov 20, 2018 at 8:32
                The underlying binary data is ASN.1. Even with corruption I want to get back to the binary because I can still get some useful info from the ASN.1 stream.
– FunLovinCoder
                May 31, 2010 at 7:57
The "Incorrect padding" error can also occur because metadata is sometimes present in the encoded string.
If your string looks something like: 'data:image/png;base64,...base 64 stuff....'
then you need to remove the first part before decoding it.
For example, if you have a base64-encoded image string, try the snippet below:
from PIL import Image
from io import BytesIO
from base64 import b64decode
imagestr = 'data:image/png;base64,...base 64 stuff....'
im = Image.open(BytesIO(b64decode(imagestr.split(',')[1])))
im.save("image.png")
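If PIL is not needed, the prefix handling alone could look like the sketch below, assuming Python 3. The function name decode_data_url is hypothetical, and the check only covers headers that end in ";base64".

```python
import base64


def decode_data_url(url: str) -> bytes:
    """Decode the payload of a base64 'data:' URL (hypothetical helper)."""
    # Everything before the first comma is metadata, not base64.
    header, sep, payload = url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    # Restore padding in case the producer dropped it.
    payload += "=" * (-len(payload) % 4)
    return base64.b64decode(payload)


print(decode_data_url("data:text/plain;base64,aGVsbG8"))
```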
Check the documentation of the data source you're trying to decode. Is it possible that you meant to use base64.urlsafe_b64decode(s) instead of base64.b64decode(s)? That's one reason you might have seen this error message.
  Decode string s using a URL-safe alphabet, which substitutes - instead
  of + and _ instead of / in the standard Base64 alphabet.
This is for example the case for various Google APIs, like Google's Identity Toolkit and Gmail payloads.
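A small sketch, assuming Python 3, that combines the URL-safe alphabet with restored padding. urlsafe_b64decode checks padding just like b64decode, so the two problems often show up together. The helper name is made up.

```python
import base64


def urlsafe_decode_unpadded(s: str) -> bytes:
    """Decode URL-safe base64 ('-' and '_' alphabet); padding is optional."""
    # -len(s) % 4 is the number of '=' needed to reach a multiple of 4.
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))


token = base64.urlsafe_b64encode(b"\xfb\xff\xfe\xfb").decode().rstrip("=")
print(token, urlsafe_decode_unpadded(token))
```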
                This does not answer the question at all.  Plus, urlsafe_b64decode also requires padding.
– rdb
                Aug 2, 2016 at 11:33
                Well, there was an issue I had before answering this question, which was related to Google's Identity Toolkit. I was getting the incorrect padding error (I believe it was on the server) even though the padding appeared to be correct. Turned out that I had to use base64.urlsafe_b64decode.
– Daniel F
                Aug 2, 2016 at 13:20
                I agree that it doesn't answer the question, rdb, yet it was exactly what I needed to hear as well. I rephrased the answer to a bit nicer tone, I hope this works for you, Daniel.
– Henrik Heimbuerger
                Jun 11, 2018 at 6:26
                Perfectly fine. I didn't notice that it sounded somewhat unkind, I only thought that it would be the quickest fix if it would fix the issue, and, for that reason, should be the first thing to be tried. Thanks for your change, it is welcome.
– Daniel F
                Jun 11, 2018 at 10:29
Adding the padding is rather... fiddly.  Here's the function I wrote with the help of the comments in this thread as well as the wiki page for base64 (it's surprisingly helpful) https://en.wikipedia.org/wiki/Base64#Padding.
import base64
import binascii
import logging

def base64_decode(s):
    """Add missing padding to string and return the decoded base64 string."""
    log = logging.getLogger()
    if isinstance(s, bytes):
        s = s.decode('ascii')
    s = s.strip()
    try:
        return base64.b64decode(s)
    except binascii.Error:  # Python 3; Python 2 raised TypeError here
        padding = len(s) % 4
        if padding == 1:
            # One leftover character can never be completed by padding.
            log.error("Invalid base64 string: {}".format(s))
            return b''
        elif padding == 2:
            s += '=='
        elif padding == 3:
            s += '='
        return base64.b64decode(s)
There are two ways to correct the input data described here, or, more specifically and in line with the OP, to make the base64 module's b64decode function process the input data without raising an uncaught exception:
1. Append == to the end of the input data and call base64.b64decode(...)
2. If that raises an exception, then
   i. Catch it via try/except,
   ii. (R?)Strip any = characters from the input data (N.B. this may not be necessary),
   iii. Append A== to the input data (A== through P== will work),
   iv. Call base64.b64decode(...) with those A==-appended input data
Item 1. or Item 2. above will yield the desired result.
Caveats
This does not guarantee the decoded result will be what was originally encoded, but it will (sometimes?) give the OP enough to work with:
  Even with corruption I want to get back to the binary because I can still get some useful info from the ASN.1 stream.
See What we know and Assumptions below.
TL;DR
From some quick tests of base64.b64decode(...)
it appears that it ignores non-[A-Za-z0-9+/] characters; that includes ignoring =s unless they are the last character(s) in a parsed group of four, in which case the =s terminate the decoding (a=b=c=d= gives the same result as abc=, and a==b==c== gives the same result as ab==).
It also appears that all characters appended are ignored after the point where base64.b64decode(...) terminates decoding e.g. from an = as the fourth in a group.
As noted in several comments above, there are either zero, or one, or two, =s of padding required at the end of input data for when the [number of parsed characters to that point modulo 4] value is 0, or 3, or 2, respectively.  So, from items 3. and 4. above, appending two or more =s to the input data will correct any [Incorrect padding] problems in those cases.
HOWEVER, decoding cannot handle the case where the [total number of parsed characters modulo 4] is 1, because it takes at least two encoded characters to represent the first decoded byte in a group of three decoded bytes.  In uncorrupted encoded input data, this [N modulo 4]=1 case never happens, but as the OP stated that characters may be missing, it could happen here.  That is why simply appending =s will not always work, and why appending A== will work when appending == does not.  N.B. Using [A] is all but arbitrary:  it adds only cleared (zero) bits to the decoded output, which may or may not be correct, but then the object here is not correctness but completion by base64.b64decode(...) sans exceptions.
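The two-step approach can be sketched as follows, assuming Python 3 and input that contains only characters from the standard alphabet. The function name decode_sans_exceptions is hypothetical and this is not the code from the linked wrapper.

```python
import base64
import binascii


def decode_sans_exceptions(data: bytes) -> bytes:
    """Decode possibly truncated base64; completion matters, not correctness."""
    try:
        # Step 1: surplus '=' is ignored, so '==' covers lengths 0, 2, 3 (mod 4).
        return base64.b64decode(data + b"==")
    except binascii.Error:
        # Step 2: length is 1 (mod 4). 'A' supplies six zero bits so the
        # dangling character can be turned into one (possibly wrong) byte.
        return base64.b64decode(data.rstrip(b"=") + b"A==")


print(decode_sans_exceptions(b"aGVsbG8"))  # 3 (mod 4): padding is enough
print(decode_sans_exceptions(b"aGVsb"))    # 1 (mod 4): needs the 'A==' step
```

The last decoded byte in the second case is built partly from invented zero bits, so treat it as suspect.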
What we know from the OP and especially subsequent comments is
It is suspected that there are missing data (characters) in the Base64-encoded input data
The Base64 encoding uses the standard 64 place-values plus padding: A-Z; a-z; 0-9; +; /; = is padding.  This is confirmed, or at least suggested, by the fact that openssl enc ... works.
Assumptions
The input data contain only 7-bit ASCII data
The only kind of corruption is missing encoded input data
The OP does not care about decoded output data at any point after that corresponding to any missing encoded input data
Github
Here is a wrapper to implement this solution:
https://github.com/drbitboy/missing_b64
In my case Gmail Web API was returning the email content as a base64 encoded string, but instead of encoded with the standard base64 characters/alphabet, it was encoded with the "web-safe" characters/alphabet variant of base64. The + and / characters are replaced with - and _. For python 3 use base64.urlsafe_b64decode().
                This answer does not seem related to the question. Could you please explain more in where the issue was located and how it is related?
– darclander
                Sep 5, 2020 at 13:14
                I got this issue in Django while running the application in my Chrome browser. Normally the Django application runs on localhost, but today it didn't work there, so I had to change localhost to 127.0.0.1, and now it works. It also works in other browsers such as Firefox without changing localhost.
– Nooras Fatima Ansari
                Sep 5, 2020 at 17:28
In case this error came from a web server: try URL-encoding your POST value. I was POSTing via "curl" and discovered that I wasn't URL-encoding my base64 value, so characters like "+" were not escaped and the web server's URL-decode logic converted each + to a space.
"+" is a valid base64 character and perhaps the only one that gets mangled by an unexpected URL-decode.
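To illustrate, here is a sketch using urllib.parse from the Python 3 standard library. repair_plus_signs is a made-up helper, and it assumes the received value contained no legitimate spaces before the server decoded it.

```python
import base64
from urllib.parse import quote_plus, unquote_plus


def repair_plus_signs(received: str) -> str:
    """Undo an accidental URL-decode by turning spaces back into '+'."""
    return received.replace(" ", "+")


original = base64.b64encode(b"\xfb\xef\xbe hello").decode()

# What the server sees when the client forgot to URL-encode the value.
mangled = unquote_plus(original)
print(repr(mangled), "->", repr(repair_plus_signs(mangled)))

# The cleaner fix is on the client side: encode before sending.
assert unquote_plus(quote_plus(original)) == original
```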
In my case I faced that error while parsing an email. I got the attachment as a base64 string and extracted it via re.search. It turned out that there was a strange additional substring at the end.
dHJhaWxlcgo8PCAvU2l6ZSAxNSAvUm9vdCAxIDAgUiAvSW5mbyAyIDAgUgovSUQgWyhcMDAyXDMz
MHtPcFwyNTZbezU/VzheXDM0MXFcMzExKShcMDAyXDMzMHtPcFwyNTZbezU/VzheXDM0MXFcMzEx
KV0KPj4Kc3RhcnR4cmVmCjY3MDEKJSVFT0YK
--_=ic0008m4wtZ4TqBFd+sXC8--
When I deleted --_=ic0008m4wtZ4TqBFd+sXC8-- and stripped the string, decoding worked.
So my advice is to make sure that you are decoding a correct base64 string.
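One way to guard against such trailers is to keep only the lines that consist entirely of base64 characters. This is just a sketch with a hypothetical function name, assuming Python 3 and the standard alphabet. A boundary line that happens to contain nothing but base64 characters would slip through.

```python
import base64
import re

# A line of standard-alphabet base64, optionally ending in one or two '='.
_B64_LINE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def decode_base64_lines(text: str) -> bytes:
    """Decode only the lines made up entirely of base64 characters."""
    kept = [line.strip() for line in text.splitlines()
            if _B64_LINE.fullmatch(line.strip())]
    return base64.b64decode("".join(kept))


sample = (
    "KV0KPj4Kc3RhcnR4cmVmCjY3MDEKJSVFT0YK\n"
    "--_=ic0008m4wtZ4TqBFd+sXC8--\n"
)
print(decode_base64_lines(sample))
```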
I ran into this problem as well and nothing worked.
I finally managed to find the solution which works for me. I had zipped content in base64 and this happened to 1 out of a million records...
This is a version of the solution suggested by Simon Sapin.
When len(data) % 4 is 3, I remove the last 3 characters.
Instead of "0gA1RD5L/9AUGtH9MzAwAAA=="
We get "0gA1RD5L/9AUGtH9MzAwAA"
missing_padding = len(data) % 4
if missing_padding == 3:
    data = data[0:-3]
elif missing_padding != 0:
    print("Missing padding : " + str(missing_padding))
    data += '=' * (4 - missing_padding)
data_decoded = base64.b64decode(data)
According to the answer to "Trailing As in base64", the reason is nulls. But I still have no idea why the encoder messes this up...
                cannot believe that worked and adding additional '='s didn't. Mine ended with "T4NCg==" and no amount of adding or subtracting '='s made any difference until I removed the 'g' on the end. I notice 'g' != 'A'
– rob
                Jun 29, 2021 at 16:29
Simply append "=" characters until the length is a multiple of 4 before you try decoding the target string value. Something like:
while len(value) % 4 != 0:  # pad up to a multiple of 4
    value = value + "="
req_str = base64.b64decode(value)
                This answer seems like it was supposed to be somewhere else, since there's no browser involved...?
– Jason Capriotti
                May 10, 2022 at 0:45