[issue23165] Heap overwrite in Python/fileutils.c:_Py_char2wchar() on 32 bit systems due to malloc parameter overflow

Guido Vranken Sun, 04 Jan 2015 08:51:06 -0800

New submission from Guido Vranken:

The vulnerability described here is exceedingly difficult to exploit, since 
there is no straightforward way for an "attacker" (someone who controls the 
contents of a Python script but not other values such as system environment 
variables) to control a relevant parameter of the vulnerable function 
(_Py_char2wchar in Python/fileutils.c). It is, however, important that it is 
remediated, since an author unaware of this vulnerability may unsuspectingly 
establish a link between user input and the function parameter in future 
versions of Python.


As mentioned, the vulnerability is caused by code in the _Py_char2wchar() 
function. This function is reached indirectly through 
Objects/unicodeobject.c:PyUnicode_DecodeLocaleAndSize(), 
PyUnicode_DecodeFSDefaultAndSize(), PyUnicode_DecodeLocale() and some other 
functions.

As far as I know this can only be exploited on 32-bit architectures (on which 
size computations wrap around at 2**32). The following description is based 
on the latest Python 3.4 code retrieved from 
https://hg.python.org/cpython .

The problem lies in the computation of the size of the buffer that will hold 
the wide char version of the input string:

--
Python/fileutils.c
--
 296 #ifdef HAVE_BROKEN_MBSTOWCS
 297     /* Some platforms have a broken implementation of
 298      * mbstowcs which does not count the characters that
 299      * would result from conversion.  Use an upper bound.
 300      */
 301     argsize = strlen(arg);
 302 #else
 303     argsize = mbstowcs(NULL, arg, 0);
 304 #endif
 ...
 ...
 306         res = (wchar_t *)PyMem_RawMalloc((argsize+1)*sizeof(wchar_t));

 and:

 331     argsize = strlen(arg) + 1;
 332     res = (wchar_t*)PyMem_RawMalloc(argsize*sizeof(wchar_t));

Neither invocation of PyMem_RawMalloc is preceded by code that ensures that 
multiplying the length of 'arg' by sizeof(wchar_t), which is typically 4 
bytes, cannot overflow. It follows that on a 32-bit architecture it is 
possible to cause an internal overflow by supplying a string whose length is 
>= (2**32 / 4) - 1, which is 1 gigabyte minus one byte. Supplying a string of 
exactly that length will therefore result in a value of 0 being passed to 
PyMem_RawMalloc, because:

        argsize = 1024*1024*1024-1
        malloc_argument = (argsize+1) * 4
        print(malloc_argument & 0xFFFFFFFF)
        # prints '0'
        
Effectively this will result in an allocation of exactly 1 byte, since a 
parameter of 0 is automatically adjusted to 1 by the underlying 
_PyMem_RawMalloc():

--
Objects/obmalloc.c
--
  51 static void *
  52 _PyMem_RawMalloc(void *ctx, size_t size)
  53 {
  54     /* PyMem_Malloc(0) means malloc(1). Some systems would return NULL
  55        for malloc(0), which would be treated as an error. Some platforms would
  56        return a pointer with no memory behind it, which would break pymalloc.
  57        To solve these problems, allocate an extra byte. */
  58     if (size == 0)
  59         size = 1;
  60     return malloc(size);
  61 }
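
The combined effect of the wraparound and the malloc(0) adjustment can be 
modelled in a few lines of Python. This is only a sketch: it assumes a 32-bit 
size_t and sizeof(wchar_t) == 4, and the helper names requested_size() and 
raw_malloc_size() are hypothetical, not part of CPython.

```python
# Model of the size computation on a 32-bit build.
# Assumptions: size_t is 32 bits wide and sizeof(wchar_t) == 4.
SIZE_T_MASK = 0xFFFFFFFF
SIZEOF_WCHAR_T = 4


def requested_size(argsize):
    """Value of (argsize+1)*sizeof(wchar_t) after 32-bit wraparound."""
    return ((argsize + 1) * SIZEOF_WCHAR_T) & SIZE_T_MASK


def raw_malloc_size(size):
    """Mirror the adjustment in _PyMem_RawMalloc(): a request of 0 becomes 1."""
    return 1 if size == 0 else size


argsize = 1024 * 1024 * 1024 - 1          # 1 GB minus one byte
needed = (argsize + 1) * SIZEOF_WCHAR_T   # bytes the conversion really needs
allocated = raw_malloc_size(requested_size(argsize))
print("needed:", needed, "allocated:", allocated)
```

The gap between 'needed' and 'allocated' is the amount of memory that later 
writes can run past.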


Once the memory has been allocated, mbstowcs() is invoked:

--
Python/fileutils.c
--

 306         res = (wchar_t *)PyMem_RawMalloc((argsize+1)*sizeof(wchar_t));
 307         if (!res)
 308             goto oom;
 309         count = mbstowcs(res, arg, argsize+1);

In my test setup (latest 32 bit Debian), mbstowcs returns '0', meaning no bytes 
were written to 'res'.

Then, 'res' is iterated over, and the iteration halts as soon as a 
null-wchar or a wchar that is a surrogate is encountered:

--
Python/fileutils.c
--

 310         if (count != (size_t)-1) {
 311             wchar_t *tmp;
 312             /* Only use the result if it contains no
 313                surrogate characters. */
 314             for (tmp = res; *tmp != 0 &&
 315                          !Py_UNICODE_IS_SURROGATE(*tmp); tmp++)
 316                 ;
 317             if (*tmp == 0) {
 318                 if (size != NULL)
 319                     *size = count;
 320                 return res;
 321             }
 322         }
 323         PyMem_RawFree(res);


Py_UNICODE_IS_SURROGATE is defined as follows:

--
Include/unicodeobject.h
--
 183 #define Py_UNICODE_IS_SURROGATE(ch) (0xD800 <= (ch) && (ch) <= 0xDFFF)

In the iteration over 'res', control is transferred back to the invoker of 
_Py_char2wchar() if a null-wchar is encountered first. If, however, a wchar 
that satisfies the expression in Py_UNICODE_IS_SURROGATE() is encountered 
first, *tmp is not null and thus the conditional code on lines 318-320 is 
skipped.

The space that 'res' points to is uninitialized. Uninitialized, however, does 
not entail randomness in this case. If an attacker has sufficient freedom to 
manipulate the contents of the process memory prior to calling 
_Py_char2wchar() in order to scatter it with values that satisfy 
Py_UNICODE_IS_SURROGATE(), this could increase their odds of having 
_Py_char2wchar() encounter such a value before a null-wchar. These kinds of 
details are very dependent on system architecture, operating system, libc 
implementation and so forth.
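
To make the two outcomes of the scan on lines 314-321 concrete, here is a 
small Python model. It is a sketch only: the list 'memory' stands in for 
whatever wchar_t values happen to follow 'res', and fast_path_outcome() is a 
hypothetical helper, not CPython code.

```python
def is_surrogate(ch):
    # Same range as the Py_UNICODE_IS_SURROGATE macro.
    return 0xD800 <= ch <= 0xDFFF


def fast_path_outcome(memory):
    """Walk wchar_t values until a null or a surrogate is found.

    'memory' is a list of ints modelling the (uninitialized) contents
    starting at 'res'.
    """
    for ch in memory:
        if ch == 0:
            return "return res"   # lines 317-320: result is used as-is
        if is_surrogate(ch):
            return "fall back"    # line 323 onward: free and re-convert
    return "model exhausted"      # the real loop would keep reading


print(fast_path_outcome([0x41, 0x42, 0]))       # null-wchar comes first
print(fast_path_outcome([0x41, 0xDC80, 0]))     # surrogate comes first
```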

The remainder of the function performs a byte-by-byte conversion in a loop to 
manually convert the entire input string. Especially relevant to this 
vulnerability are lines 332, 339, 356, 365 and 366:

On line 332 memory is allocated, effectively only 1 byte as explained above. 
'argsize', however, is 0x40000000 in our case, and the loop body is repeated 
until argsize is 0 or the end of the string is reached.
On line 339 one or more bytes are converted into a wide character and stored 
into 'out', which points into 'res'. Lines 356 and 366 likewise store into 
'out'.

--
Python/fileutils.c
--
 325     /* Conversion failed. Fall back to escaping with surrogateescape. */
 326 #ifdef HAVE_MBRTOWC
 327     /* Try conversion with mbrtwoc (C99), and escape non-decodable bytes. */
 328
 329     /* Overallocate; as multi-byte characters are in the argument, the
 330        actual output could use less memory. */
 331     argsize = strlen(arg) + 1;
 332     res = (wchar_t*)PyMem_RawMalloc(argsize*sizeof(wchar_t));
 333     if (!res)
 334         goto oom;
 335     in = (unsigned char*)arg;
 336     out = res;
 337     memset(&mbs, 0, sizeof mbs);
 338     while (argsize) {
 339         size_t converted = mbrtowc(out, (char*)in, argsize, &mbs);
 340         if (converted == 0)
 341             /* Reached end of string; null char stored. */
 342             break;
 343         if (converted == (size_t)-2) {
 344             /* Incomplete character. This should never happen,
 345                since we provide everything that we have -
 346                unless there is a bug in the C library, or I
 347                misunderstood how mbrtowc works. */
 348             PyMem_RawFree(res);
 349             if (size != NULL)
 350                 *size = (size_t)-2;
 351             return NULL;
 352         }
 353         if (converted == (size_t)-1) {
 354             /* Conversion error. Escape as UTF-8b, and start over
 355                in the initial shift state. */
 356             *out++ = 0xdc00 + *in++;
 357             argsize--;
 358             memset(&mbs, 0, sizeof mbs);
 359             continue;
 360         }
 361         if (Py_UNICODE_IS_SURROGATE(*out)) {
 362             /* Surrogate character.  Escape the original
 363                byte sequence with surrogateescape. */
 364             argsize -= converted;
 365             while (converted--)
 366                 *out++ = 0xdc00 + *in++;
 367             continue;
 368         }
 369         /* successfully converted some bytes */
 370         in += converted;
 371         argsize -= converted;
 372         out++;
 373     }
 374     if (size != NULL)
 375         *size = out - res;
 376 #else   /* HAVE_MBRTOWC */
 377     /* Cannot use C locale for escaping; manually escape as if charset
 378        is ASCII (i.e. escape all bytes > 128. This will still roundtrip
 379        correctly in the locale's charset, which must be an ASCII superset. */
 380     res = decode_ascii_surrogateescape(arg, size);
 381     if (res == NULL)
 382         goto oom;
 383 #endif   /* HAVE_MBRTOWC */
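
As a rough illustration of what the loop above writes, the following Python 
sketch models the escaping rule used on lines 356 and 366 and the worst-case 
number of bytes stored through 'out'. It assumes sizeof(wchar_t) == 4 and one 
output wchar_t per input byte plus the terminating null; both helper names 
are hypothetical.

```python
SIZEOF_WCHAR_T = 4  # assumption: typical value on Linux


def escape_byte(b):
    """Escape one undecodable byte as lines 356 and 366 do: 0xdc00 + byte."""
    return 0xDC00 + b


def worst_case_bytes_written(strlen):
    """Upper bound on bytes stored through 'out' for an input of 'strlen'
    bytes: one wchar_t per input byte, plus the terminating null."""
    return (strlen + 1) * SIZEOF_WCHAR_T


# The escaping rule matches Python's own surrogateescape error handler.
same = chr(escape_byte(0xFF)) == b"\xff".decode("ascii", "surrogateescape")
print("matches surrogateescape:", same)

# For the 1 GB (minus one byte) input, compare with the 1 byte allocated.
print("worst-case bytes written:", worst_case_bytes_written(2**30 - 1))
```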

Suffice it to say that this leads to writing to memory that has not been 
allocated, thereby making this a heap overflow vulnerability. 
decode_ascii_surrogateescape() seems to suffer from the same issue.
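
The kind of check that is missing before both PyMem_RawMalloc calls can be 
sketched as follows. This is a Python model of the arithmetic, not the 
contents of the attached patch; checked_alloc_size() is a hypothetical helper 
and the 32-bit limit is an assumption.

```python
SIZE_MAX_32 = 0xFFFFFFFF  # assumption: largest size_t value on a 32-bit build


def checked_alloc_size(argsize, elem_size=4, size_max=SIZE_MAX_32):
    """Return (argsize+1)*elem_size, refusing requests that would wrap.

    The comparison is done with a division so that the check itself
    cannot overflow when written in C.
    """
    if argsize > size_max // elem_size - 1:
        raise MemoryError("allocation size would overflow size_t")
    return (argsize + 1) * elem_size


print(checked_alloc_size(10))
try:
    checked_alloc_size(1024 * 1024 * 1024 - 1)
except MemoryError as exc:
    print("rejected:", exc)
```

In C the equivalent test would sit directly before each allocation and jump 
to the existing 'oom' label on failure.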

Guido Vranken,

Intelworks
http://www.intelworks.com/

----------
files: _py_char2wchar_patches.tar.gz
messages: 233424
nosy: Guido
priority: normal
severity: normal
status: open
title: Heap overwrite in Python/fileutils.c:_Py_char2wchar() on 32 bit systems 
due to malloc parameter overflow
type: security
versions: Python 3.2, Python 3.3, Python 3.4
Added file: http://bugs.python.org/file37597/_py_char2wchar_patches.tar.gz

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23165>
_______________________________________