[issue33973] HTTP request-line parsing splits on Unicode whitespace

Tim Burke Tue, 26 Jun 2018 12:41:51 -0700

New submission from Tim Burke <tim.bu...@gmail.com>:

Request-line parsing in the Python 3 http.server splits on Unicode whitespace. 
This causes (admittedly buggy) clients that would work with a Python 2 server 
to stop working when the server upgrades to Python 3. To demonstrate, run 
`python2.7 -m SimpleHTTPServer 8027` in one terminal and `curl -v 
http://127.0.0.1:8027/你好` in another -- curl reports


    *   Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to 127.0.0.1 (127.0.0.1) port 8027 (#0)
    > GET /你好 HTTP/1.1
    > Host: 127.0.0.1:8027
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    * HTTP 1.0, assume close after body
    < HTTP/1.0 404 File not found
    < Server: SimpleHTTP/0.6 Python/2.7.10
    < Date: Tue, 26 Jun 2018 17:23:25 GMT
    < Content-Type: text/html
    < Connection: close
    <
    <head>
    <title>Error response</title>
    </head>
    <body>
    <h1>Error response</h1>
    <p>Error code 404.
    <p>Message: File not found.
    <p>Error code explanation: 404 = Nothing matches the given URI.
    </body>
    * Closing connection 0

...while repeating the experiment with `python3.6 -m http.server 8036` and 
`curl -v http://127.0.0.1:8036/你好` gives

    *   Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to 127.0.0.1 (127.0.0.1) port 8036 (#0)
    > GET /你好 HTTP/1.1
    > Host: 127.0.0.1:8036
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
            <title>Error response</title>
        </head>
        <body>
            <h1>Error response</h1>
            <p>Error code: 400</p>
            <p>Message: Bad request syntax ('GET /ä½\xa0å¥½ HTTP/1.1').</p>
            <p>Error code explanation: HTTPStatus.BAD_REQUEST - Bad request syntax or unsupported method.</p>
        </body>
    </html>
    * Connection #0 to host 127.0.0.1 left intact

Granted, a well-behaved client would have quoted the UTF-8 '你好' as 
'%E4%BD%A0%E5%A5%BD' (in which case everything would have behaved as expected), 
but RFC 7230 is pretty clear that the request-line should be SP-delimited. 
While it notes that "recipients MAY instead parse on whitespace-delimited word 
boundaries and, aside from the CRLF terminator, treat any form of whitespace as 
the SP separator", it goes on to say that "such whitespace includes one or more 
of the following octets: SP, HTAB, VT (%x0B), FF (%x0C), or bare CR" with no 
mention of characters like the non-breaking space (the octet 0xA0 decoded as 
ISO-8859-1; here it is the third byte of the UTF-8 encoding of '你') that 
caused the 400 response.
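
The difference can be reproduced without a server. The following is a minimal sketch, assuming the Python 3 server decodes the request line as ISO-8859-1 and then calls `str.split()` with no argument; that assumption matches the 400 message shown above but is not quoted from the http.server source. `RFC7230_WS` is a hypothetical name used only for illustration.

```python
import re
from urllib.parse import quote

# Request line as curl sent it above: the raw UTF-8 octets of '你好', unquoted.
raw = b"GET /\xe4\xbd\xa0\xe5\xa5\xbd HTTP/1.1\r\n"

# Assumption: the server decodes the line as ISO-8859-1 before splitting,
# which turns the octet 0xA0 into U+00A0 (NO-BREAK SPACE).
line = raw.decode("iso-8859-1").rstrip("\r\n")

# str.split() with no argument splits on every Unicode whitespace character,
# so the request target is cut in two and the line yields four words.
unicode_words = line.split()

# Splitting only on the whitespace octets RFC 7230 lists (SP, HTAB, VT, FF,
# bare CR) keeps the target intact and yields the expected three words.
RFC7230_WS = re.compile(r"[ \t\x0b\x0c\r]+")  # hypothetical helper name
rfc_words = RFC7230_WS.split(line)

# What a well-behaved client would have sent as the request target.
quoted_target = quote("/你好")

print(len(unicode_words), unicode_words)
print(len(rfc_words), rfc_words)
print(quoted_target)
```

With four words instead of three, the request line no longer looks like "method, target, version", which is consistent with the "Bad request syntax" response above.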

FWIW, there was a similar unicode-separators-are-not-the-right-separators bug 
in header parsing a while back: https://bugs.python.org/issue22233

----------
components: Library (Lib), Unicode
messages: 320507
nosy: ezio.melotti, tburke, vstinner
priority: normal
severity: normal
status: open
title: HTTP request-line parsing splits on Unicode whitespace
type: behavior
versions: Python 3.4, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue33973>
_______________________________________