New submission from Géry <gery.o...@gmail.com>:

The Python library documentation of the `urllib.parse.urlunparse 
<https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunparse>`_ 
and `urllib.parse.urlunsplit 
<https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunsplit>`_ 
functions states:

    This may result in a slightly different, but equivalent URL, if the URL
    that was parsed originally had unnecessary delimiters (for example, a ? with
    an empty query; the RFC states that these are equivalent).

So for the URI <http://example.com/?>, the round trip drops the trailing "?"::

    >>> import urllib.parse
    >>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/?"))
    'http://example.com/'
    >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/?"))
    'http://example.com/'

But `RFC 3986 <https://tools.ietf.org/html/rfc3986?#section-6.2.3>`_ states the 
exact opposite:

    Normalization should not remove delimiters when their associated component
    is empty unless licensed to do so by the scheme specification.  For example,
    the URI "http://example.com/?" cannot be assumed to be equivalent to any of
    the examples above.  Likewise, the presence or absence of delimiters within
    a userinfo subcomponent is usually significant to its interpretation.  The
    fragment component is not subject to any scheme-based normalization; thus,
    two URIs that differ only by the suffix "#" are considered different
    regardless of the scheme.

So maybe `urllib.parse.urlunparse` ∘ `urllib.parse.urlparse` and 
`urllib.parse.urlunsplit` ∘ `urllib.parse.urlsplit` are not supposed to be used 
for `syntax-based normalization 
<https://tools.ietf.org/html/rfc3986?#section-6>`_ of URIs. But still, both 
`urllib.parse.urlparse` and `urllib.parse.urlsplit` lose the "delimiter + empty 
component" information of the URI string, so they report URIs that are not 
equivalent as equal::

    >>> import urllib.parse
    >>> urllib.parse.urlparse("http://example.com/?") == urllib.parse.urlparse("http://example.com/")
    True
    >>> urllib.parse.urlsplit("http://example.com/?") == urllib.parse.urlsplit("http://example.com/")
    True
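
For comparison, here is a minimal sketch of a split/unsplit pair that keeps the "delimiter + empty component" information. It assumes the regular expression given in RFC 3986, Appendix B and the recomposition steps of RFC 3986, Section 5.3. The names `split_uri` and `unsplit_uri` are hypothetical helpers written for illustration; they are not part of `urllib.parse`, and this is not a proposed patch.

```python
import re

# Regular expression from RFC 3986, Appendix B. Groups 2, 4, 5, 7 and 9 hold
# the scheme, authority, path, query and fragment components.
_URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Hypothetical helper: split a URI reference into five components.

    A component whose delimiter is absent is returned as None; a component
    whose delimiter is present but whose content is empty is returned as "".
    """
    match = _URI_RE.match(uri)  # every part is optional, so this always matches
    return match.group(2, 4, 5, 7, 9)


def unsplit_uri(parts):
    """Hypothetical helper: recompose the components (RFC 3986, Section 5.3)."""
    scheme, authority, path, query, fragment = parts
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:  # an empty query still gets its "?" back
        result += "?" + query
    if fragment is not None:  # an empty fragment still gets its "#" back
        result += "#" + fragment
    return result


print(split_uri("http://example.com/?"))  # ('http', 'example.com', '/', '', None)
print(split_uri("http://example.com/"))   # ('http', 'example.com', '/', None, None)
print(unsplit_uri(split_uri("http://example.com/?")))  # http://example.com/?
```

With None standing for "undefined" and "" for "empty", the two URIs no longer compare equal after splitting, and the round trip returns the original string.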

P.S. Is there a function for syntax-based normalization of URIs in the Python 
standard library?

----------
components: Library (Lib)
messages: 350663
nosy: Jeremy.Hylton, maggyero, orsenthil
priority: normal
severity: normal
status: open
title: urllib.parse functions reporting false equivalent URIs
type: behavior
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37969>
_______________________________________