This is an automated email from the ASF dual-hosted git repository.

joemcdonnell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git


The following commit(s) were added to refs/heads/master by this push:
     new fae42323d IMPALA-14144: Make pip_download.py more tolerant with PEP 503 simple pages
fae42323d is described below

commit fae42323da6791958ddf5506219012ecd9492bab
Author: Laszlo Gaal <[email protected]>
AuthorDate: Fri Jun 13 23:01:14 2025 +0200

    IMPALA-14144: Make pip_download.py more tolerant with PEP 503 simple pages
    
    Recent package updates on PyPI have introduced package description
    pages that have extra newlines in addition to the newline character
    separating the complete URLs for the different package versions.
    These extra newlines usually show up before the closing angle bracket
    character ('>') of the opening half of the anchor tag.
    
    This broke pip_download.py, because it uses a regex to crack out
    various data items (file name, download path, hash algorithm and hash
    value) from the download page. The regex attempts to match the whole
    anchor element up to and including the closing '</a>' tag, which fails
    because '.' in a regex matches any character except a newline. This
    failure causes all lines in the package descriptor page to be rejected
    as not matching the search pattern, so a package whose page is in this
    format can never be recognized.
    
    This patch works around this formatting issue by adding the flag
    re.DOTALL to the regex search call, making the regex '.' character match
    the newline as well, so that the regex can match the complete anchor
    element even when it spans a line break.
    
    Change-Id: Ia56f87c54e0d9cad97b7e0ffbcce8f4c0f715c44
    Reviewed-on: http://gerrit.cloudera.org:8080/23026
    Reviewed-by: Joe McDonnell <[email protected]>
    Reviewed-by: Michael Smith <[email protected]>
    Tested-by: Joe McDonnell <[email protected]>
---
 infra/python/deps/pip_download.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/infra/python/deps/pip_download.py b/infra/python/deps/pip_download.py
index f9d442f23..c32cf1fe0 100755
--- a/infra/python/deps/pip_download.py
+++ b/infra/python/deps/pip_download.py
@@ -102,7 +102,7 @@ def get_package_info(pkg_name, pkg_version, is_canceled=None):
   pkg_info = subprocess.check_output(
       ["wget", "-q", "-O", "-", url], universal_newlines=True)
   regex = r'<a .*?href=\".*?packages/(.*?)#(.*?)=(.*?)\".*?>(.*?)<\/a>'
-  for match in re.finditer(regex, pkg_info):
+  for match in re.finditer(regex, pkg_info, flags=re.DOTALL):
     path = match.group(1)
     hash_algorithm = match.group(2)
     digest = match.group(3)
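
The effect of the flag can be reproduced outside the script. The following is a minimal sketch, not part of the commit: the regex is copied from the hunk above, while the page fragment, the host name files.example.invalid, and the names REGEX, PAGE, without_flag and with_flag are made up for illustration. The fragment assumes the newline sits right before the closing '>' of the opening tag, as the commit message describes.

```python
import re

# Pattern copied from pip_download.py; the backslashes before '"' and '/'
# are redundant in a regex but harmless.
REGEX = r'<a .*?href=\".*?packages/(.*?)#(.*?)=(.*?)\".*?>(.*?)<\/a>'

# Hypothetical PEP 503 page fragment (not real PyPI data): each anchor has a
# newline just before the '>' that closes its opening tag.
PAGE = (
    '<a href="https://files.example.invalid/packages/ab/cd/demo-1.0.tar.gz#sha256=0123abcd"\n'
    '>demo-1.0.tar.gz</a>\n'
    '<a href="https://files.example.invalid/packages/ef/gh/demo-1.1.tar.gz#sha256=4567ef01"\n'
    '>demo-1.1.tar.gz</a>\n'
)

# Old behavior: '.' stops at the newline, so no anchor can be matched.
without_flag = [m.groups() for m in re.finditer(REGEX, PAGE)]

# Patched behavior: re.DOTALL lets '.' cross the newline inside the tag.
with_flag = [m.groups() for m in re.finditer(REGEX, PAGE, flags=re.DOTALL)]

print("without re.DOTALL:", without_flag)
for path, hash_algorithm, digest, file_name in with_flag:
    print("with re.DOTALL:", path, hash_algorithm, digest, file_name)
```

On this fragment the first search finds nothing, while the second one yields both anchors with their path, hash algorithm, digest and file name groups. The non-greedy '.*?' quantifiers keep each match confined to a single anchor even though '.' may now span lines.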
