[ 
https://issues.apache.org/jira/browse/IMPALA-14144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17974023#comment-17974023
 ] 

Laszlo Gaal commented on IMPALA-14144:
--------------------------------------

The problem is caused by some of the package-specific pages being formatted in 
a peculiar way:
* pages for packages that could be found have a single newline (LF, 0x0a) 
character before the opening angle bracket of the anchor tag, e.g. the 
{{future}} package
* pages for packages that our script fails to find have, in addition to that 
first newline, an extra newline before the closing angle bracket of the 
opening anchor tag, e.g. the {{hdfs}} or {{kazoo}} package
This can be verified with a hex dump from {{xxd}}, or just by looking at the 
page source and convincing yourself that the extra line break before the close 
of the anchor tag is not a soft line wrap caused by a narrow browser window.
Unfortunately the second newline is not visible in a normal browser view: HTML 
treats it as plain whitespace and, because it sits inside the non-visible part 
of the anchor tag, it is not rendered.
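
For anyone who prefers a scripted check over {{xxd}}, here is a minimal sketch 
that lists the opening anchor tags containing a line break. The two page 
fragments, the host name and the hash values are invented for illustration and 
only mimic the layout described above; the helper name is hypothetical too.

```python
import re

# Hypothetical page fragments modelled on the description above; the URLs,
# file names and hash values are made up for illustration.
PAGE_OK = (
    '\n<a href="https://example.invalid/packages/aa/future-1.0.tar.gz'
    '#sha256=abc">future-1.0.tar.gz</a><br/>'
)
PAGE_BROKEN = (
    '\n<a href="https://example.invalid/packages/bb/hdfs-1.0.tar.gz'
    '#sha256=def"\n>hdfs-1.0.tar.gz</a><br/>'
)


def opening_tags_with_newline(page):
    """Return the opening <a ...> tags that contain a line break."""
    # [^>]* also matches newlines (unlike '.'), so it spans the whole
    # opening tag even when the tag is split across two lines.
    opening_tags = re.findall(r'<a\s[^>]*>', page)
    return [tag for tag in opening_tags if '\n' in tag]


if __name__ == "__main__":
    for name, page in (("ok", PAGE_OK), ("broken", PAGE_BROKEN)):
        # repr() makes the otherwise invisible '\n' show up in the output.
        print(name, [repr(tag) for tag in opening_tags_with_newline(page)])
```

An empty list means every opening tag sits on a single line; any entry points 
at a tag that a line-oriented pattern would miss.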

{{pip_download.py}} uses a regex on line 104 to capture the whole anchor tag, 
including the displayed filename and the closing {{</a>}} tag, from the output 
produced by 
{code}subprocess.check_call(["wget", ......], .., universal_newlines=True){code}
The extra newline before the closing bracket of the opening {{<a .... >}} tag 
puts the rest of the element, including the closing {{</a>}} tag, on a new line, 
so the regex search never sees the opening and the closing tag of the anchor on 
the same line. The '.' in the regex pattern matches anything _except_ the 
newline character, so the search cannot find a single match even if it reads 
the whole file for the package.
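
The effect can be reproduced in isolation with the sketch below. The pattern is 
a simplified stand-in, not the actual regex from {{pip_download.py}}; it only 
keeps the relevant property, namely that '.' is used to span the opening tag, 
the link text and the closing tag. The sample strings are invented. Passing 
{{re.DOTALL}} is shown as one possible way to let '.' cross the line break; it 
is not necessarily the fix that will be applied to the script.

```python
import re

# Simplified stand-in for the real pattern (an assumption for illustration).
PATTERN = r'<a .*?href="(.*?)".*?>(.*?)</a>'

# Invented samples: the same anchor on one line, and split by a newline
# just before the '>' that closes the opening tag.
ONE_LINE = '<a href="pkg-1.0.tar.gz#sha256=abc">pkg-1.0.tar.gz</a>'
SPLIT = '<a href="pkg-1.0.tar.gz#sha256=abc"\n>pkg-1.0.tar.gz</a>'


def find_links(page, flags=0):
    """Return (href, link text) pairs found in the page."""
    return re.findall(PATTERN, page, flags)


# By default '.' stops at '\n', so the split anchor is never matched.
default_one_line = find_links(ONE_LINE)
default_split = find_links(SPLIT)

# With re.DOTALL '.' also matches '\n', so the split anchor is found.
dotall_split = find_links(SPLIT, re.DOTALL)

if __name__ == "__main__":
    print(default_one_line)
    print(default_split)
    print(dotall_split)
```

An alternative with the same effect would be to replace the '.' that spans the 
inside of the opening tag with a negated class such as {{[^>]}}, which matches 
newlines without changing the behaviour of the rest of the pattern.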

> pip_download.py fails to download several packages from pypi.org
> ----------------------------------------------------------------
>
>                 Key: IMPALA-14144
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14144
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Laszlo Gaal
>            Priority: Blocker
>
> infra/python/deps/pip_download.py runs at the start of buildall.sh to ensure 
> that all the Python requirements can be installed into the Impala virtualenv 
> used by the test framework. This download was implemented to download the 
> packages in multiple parallel streams.
> Recently this downloader has started failing: it reports a complete download 
> failure for several packages, e.g. {{hdfs}}, {{impyla}}, {{bitarray}} and 
> several others.
> The failure is not caused by a network communication problem, as the same 
> packages from the same repo can be successfully downloaded with a browser.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
