mstrewe commented on PR #1456: URL: https://github.com/apache/incubator-stormcrawler/pull/1456#issuecomment-2618568436
> After a closer look into the code: the reason for the issue is likely in [line 398 of BasicURLNormalizer](https://github.com/apache/incubator-stormcrawler/blob/5e02c159fa0824c14015c5de12aea6e4b046c67c/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java#L398).
>
> * a percent character is unconditionally converted to `%25` even if it's the first character of a valid percent-encoding
> * the "basic" URL normalizers of [Nutch](https://github.com/apache/nutch/blob/b52ec9025e40152b3a1dae7c78bb803c7ad298ce/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java#L369) and [crawler-commons](https://github.com/crawler-commons/crawler-commons/blob/9f86fa3907194fb0e59f9d69218d22af09d8aec0/src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java#L570) treat the percent character separately and do not unconditionally escape it. All three "basic" URL normalizers share the same origin years ago, so they are still quite similar in their source code.

I don't think so. For the URL used in the test, the path is first unescaped in line 146 and then escaped again in line 147. Up to this point, the two encodings differ only in the case of the hex digits (the percent character has not yet been encoded again):

```
// .../NjAxOA%3d%3d - file
String file2 = unescapePath(file);
// .../NjAxOA== - file2
file2 = escapePath(file2);
// .../NjAxOA%3D%3D - file2
```

So escaping and unescaping work as expected. But because the hex digits are now upper case, `equals` (which does not ignore case) reports a difference, and execution reaches line 152, which creates a new URL from `file2`:

```
urlToFilter = new URL(protocol, host, port, file2).toString();
```

This line will then encode the percent character again.
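The case-sensitivity point in the comment can be reproduced in isolation. The following is a minimal sketch, not the StormCrawler code: the class `PercentCaseDemo` and its methods `unescape`, `escape`, and `upperCaseTriplets` are hypothetical stand-ins for the normalizer's helpers, they handle ASCII only, and the set of characters left unescaped is an assumption. It shows that an unescape/escape round trip changes `%3d` to `%3D`, so a plain `equals` reports a difference even though the two paths are equivalent under RFC 3986, which treats the hex digits of a percent-encoding as case-insensitive.

```java
public class PercentCaseDemo {

    private static boolean isHex(char c) {
        return Character.digit(c, 16) != -1;
    }

    // Hypothetical stand-in for unescapePath: decodes every valid %XX triplet (ASCII only).
    public static String unescape(String path) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                sb.append((char) Integer.parseInt(path.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hypothetical stand-in for escapePath: keeps letters, digits and "/-._~",
    // and writes everything else as %XX with upper-case hex digits.
    public static String escape(String path) {
        StringBuilder sb = new StringBuilder();
        for (char c : path.toCharArray()) {
            boolean keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || "/-._~".indexOf(c) >= 0;
            if (keep) {
                sb.append(c);
            } else {
                sb.append(String.format("%%%02X", (int) c));
            }
        }
        return sb.toString();
    }

    // Upper-cases only the hex digits of valid %XX triplets, leaving the rest of the path as is.
    public static String upperCaseTriplets(String path) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                sb.append('%')
                  .append(Character.toUpperCase(path.charAt(i + 1)))
                  .append(Character.toUpperCase(path.charAt(i + 2)));
                i += 2;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String file = "/path/NjAxOA%3d%3d";
        String file2 = escape(unescape(file));
        System.out.println(file2);
        // Case-sensitive comparison sees a change although only the hex case differs.
        System.out.println(file.equals(file2));
        // Comparing after normalizing the triplet case sees no change.
        System.out.println(upperCaseTriplets(file).equals(file2));
    }
}
```

Comparing against `upperCaseTriplets(file)` rather than using `equalsIgnoreCase` on the whole string keeps the rest of the path case-sensitive. Whether such a comparison would be appropriate in the normalizer itself is left open here.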