sebastian-nagel commented on PR #1456:
URL: 
https://github.com/apache/incubator-stormcrawler/pull/1456#issuecomment-2618458250

   After a closer look at the code: the likely cause of the issue is 
[line 398 of 
BasicURLNormalizer](https://github.com/apache/incubator-stormcrawler/blob/5e02c159fa0824c14015c5de12aea6e4b046c67c/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java#L398).
   - a percent character is unconditionally converted to `%25`, even if it is the 
first character of a valid percent-encoding
   - the "basic" URL normalizers of 
[Nutch](https://github.com/apache/nutch/blob/b52ec9025e40152b3a1dae7c78bb803c7ad298ce/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java#L369)
 and 
[crawler-commons](https://github.com/crawler-commons/crawler-commons/blob/9f86fa3907194fb0e59f9d69218d22af09d8aec0/src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java#L570)
 treat the percent character separately and do not unconditionally escape it. 
All three "basic" URL normalizers share a common origin from years ago, so their 
source code is still quite similar.
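
A minimal sketch of what "treating the percent character separately" could look like: escape `%` as `%25` only when it is not followed by two hexadecimal digits. This is not taken from any of the three code bases; the class and method names are hypothetical and only illustrate the idea.

```java
// Hypothetical illustration, not the StormCrawler, Nutch or crawler-commons code.
class PercentEscapeSketch {

    // True if c is an ASCII hexadecimal digit.
    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    // Escapes '%' as "%25" only when it does not start a valid "%XX" triplet.
    public static String escapeStrayPercents(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '%') {
                boolean validTriplet = i + 2 < input.length()
                        && isHexDigit(input.charAt(i + 1))
                        && isHexDigit(input.charAt(i + 2));
                // Keep the '%' of an existing percent-encoding, escape a stray one.
                sb.append(validTriplet ? "%" : "%25");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "%20" is kept as is, the stray '%' before "space" is escaped.
        System.out.println(escapeStrayPercents("/path%20with%space"));
    }
}
```

With an unconditional conversion, an already encoded `%20` would become `%2520`; the check above leaves it untouched.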


