sebastian-nagel commented on PR #1456: URL: https://github.com/apache/incubator-stormcrawler/pull/1456#issuecomment-2618458250
After a closer look into the code: the reason for the issue is likely in [line 398 of BasicURLNormalizer](https://github.com/apache/incubator-stormcrawler/blob/5e02c159fa0824c14015c5de12aea6e4b046c67c/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java#L398).

- a percent character is unconditionally converted to `%25` even if it's the first character of a valid percent-encoding
- the "basic" URL normalizers of [Nutch](https://github.com/apache/nutch/blob/b52ec9025e40152b3a1dae7c78bb803c7ad298ce/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java#L369) and [crawler-commons](https://github.com/crawler-commons/crawler-commons/blob/9f86fa3907194fb0e59f9d69218d22af09d8aec0/src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java#L570) treat the percent character separately and do not unconditionally escape it.

All three "basic" URL normalizers share a common origin from years ago, so their source code is still quite similar.
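
To illustrate the difference, here is a minimal sketch of percent handling that leaves valid percent-encodings alone. It is not the code of any of the three normalizers: the class `PercentEscapeSketch` and the method `escapeStrayPercent` are hypothetical names for this example, and a "valid percent-encoding" is assumed to mean a `%` followed by exactly two hexadecimal digits. With unconditional escaping, `/a%20b` would become `/a%2520b`; the sketch keeps it as `/a%20b` and only escapes a stray `%`.

```java
public class PercentEscapeSketch {

    // True if c is an ASCII hexadecimal digit (0-9, a-f, A-F).
    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    // Hypothetical helper: escapes '%' as "%25" only when it does not
    // start a valid percent-encoding ('%' followed by two hex digits).
    public static String escapeStrayPercent(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != '%') {
                sb.append(c);
            } else if (i + 2 < input.length()
                    && isHexDigit(input.charAt(i + 1))
                    && isHexDigit(input.charAt(i + 2))) {
                // Valid percent-encoding: copy the three characters unchanged.
                sb.append(input, i, i + 3);
                i += 2;
            } else {
                // Stray percent character: escape it.
                sb.append("%25");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayPercent("/a%20b")); // prints /a%20b
        System.out.println(escapeStrayPercent("/100%"));  // prints /100%25
    }
}
```

The sketch does not change the case of hex digits or decode unreserved characters. A full normalizer may do both, but that is outside what the comment above describes.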