mstrewe commented on PR #1456:
URL: 
https://github.com/apache/incubator-stormcrawler/pull/1456#issuecomment-2618568436

   > After a closer look into the code: the reason for the issue is likely in [line 398 of BasicURLNormalizer](https://github.com/apache/incubator-stormcrawler/blob/5e02c159fa0824c14015c5de12aea6e4b046c67c/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java#L398).
   > 
   > * a percent character is unconditionally converted to `%25` even if it's the first character of a valid percent-encoding
   > * the "basic" URL normalizers of [Nutch](https://github.com/apache/nutch/blob/b52ec9025e40152b3a1dae7c78bb803c7ad298ce/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java#L369) and [crawler-commons](https://github.com/crawler-commons/crawler-commons/blob/9f86fa3907194fb0e59f9d69218d22af09d8aec0/src/main/java/crawlercommons/filters/basic/BasicURLNormalizer.java#L570) treat the percent character separately and do not unconditionally escape it. All three "basic" URL normalizers share the same origin years ago, so they are still quite similar in their source code.
   
   I don't think so. 
   In the test URL, the path is first unescaped in line 146 and then escaped 
   again in line 147. Up to this point the encoding differs only in letter case 
   (the percent character has not yet been encoded again):
   
   ```
   // .../NjAxOA%3d%3d     - file
   String file2 = unescapePath(file);
   // .../NjAxOA==    - file2
   file2 = escapePath(file2);
   // .../NjAxOA%3D%3D   - file2
   ```
   
   So escaping and unescaping work as expected. 
   
   But since the hex digits are now upper case, the `equals` comparison (which 
   does not ignore case) fails and execution reaches line 152, which creates a 
   new URL from `file2`:
   ```
   urlToFilter = new URL(protocol, host, port, file2).toString();
   ```
   This line will then encode the percent character again.
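
   The round trip and the case-sensitive comparison described above can be 
   reproduced in isolation. This is only a minimal sketch: `RoundTripSketch`, 
   `unescape` and `escape` are hypothetical, simplified stand-ins (ASCII only) 
   and not the actual `unescapePath`/`escapePath` of `BasicURLNormalizer`, and 
   the path `/NjAxOA%3d%3d` is made up for illustration.

   ```java
   public class RoundTripSketch {

       // Hypothetical stand-in for unescapePath: decodes every valid %XX sequence (ASCII only).
       public static String unescape(String s) {
           StringBuilder sb = new StringBuilder();
           for (int i = 0; i < s.length(); i++) {
               char c = s.charAt(i);
               if (c == '%' && i + 2 < s.length()
                       && Character.digit(s.charAt(i + 1), 16) >= 0
                       && Character.digit(s.charAt(i + 2), 16) >= 0) {
                   sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                   i += 2; // skip the two hex digits just consumed
               } else {
                   sb.append(c);
               }
           }
           return sb.toString();
       }

       // Hypothetical stand-in for escapePath: percent-encodes everything except
       // ASCII letters, digits and "-._~/", always with upper-case hex digits.
       public static String escape(String s) {
           StringBuilder sb = new StringBuilder();
           for (char c : s.toCharArray()) {
               boolean keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                       || (c >= '0' && c <= '9') || "-._~/".indexOf(c) >= 0;
               if (keep) {
                   sb.append(c);
               } else {
                   sb.append(String.format("%%%02X", (int) c));
               }
           }
           return sb.toString();
       }

       public static void main(String[] args) {
           String file = "/NjAxOA%3d%3d";
           String file2 = escape(unescape(file));
           System.out.println(file2);                        // /NjAxOA%3D%3D
           System.out.println(file.equals(file2));           // false
           System.out.println(file.equalsIgnoreCase(file2)); // true
       }
   }
   ```

   Under these assumptions the two strings differ only in the case of the hex 
   digits, so a plain `equals` reports a change even though both forms encode 
   the same characters.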

