mstrewe opened a new issue, #1448:
URL: https://github.com/apache/incubator-stormcrawler/issues/1448

   The BasicUrlNormalizer will encode links if they are not already URL  
encoded. 
   
   The bug occurs when a URL contains percent-encoded characters with lowercase 
hex digits, such as `'/Exhibitions/Detail/NjAxOA%3d%3d'`. (The URL 
`'/Exhibitions/Detail/NjAxOA%3D%3D'` is not affected.)
   
   In BasicUrlNormalizer.java, lines 145-150, the file part of the URL is 
unescaped and then escaped again. After that, the original file and the 
unescaped-and-re-escaped file are compared. The comparison will be 
   
   `Exhibitions/Detail/NjAxOA%3d%3d == Exhibitions/Detail/NjAxOA%3D%3D`   
(capital D), so the two strings are treated as different.
   
   After that, the original source URL is recreated (line 154), which results 
in `'Exhibitions/Detail/NjAxOA%253D%253D'`.
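The round trip described above can be reproduced with a minimal sketch. The class `RoundTripDemo` and its `unescape` and `escape` helpers are hypothetical, ASCII-only stand-ins written for this illustration; they are not the normalizer's own methods. The last line of `main` only shows how escaping an already-escaped path would yield the `%25` form; whether that is the exact code path in the normalizer is an assumption.

```java
class RoundTripDemo {

    // Hypothetical stand-in: decodes each %XX sequence, accepting
    // both uppercase and lowercase hex digits (ASCII only).
    static String unescape(String path) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()
                    && Character.digit(path.charAt(i + 1), 16) >= 0
                    && Character.digit(path.charAt(i + 2), 16) >= 0) {
                sb.append((char) Integer.parseInt(path.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hypothetical stand-in: percent-encodes everything except letters,
    // digits and "-._~/", always emitting uppercase hex digits.
    static String escape(String path) {
        StringBuilder sb = new StringBuilder();
        for (char c : path.toCharArray()) {
            if (Character.isLetterOrDigit(c) || "-._~/".indexOf(c) >= 0) {
                sb.append(c);
            } else {
                sb.append(String.format("%%%02X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String file = "/Exhibitions/Detail/NjAxOA%3d%3d";
        String file2 = escape(unescape(file));
        System.out.println(file2);              // round-tripped path
        System.out.println(file.equals(file2)); // case-sensitive comparison
        System.out.println(escape(file2));      // escaping an escaped path
    }
}
```

With the lowercase input, the round trip yields uppercase hex digits, so the case-sensitive comparison reports a difference even though both strings encode the same characters.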
   
   
   
   This can be fixed if the statement in line 148
   
   ```
    if (!file.equals(file2)) {
    ```
   
   is changed to 
   
   ```
    if (!file.toLowerCase().equals(file2.toLowerCase())) {
   ```
   
   The case of the hex digits in a percent-encoded sequence should not matter, 
but currently it does.
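As a possible variation on the proposed change, the comparison could ignore case only inside percent-encoded sequences rather than across the whole path. The following is a sketch under that assumption; the class `EscapeCaseCompare` and both of its methods are hypothetical names, not part of the existing normalizer.

```java
class EscapeCaseCompare {

    // Uppercases only the two hex digits that follow a '%',
    // leaving every other character untouched.
    static String upperCaseEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c);
            if (c == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                sb.append(Character.toUpperCase(s.charAt(i + 1)));
                sb.append(Character.toUpperCase(s.charAt(i + 2)));
                i += 2; // skip the two hex digits just copied
            }
        }
        return sb.toString();
    }

    // True when the two paths differ at most in the case of their
    // percent-escape hex digits.
    static boolean sameIgnoringEscapeCase(String a, String b) {
        return upperCaseEscapes(a).equals(upperCaseEscapes(b));
    }
}
```

With such a helper, the check would read `if (!EscapeCaseCompare.sameIgnoringEscapeCase(file, file2))`. Unlike lowercasing both strings, this keeps ordinary path characters case-sensitive, so `/Detail` and `/detail` would still count as different.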
   

