sebastian-nagel commented on code in PR #1456:
URL: https://github.com/apache/incubator-stormcrawler/pull/1456#discussion_r1931751832


##########
core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java:
##########
@@ -300,6 +300,22 @@ void testNonStandardPercentEncoding() throws MalformedURLException {
         assertEquals(expectedURL, normalizedUrl, "Failed to filter query string");
     }
 
+    // https://github.com/apache/incubator-stormcrawler/issues/1448
+    @Test
+    void testProperURLEncodingWithLowerCase() throws MalformedURLException {
+        URLFilter urlFilter = createFilter(queryParamsToFilter);
+        String urlWithEscapedCharacters = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";
+        String expectedResult = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";

Review Comment:
   Shouldn't the expected result be `%3D%3D`?
   
   Uppercase hexadecimal digits are the canonical representation of percent-encoded characters, as defined in [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.1).
   
   If case variants of percent-encoded characters remain in URLs, the same resource can show up under several distinct URLs and cause duplicates. Note that besides the all-lowercase variant, the mixed-case forms `%3d%3D` and `%3D%3d` are also possible.
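
To illustrate the normalization I have in mind, here is a minimal sketch that uppercases the hex digits of every percent-encoded triplet. The class and method names (`PercentEncodingCase`, `upperCaseEscapes`) are made up for this comment and are not part of `BasicURLNormalizer`; how this would be wired into the existing filter is left open.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not StormCrawler API: shows the RFC 3986 section 6.2.2.1
// case normalization of percent-encoded triplets only.
class PercentEncodingCase {

    // A percent sign followed by exactly two hex digits in any case.
    private static final Pattern ESCAPED = Pattern.compile("%([0-9A-Fa-f]{2})");

    static String upperCaseEscapes(String url) {
        Matcher m = ESCAPED.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Only the two hex digits change; the encoded byte value stays the same.
            String upper = "%" + m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(upper));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

With something along these lines, `%3d%3d`, `%3d%3D` and `%3D%3d` would all map to `%3D%3D`, so the test could assert a single expected URL for every case variant. Sequences that are not valid escapes (for example `%zz`) are left untouched by this sketch.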



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@stormcrawler.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
