sebastian-nagel commented on code in PR #1456:
URL: https://github.com/apache/incubator-stormcrawler/pull/1456#discussion_r1931751832

##########
core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java:
##########
@@ -300,6 +300,22 @@ void testNonStandardPercentEncoding() throws MalformedURLException {
         assertEquals(expectedURL, normalizedUrl, "Failed to filter query string");
     }
+    // https://github.com/apache/incubator-stormcrawler/issues/1448
+    @Test
+    void testProperURLEncodingWithLowerCase() throws MalformedURLException {
+        URLFilter urlFilter = createFilter(queryParamsToFilter);
+        String urlWithEscapedCharacters = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";
+        String expectedResult = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";

Review Comment:
   Shouldn't the expected result be `%3D%3D`? Uppercase hex digits are the canonical representation of percent-encoded characters as defined in [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.1). If case variants of percent-encoded characters remain in URLs, the same resource can show up under several distinct URLs, which may cause duplicates. Note that in addition to the pure lowercase variant, the mixed-case variants `%3d%3D` and `%3D%3d` could also occur.
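
For illustration, here is a minimal sketch of the case normalization described in RFC 3986, section 6.2.2.1. The class `PercentCaseSketch` and the helper `upperCasePercentEscapes` are hypothetical names and not part of StormCrawler's `BasicURLNormalizer`. The sketch only upper-cases the hex digits of existing percent-encoded triplets; it does not decode or re-encode anything.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PercentCaseSketch {

    // Matches one percent-encoded triplet, e.g. %3d, %3D or %e2.
    private static final Pattern ESCAPE = Pattern.compile("%[0-9A-Fa-f]{2}");

    // Hypothetical helper: upper-cases the hex digits of every triplet
    // and leaves all other characters of the URL untouched.
    static String upperCasePercentEscapes(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String upper = m.group().toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(upper));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/Exhibitions/Detail/NjAxOA";
        // All case variants collapse to the same string ending in %3D%3D.
        for (String suffix : new String[] {"%3d%3d", "%3d%3D", "%3D%3d", "%3D%3D"}) {
            System.out.println(upperCasePercentEscapes(base + suffix));
        }
    }
}
```

With a normalization along these lines, the test's expected result would be `http://www.example.com/Exhibitions/Detail/NjAxOA%3D%3D` for every case variant of the input, so the variants can no longer produce duplicates.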