sebastian-nagel commented on code in PR #1456:
URL:
https://github.com/apache/incubator-stormcrawler/pull/1456#discussion_r1931751832
##########
core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java:
##########
@@ -300,6 +300,22 @@ void testNonStandardPercentEncoding() throws MalformedURLException {
assertEquals(expectedURL, normalizedUrl, "Failed to filter query string");
}
+ // https://github.com/apache/incubator-stormcrawler/issues/1448
+ @Test
+ void testProperURLEncodingWithLowerCase() throws MalformedURLException {
+ URLFilter urlFilter = createFilter(queryParamsToFilter);
+ String urlWithEscapedCharacters = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";
+ String expectedResult = "http://www.example.com/Exhibitions/Detail/NjAxOA%3d%3d";
Review Comment:
Shouldn't the expected result be `%3D%3D`?
Uppercase hex digits are the canonical representation of percent-encoded
characters defined in [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.1).
If case variants of percent-encoded characters remain in URLs, equivalent URLs
are treated as distinct, which may cause duplicates. Note that in addition to
the pure lowercase variant, there could also be the mixed-case variants
`%3d%3D` and `%3D%3d`.
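
To illustrate the case normalization meant here, a minimal sketch follows. It is not the `BasicURLNormalizer` code. The class `PercentEncodingCase` and the method `upperCaseEscapes` are hypothetical names, and the sketch only uppercases the hex digits of existing percent-encoded triplets, without decoding or re-encoding anything.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of StormCrawler.
final class PercentEncodingCase {

    // Matches one percent-encoded triplet such as %3d or %3D.
    private static final Pattern ESCAPED = Pattern.compile("%([0-9A-Fa-f]{2})");

    /** Uppercases the hex digits of each percent-encoded triplet and leaves all other characters untouched. */
    static String upperCaseEscapes(String url) {
        Matcher m = ESCAPED.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String upper = "%" + m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(upper));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/Exhibitions/Detail/NjAxOA";
        // All four case variants should print the same URL ending in %3D%3D.
        for (String suffix : new String[] {"%3d%3d", "%3d%3D", "%3D%3d", "%3D%3D"}) {
            System.out.println(upperCaseEscapes(base + suffix));
        }
    }
}
```

With a step like this, all four case variants collapse to a single URL, so the test could assert `%3D%3D` as the expected result.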