-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Thorsten,

On 11/26/18 08:45, Thorsten Schöning wrote:
> Hi all,
>
> I'm currently testing migration of a legacy web app from Tomcat 7
> to 8 to 8.5 and ran into problems regarding character encoding in
> 8.5 only. That app uses JSP pages and declares all of those to be
> stored in UTF-8, does really do so :-), and declares an HTTP
> Content-Type of "text/html; charset=UTF-8" as well. Textual content
> at HTML-level is properly encoded using UTF-8 and looks correct in
> the browser etc.
>
> In Tomcat 8.5 the following is introducing encoding problems,
> though:
>
>> <jsp:include page="/WEB-INF/jsp/includes/search.jsp">
>>   <jsp:param name="chooseSearchInputTitle" value="Benutzer wählen" />
>> </jsp:include>
>
> "search.jsp" simply outputs the value of the param as the "title"
> attribute of some HTML-link and the character "ä" is replaced
> somewhere with the Unicode character REPLACEMENT CHARACTER (U+FFFD).
> But really only in Tomcat 8.5, not in 8 and not in 7.

Have you been able to determine if the problem is on input or output?

> I can fix that problem using either "SetCharacterEncodingFilter"
> or the following line, which simply results in the same I guess:
>
>> <% request.setCharacterEncoding("UTF-8"); %>

FYI the SetCharacterEncodingFilter only modifies request encoding and
not response encoding. Also, it only changes the encoding of the
request *body* (e.g. PUT/POST), and not the encoding used to decode
the URI. That's configured in <Connector>'s URIEncoding. There is also
useBodyEncodingForURI, which makes the URI inherit the request body's
encoding if one is present. I recommend using
useBodyEncodingForURI="true".

I recommend *always* using SetCharacterEncodingFilter, since web
browsers both habitually refuse to send a correct Content-Type and
often use UTF-8 in URLs in violation of the HTTP spec. The result is
essentially that everything works the way you *want* it to work,
except that you just have to "hope" it works instead of being able to
prove that it will.

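To illustrate how a U+FFFD like the one described above can appear when the encoding and decoding sides disagree on the charset, here is a minimal sketch. It is not Tomcat's or Jasper's code and not a diagnosis of this particular case: it uses the JDK's java.net.URLEncoder and java.net.URLDecoder as stand-ins, and the class name EncodingMismatchDemo and the helper roundTrip are hypothetical names made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; stands in for "one side encodes, another decodes".
class EncodingMismatchDemo {

    // Percent-encode a value with one charset, then decode it with another,
    // as happens when producer and consumer disagree about the encoding.
    public static String roundTrip(String value, String encodeWith, String decodeWith) {
        try {
            String encoded = URLEncoder.encode(value, encodeWith);
            return URLDecoder.decode(encoded, decodeWith);
        } catch (UnsupportedEncodingException e) {
            // Only reached if a charset name is unknown to the JVM.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // "Benutzer wählen", written with an escape so the source file's
        // own encoding cannot interfere with the demonstration.
        String title = "Benutzer w\u00e4hlen";

        // Same charset on both sides: the value survives unchanged.
        System.out.println(title.equals(roundTrip(title, "UTF-8", "UTF-8")));

        // Encoded as ISO-8859-1 (%E4), decoded as UTF-8: the lone byte 0xE4
        // is not valid UTF-8, so the decoder substitutes U+FFFD.
        String broken = roundTrip(title, "ISO-8859-1", "UTF-8");
        System.out.println(broken.indexOf('\uFFFD') >= 0);

        // Encoded as UTF-8 (%C3%A4), decoded as ISO-8859-1: no replacement
        // character, but two wrong characters instead of one "ä".
        String mojibake = roundTrip(title, "UTF-8", "ISO-8859-1");
        System.out.println(mojibake.length() - title.length());
    }
}
```

Comparing the percent-encoded form on the sending side with the decoded value on the receiving side in this way is one approach to telling an input problem from an output problem.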
> Looking at the generated Java code for the JSP I get the
> following:
>
>> org.apache.jasper.runtime.JspRuntimeLibrary.include(request,
>> response, "/WEB-INF/jsp/includes/search.jsp" + "?" +
>> org.apache.jasper.runtime.JspRuntimeLibrary.URLEncode("chooseSearchInputTitle",
>> request.getCharacterEncoding())+ "=" +
>> org.apache.jasper.runtime.JspRuntimeLibrary.URLEncode("Benutzer wählen",
>> request.getCharacterEncoding()), out, false);
>
> The "ä" is properly encoded using UTF-8 in all versions of Tomcat
> and the generated code seems to be the same in all versions as
> well, especially regarding "request.getCharacterEncoding()".
>
> "getCharacterEncoding" in Tomcat 8.5 has changed, the former
> implementation didn't take the context into account:
>
>> @Override
>> public String getCharacterEncoding() {
>>     String characterEncoding = coyoteRequest.getCharacterEncoding();
>>     if (characterEncoding != null) {
>>         return characterEncoding;
>>     }
>>
>>     Context context = getContext();
>>     if (context != null) {
>>         return context.getRequestCharacterEncoding();
>>     }
>>
>>     return null;
>> }

This is just a fall-back for when there is no character encoding
defined in the request (because the browser didn't send one).

> My connector in server.xml is configured to use "URIEncoding" as
> UTF-8 in all versions of Tomcat, but that doesn't make a difference
> to 8.5. So I understand that using "setCharacterEncoding", I set
> the value actually used in the generated Java now, even though the
> following is documented for character encoding filter:
>
>> Note that the encoding for GET requests is not set here, but on a
>> Connector
>
> https://tomcat.apache.org/tomcat-8.5-doc/config/filter.html#Set_Character_Encoding_Filter/Introduction
>
> Now I'm wondering about multiple things...
>
> 1. Doesn't "getCharacterEncoding" provide the encoding of the
> HTTP-body?

Yes, but it comes directly from the browser, which often doesn't
provide it.
There is no encoding-detection going on, so it's often "null" or
ISO-8859-1, which is the spec-defined default.

> My JSP is called using GET and the Java quoted above seems to build
> a query string as well. So why does it depend on some body encoding
> instead of e.g. URIEncoding of the connector?

Good question. Might be a bug, here.

> 2. Is my former approach wrong or did changes in Tomcat 8.5
> introduce some regression? There is some conversion somewhere which
> was not present in the past.

Tomcat 8.5 follows the servlet spec, which in v4.0 added the
<web-app><request-character-encoding> to make things even more fun.
Actually, this can replace the use of the SetCharacterEncodingFilter.
Thanks for pointing this out; I wasn't aware of this feature of the
4.0 spec.

> 3. What is the correct fix I need now? The character encoding
> filter, even though it only applies to bodies per documentation?

Try setting <request-character-encoding> in your <web-app> like this:

web.xml
- -------
<web-app>
  <request-character-encoding>UTF-8</request-character-encoding>
</web-app>

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlv8DEYACgkQHPApP6U8
pFjbihAAuX3vNtHpJ2qLpIofvz83wFbCxyVsgnRPGIQsqT/wxskOizwkKCmxnITc
pYEJHOEjF5U+C9QJtyC4iPz/Dj9MOfk8986NZ/9bhxFuGJsAifO1HKZ2vTvf9dYD
s5yAPJryQYaShgiDRPopYDgCOWi6a9mQMjvQeYclQjFAOa3MWMa4tlnKD2mOL4GQ
X/PuUiKA97XMmj6LZTwh9dGJwU2Fi6LlWOIXXP2qAB8RmcfIlDr20/m1OKg4l0Z3
dVzbD0rWM7tNCtDhnybclamdKv+apDJGS3NtTHzScXlqT51EdUiKup+mTJbaRncD
okL9MKlGLZYe5ankTGHaNH5P4BfhSv1BUYwiTXpUMgVpuAl5AMxEwu5ZHdoyeSJm
+B27/RLXMFue25Qtni6op06ssJGjQZyR5AxAN4qO/k3eTJUzAp5tLiJlbpJbMIzd
fEiL2kIkvIeHUE6Iz39deaWsFqu6m1hweSGcTXsvky0mEi20QZ9Pa+1E9UTvii20
HL0h/MxKlfJFc7yXmLU2SpTho4lTLUIMD57XOuYPQTkHBcW0QoHJLSCymANx/wpv
OdPjXsqGDBAKWteRTaB7caqU0Fb+Z3UHA8PUIjT4sPW88uHkRGA5XRLMWWlXe+Cx
DVwykOEkBaKXLWzZ51R+cYoWEWKtbR0pzEW+dA9JEMClWMrovkg=
=pfKy
-----END PGP SIGNATURE-----