Hi experts,

First-time poster here, so I may well not be providing enough of the right 
information. Please let me know what else I can send to help you help me 
with this.

I'm using separate deployments of Tomcat 9 on Linux (RedHat 7) and Windows for 
the same mature .war application.

Around Jan 2020 I found that uploads of ZIP files to the Linux Tomcat were 
getting corrupted. The Windows upload worked fine. After much digging I found 
this appears to relate to the file.encoding property.

Launching the Tomcat 9 service on Windows with "-Dfile.encoding=UTF-8" 
(overriding the default of Cp1252) causes the Windows upload to corrupt the 
data.

It would appear, therefore, that file.encoding is affecting binary file uploads 
and I do not think it should. With this set to UTF-8, I am observing that 
bytes which are not valid UTF-8 are being replaced with "ef bf bd" (the UTF-8 
encoding of U+FFFD, the "unknown character" replacement character).
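
To illustrate the substitution described above, here is a minimal Java sketch. 
It is not Tomcat's code, and the class and method names are made up for the 
example. It assumes the corruption comes from some bytes -> String -> bytes 
step using the default charset, which is an assumption and not something 
confirmed here. The input bytes are taken from the ZIP header in the capture 
further down.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Decode raw bytes into a String with the given charset, then encode back.
    // This mimics (hypothetically) binary data passing through a String.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    // Format bytes as space-separated lowercase hex, like "od -t x1".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Four bytes from the ZIP local file header in the capture.
        byte[] raw = {(byte) 0x89, 0x60, (byte) 0xb3, 0x52};

        // UTF-8: the lone bytes 0x89 and 0xb3 are malformed, so each
        // becomes U+FFFD, which encodes as ef bf bd.
        System.out.println(hex(roundTrip(raw, StandardCharsets.UTF_8)));
        // prints: ef bf bd 60 ef bf bd 52

        // ISO-8859-1 maps every byte to a character and back unchanged.
        System.out.println(hex(roundTrip(raw, StandardCharsets.ISO_8859_1)));
        // prints: 89 60 b3 52
    }
}
```

The UTF-8 result matches the bytes seen in the temp file, which is why a 
charset-dependent conversion somewhere in the upload path looks likely.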

Is there a way to address this?

I believe source .jsp files are utf-8 encoded and I deal with utf-8 in many 
parts of the application. I would rather add this encoding to the Windows 
deployments than use, e.g., -Dfile.encoding=ISO-8859-1 on Linux.

Note also: "If the draft JEP discussed in this post is implemented, the default 
charset for file contents will be changed to UTF-8 even for Windows."
Ref: https://dzone.com/articles/java-may-use-utf-8-as-its-default-charset 
(March 1st, 2018)

I've put some details / "evidence" below should you wish to read further.

Thank you,
Tim


This morning, with Tomcat 9.0.45, I again captured a tcpdump to show that the 
browser is sending the correct data. The temp file which Tomcat created prior 
to passing the stream to my application is corrupted.

Part of the tcpdump submission is:

------WebKitFormBoundary37kBaouQxD4aoug5
Content-Disposition: form-data; name="file.ob_filename"; filename="MEP.zip"
Content-Type: application/x-zip-compressed

PK.........`.R................tbl_Evidence.csv.Zks.H..........[.=y.Do/..a.`......
 .T......i..{..$c......3X.Q..<y.d..&.|:.....&|..Q"....y(r...(  ....O....G....
;..Q,.q..e.&......P$.X..0*.3<T.K....O.........m<..8..b....|%.E...2...e^.......H}.F.|;.W+.....(

Captured with -X, this reads:
        0x0230:  6e61 6d65 3d22 4d45 502e 7a69 7022 0d0a  name="MEP.zip"..
        0x0240:  436f 6e74 656e 742d 5479 7065 3a20 6170  Content-Type:.ap
        0x0250:  706c 6963 6174 696f 6e2f 782d 7a69 702d  plication/x-zip-
        0x0260:  636f 6d70 7265 7373 6564 0d0a 0d0a 504b  compressed....PK
        0x0270:  0304 1400 0808 0800 8960 b352 0000 0000  .........`.R....
        0x0280:  0000 0000 0000 0000 1000 0000 7462 6c5f  ............tbl_
        0x0290:  4576 6964 656e 6365 2e63 7376 bd5a 6b73  Evidence.csv.Zks
        0x02a0:  e248 b2fd bebf a2c2 11b7 db8e 5b06 3d79  .H..........[.=y
        0x02b0:  f444 6f2f c6b8 61c6 6016 b9c7 b113 8e20  .Do/..a.`.......

The temp file shows:
$ od -t x1 upload_5e216399_71ab_4273_b38b_0410583a4edb_00000024.tmp | head
0000000 50 4b 03 04 14 00 08 08 08 00 ef bf bd 60 ef bf
0000020 bd 52 00 00 00 00 00 00 00 00 00 00 00 00 10 00
0000040 00 00 74 62 6c 5f 45 76 69 64 65 6e 63 65 2e 63
0000060 73 76 ef bf bd 5a 6b 73 ef bf bd 48 ef bf bd ef
0000100 bf bd ef bf bd ef bf bd ef bf bd ef bf bd 11 ef

As you may notice comparing this line with the first line of the od output:
        0x0270:  0304 1400 0808 0800 8960 b352 0000 0000  .........`.R....

The "89" and "b3" bytes (no doubt invalid UTF-8 sequences) have each been 
replaced with "ef bf bd". This is repeated later for each subsequent invalid 
byte.
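
The claim that "89" and "b3" are not valid UTF-8 can be checked with a strict 
decoder. A minimal sketch follows; the class and method names are hypothetical, 
and it uses only the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Bytes 0x0278..0x027b of the capture: 89 60 b3 52
        byte[] fromCapture = {(byte) 0x89, 0x60, (byte) 0xb3, 0x52};
        System.out.println(isValidUtf8(fromCapture)); // prints: false

        // The plain ASCII file name inside the ZIP is valid UTF-8.
        byte[] name = "tbl_Evidence.csv".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isValidUtf8(name)); // prints: true
    }
}
```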

In case this is relevant, I'm using Amazon's Corretto JDK 11.0.4 (64-bit) on 
Linux (11.0.7 now on Windows) but I've observed this problem with JDK8 and I 
can't say when it started. I know it worked a few years ago on Linux and 
Windows, but can't dig out the version information for then.
NB: I have just updated to JDK 11.0.11 and it made no difference.

My extensive, repeated and varied searches merely confirm that my HTML is OK 
and that the form submission is as intended. Maybe the process for reading the 
data is out of date, but it works fine on Windows (Java is meant to be a 
"write once, run anywhere" language) and all the debugging I do shows that the 
data is corrupt before my application sees it.

My JVM property file.encoding = UTF-8 on Linux and was Cp1252 on Windows.
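
For anyone wanting to compare the two deployments, the values above can be 
read from inside the JVM with a few lines. This is a small sketch with a 
hypothetical class name; it would need to run in the same JVM as Tomcat (for 
example from a test JSP or servlet) to reflect the service's actual settings.

```java
import java.nio.charset.Charset;

public class DefaultCharsetReport {
    // Builds a one-line summary of the JVM's charset-related settings.
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Output depends on the platform and on any -Dfile.encoding override.
        System.out.println(report());
    }
}
```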

--
Tim Scott
OCLC * Senior Software Engineer / Technical Product Manager
CityGate, 8 St. Mary's Gate, Sheffield S1 4LW, UK
