https://issues.apache.org/bugzilla/show_bug.cgi?id=52211

Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #9 from Yegor Kozlov <ye...@dinom.ru> ---
It is very likely that your hypothesis is correct and that this line of code
can cause problems.

The problematic piece of code has existed since POI-3.5, when OpenXml4j was
contributed to Apache POI.
I guess the intention was to ensure that the string being parsed and validated
is in the ASCII encoding.
This "worked" for years, but the conversion does not make sense: if the
input argument contains characters above the ASCII range, they are converted to
U+FFFD (the Unicode replacement character) and the subsequent validation
against the patternMediaType regex fails.

Consider the following examples:

(a) new ContentType("text/\u007E") 
(b) new ContentType("text/\u0080") 

The first case (a) works because all characters in the input string are ASCII,
so the conversion leaves the input string unchanged.
The second case (b) fails whether or not the input argument is re-converted to
US-ASCII. If you apply your fix (contentTypeASCII=contentType), the regex check
at line 146 rejects "text/\u0080" directly. The current code first converts the
input string to "text/\uFFFD", and the regex then rejects that instead.
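
A minimal sketch of the two cases, for illustration only. The class and method
names are hypothetical and not part of POI; the platformDefault parameter
stands in for the JVM default encoding that the no-argument getBytes() would
use, and ISO-8859-1 is assumed here so that the result matches the
"text/\uFFFD" described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
final class AsciiRoundTrip {

    // Mimics new String(s.getBytes(), "US-ASCII"), with the platform default
    // encoding passed in explicitly instead of being read from the JVM.
    static String roundTrip(String s, Charset platformDefault) {
        byte[] bytes = s.getBytes(platformDefault);
        // Bytes that are not valid US-ASCII are decoded to U+FFFD.
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        Charset assumedDefault = StandardCharsets.ISO_8859_1; // assumption

        String a = roundTrip("text/\u007E", assumedDefault);
        String b = roundTrip("text/\u0080", assumedDefault);

        // (a) is pure ASCII, so the round trip leaves it unchanged.
        System.out.println("(a) unchanged: " + a.equals("text/\u007E"));
        // (b) has its non-ASCII character replaced by U+FFFD.
        System.out.println("(b) replaced:  " + b.equals("text/\uFFFD"));
    }
}
```

With a UTF-8 default the character in (b) is encoded as two bytes, so the
round trip would yield two replacement characters rather than one. Either way
the string no longer matches what the caller passed in.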

So I agree that this conversion is unnecessary and can be removed. The fix is
coming soon.

Regards,
Yegor

(In reply to comment #8)
> Hello,
> 
> We are using the POI API (stable 3.8) on a system running ibm500 encoding as
> default encoding.
> So we got the same error, when trying to create a Workbook using
> WorkbookFactory.create(ByteArrayInputStream bais).
> 
> We found that the problem lies in the method
> org.apache.poi.openxml4j.opc.internal.ContentType.ContentType(String
> contentType)
> 
> In line 139, the following code is called:
> contentTypeASCII = new String(contentType.getBytes(), "US-ASCII");
> 
> The call to String.getBytes() returns the bytes in the default system
> encoding (for instance ibm500). These bytes are then decoded as US-ASCII.
> This cannot work.
> 
> So we wonder why this conversion is done at all.
> 
> We deleted the line and just put following code:
> contentTypeASCII = contentType;
> 
> Afterwards it worked fine.
> 
> Regards
> Constantin
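
The effect Constantin describes can be reproduced without changing the JVM
default encoding. The sketch below is illustrative only: the class and method
names are hypothetical, the platformDefault parameter stands in for the
system default encoding, and the IBM500 charset is used only if the running
JDK provides it. The isAscii helper is one possible encoding-independent
check and is an assumption, not the fix that was committed to POI.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
final class DefaultEncodingDemo {

    // Emulates the conversion at line 139, with the platform default
    // encoding passed in explicitly.
    static String asLine139(String contentType, Charset platformDefault) {
        return new String(contentType.getBytes(platformDefault),
                StandardCharsets.US_ASCII);
    }

    // Checks for pure ASCII without going through any byte encoding.
    static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String contentType = "text/xml";

        // On an ASCII-compatible default the conversion is a no-op.
        String latin1 = asLine139(contentType, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 unchanged: " + latin1.equals(contentType));

        // IBM500 is an EBCDIC code page and may be absent from some JDKs.
        if (Charset.isSupported("IBM500")) {
            String ebcdic = asLine139(contentType, Charset.forName("IBM500"));
            System.out.println("IBM500 unchanged:     " + ebcdic.equals(contentType));
        }

        // The check itself does not depend on the default encoding.
        System.out.println("is ASCII:             " + isAscii(contentType));
    }
}
```

Under an EBCDIC default even a pure-ASCII content type such as "text/xml" is
altered by the conversion, which is why removing the line fixes the problem
on such systems.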
