[
https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633487#comment-14633487
]
Tim Allison edited comment on TIKA-1238 at 7/20/15 3:34 PM:
------------------------------------------------------------
The stacktrace is related to my original problem, but actually shows an
inconsistency in POI's handling of {{UnsupportedEncodingException}}. POI has a
try-catch block for that exception only on the first choice for guessing the
7-bit encoding. The second and third choices take whatever value could be
pulled out of the header or the HTML meta http-equiv and call
{{set7BitEncoding(charset)}} without the try-catch block.
Turns out another problem is that, of course, {{Charset.forName()}} can throw
an {{UnsupportedCharsetException}} (not {{UnsupportedEncodingException}})...so
that's not even checked for in POI's code. And, while we're defending against
trying to create a charset from whatever value we find in msg/html headers or
codepoint values, we should also add {{IllegalCharsetNameException}} to the
catch block...or just go for {{IllegalArgumentException}}, which both of those
extend, and be done with it. :)
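As a minimal sketch of the distinction above (not POI's or Tika's actual code; the {{CharsetProbe}} class and its {{probe}} method are hypothetical names used only for illustration), the following shows which unchecked exception {{Charset.forName()}} raises for each kind of bad input:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical illustration class; not part of POI or Tika.
final class CharsetProbe {

    /** Reports how Charset.forName() reacts to the given name. */
    static String probe(String name) {
        try {
            Charset.forName(name);
            return "ok";
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return "illegal-name";
        } catch (UnsupportedCharsetException e) {
            // The name is syntactically legal, but this JVM has no such charset.
            return "unsupported";
        } catch (IllegalArgumentException e) {
            // Common supertype of both exceptions above; also thrown for a null name.
            return "null-name";
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("UTF-8"));
        System.out.println(probe("bad name!"));
        System.out.println(probe("x-no-such-charset-1238"));
        System.out.println(probe(null));
    }
}
```

None of these are checked exceptions, so a {{catch (UnsupportedEncodingException e)}} around such a call would not intercept any of them.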
As an immediate fix at the Tika level, we can duplicate POI's
{{guess7BitEncoding}} but add the try-catch blocks. I'll open an issue in
POI's bugtracker, though, to fix this at the POI level too.
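A rough sketch of the guarded approach, under stated assumptions: {{EncodingGuess}} and {{firstUsable}} are hypothetical names, the candidate strings stand in for whatever values are pulled from the message, and this does not reproduce POI's {{guess7BitEncoding}} logic. It only shows every candidate getting the same try-catch, with a fall-through to the next choice:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the POI or Tika implementation.
final class EncodingGuess {

    /**
     * Returns the first candidate name that resolves to a charset,
     * or the fallback if none of them does.
     */
    static Charset firstUsable(Charset fallback, String... candidates) {
        for (String candidate : candidates) {
            if (candidate == null || candidate.trim().isEmpty()) {
                continue; // nothing was found for this choice
            }
            try {
                return Charset.forName(candidate.trim());
            } catch (IllegalArgumentException e) {
                // Covers IllegalCharsetNameException and
                // UnsupportedCharsetException; move on to the next choice.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // First value is garbage, second is missing, third is usable.
        Charset cs = firstUsable(StandardCharsets.US_ASCII,
                "bogus charset", null, "ISO-8859-1");
        System.out.println(cs.name());
    }
}
```

Whether to fall back silently or log the rejected value is a design choice this sketch leaves open.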
Test files will be very helpful. If you can share, please do.
was (Author: [email protected]):
The stacktrace is related to my original problem, but actually shows an
inconsistency in POI's handling of {{UnsupportedEncodingException}}. POI has a
try-catch block for that exception only on the first choice for guessing 7 bit
encoding. The second and third choice take whatever value could be pulled out
of the header or the html meta-equiv and {{set7BitEncoding(charset)}} without
the try-catch block.
As an immediate fix at the Tika level, we can duplicate POI's
{{guess7BitEncoding}} but add the try-catch blocks. I'll open an issue in
POI's bugtracker, though, to fix this at the POI level too.
Test files will be very helpful. If you can share, please do.
> Update OutlookExtractor to handle codepage identification more rigorously
> -------------------------------------------------------------------------
>
> Key: TIKA-1238
> URL: https://issues.apache.org/jira/browse/TIKA-1238
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.10
>
>
> Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has
> added more robust capabilities for identifying codepages in Outlook .msg
> files. As a first step to integrating those improvements, I'll copy and
> paste some of POI's code into OutlookExtractor. As a second step, I'll
> expose more of HSMF's capabilities within POI and then factor out the
> duplicate code in Tika.