workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

Uwe Schindler (JIRA) Thu, 22 Jan 2015 09:48:53 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287850#comment-14287850
 ]


Uwe Schindler commented on TIKA-1526:
-------------------------------------

There is also a second problem: The bug is in the static \{\} initializer of 
the UNIXProcess class. So it happens only when the class is loaded first. If it 
was loaded correctly, the class is initialized with right settings and passes 
(in fact, also with turkish locale). But if it fails for the first time, 
UNIXProcess is broken for the whole lifetime of the JVM (also with good 
locales). Because UNIXProcess failed to initialize, the JVM marks it as broken 
and you get a NoClassDefFound error.

The problem does not happen on Linux, because the default value of the 
problematic system property is there initialized with some other value, that 
does not contain "i" which is affected by the famous upper/lowercasing bug: 
http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html

So if some other test is executing some external process before and has 
non-Turkish locale, also calls with turkish succeed. Because of that we test 
Lucene with all possible Locales set before the JVM is starting. We don't 
switch the locale actively during tests.

> ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so 
> Turkish Tika users can still use non-external parsers
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1526
>                 URL: https://issues.apache.org/jira/browse/TIKA-1526
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Hoss Man
>
> the JDK has numerous pain points regarding the Turkish locale, "posix_spawn" 
> lowercasing being one of them...
> https://bugs.openjdk.java.net/browse/JDK-8047340
> https://bugs.openjdk.java.net/browse/JDK-8055301
> As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is 
> enabled & configured by default in Tika, and uses ExternalParser.check to see 
> if tesseract is available -- but because of the JDK bug, this means that Tika 
> fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like 
> so...
> {noformat}
>   [junit4]    > Throwable #1: java.lang.Error: posix_spawn is not a supported 
> process launch mechanism on this platform.
>   [junit4]    >       at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
>   [junit4]    >       at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
>   [junit4]    >       at java.security.AccessController.doPrivileged(Native 
> Method)
>   [junit4]    >       at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
>   [junit4]    >       at java.lang.ProcessImpl.start(ProcessImpl.java:130)
>   [junit4]    >       at 
> java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   [junit4]    >       at java.lang.Runtime.exec(Runtime.java:620)
>   [junit4]    >       at java.lang.Runtime.exec(Runtime.java:485)
>   [junit4]    >       at 
> org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
>   [junit4]    >       at 
> org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
>   [junit4]    >       at 
> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
>   [junit4]    >       at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
>   [junit4]    >       at 
> org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
>   [junit4]    >       at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
>   [junit4]    >       at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
>   [junit4]    >       at 
> org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
>   [junit4]    >       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   [junit4]    >       at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> {noformat}
> ...unless they go out of their way to white list only the parsers they 
> need/want so TesseractOCRParser (and any other ExternalParsers) will never 
> even be check()ed.
> It would be nice if Tika's ExternalParser class added a similar 
> hack/workarround to what was done in SOLR-6387 to trap these types of errors. 
>  In Solr we just propogate a better error explaining why Java hates the 
> turkish langauge...
> {code}
> } catch (Error err) {
>   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") 
> || err.getMessage().contains("UNIXProcess"))) {
>     log.warn("Error forking command due to JVM locale bug (see 
> https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
>     return "(error executing: " + cmd + ")";
>   }
> }
> {code}
> ...but with Tika, it might be better for all ExternalParsers to just "opt 
> out" as if they don't recognize the filetype when they detect this type of 
> error fro m the check method (or perhaps it would be better if 
> AutoDetectParser handled this? ... i'm not really sure how it would best fit 
> into Tika's architecture)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

Reply via email to