[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177225#comment-16177225 ] Hudson commented on TIKA-2470: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1371 (See [h

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi, On 22/09/17 22:02, Eugene Kirpichov wrote: Sure - with hundreds of different file formats and the abundance of weird / malformed / malicious files in the wild, it's quite expected that sometimes the library will crash. Some kinds of issues are easier to address than others. We can catch exce

[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Konstantin Gribov (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177131#comment-16177131 ] Konstantin Gribov commented on TIKA-2470: - [~talli...@apache.org], speaking of 1.x

[jira] [Comment Edited] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177119#comment-16177119 ] Tim Allison edited comment on TIKA-2470 at 9/22/17 9:03 PM: [~g

[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177119#comment-16177119 ] Tim Allison commented on TIKA-2470: --- [~grossws] I used the JUL Logger in tika-core is thi

[jira] [Resolved] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2470. --- Resolution: Fixed Fix Version/s: 1.17 > Another Illegal reflective Access -- more cleanup for Ja

[jira] [Updated] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2470: -- Description: WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils (file:/C:/data/t

[jira] [Created] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2470: - Summary: Another Illegal reflective Access -- more cleanup for Java 9 Key: TIKA-2470 URL: https://issues.apache.org/jira/browse/TIKA-2470 Project: Tika Issue Type

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi Tim, All On 22/09/17 18:17, Allison, Timothy B. wrote: Y, I think you have it right. Tika library has a big problem with crashes and freezes I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1],

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Great. Thank you! -Original Message- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Friday, September 22, 2017 1:46 PM To: dev@tika.apache.org Subject: Re: TikaIO concerns [dropping Beam on this] Tim, another thing is that you can finally download the TREC-DD Polar data eithe

Re: TikaIO concerns

2017-09-22 Thread Chris Mattmann
[dropping Beam on this] Tim, another thing is that you can finally download the TREC-DD Polar data either from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described here: http://github.com/chrismattmann/trec-dd-polar/ In case we want to use as part of our regression. Cheers,

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests >>against this TB (thank you, Rackspace for hosting our vm!) to try to identify >>these problems. And if anyone with connections at a big company doing open source + cloud would be interested in floating us some stora

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Nice! Thank you! -Original Message- From: Ben Chambers [mailto:bchamb...@apache.org] Sent: Friday, September 22, 2017 1:24 PM To: d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns BigQueryIO allows a side-output for elements that failed to be inserted when using

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
BigQueryIO allows a side-output for elements that failed to be inserted when using the Streaming BigQuery sink: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92 This follows the pattern of

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Do tell... Interesting. Any pointers? -Original Message- From: Ben Chambers [mailto:bchamb...@google.com.INVALID] Sent: Friday, September 22, 2017 12:50 PM To: d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Regarding specifically elements that are failing --

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Y, I think you have it right. > Tika library has a big problem with crashes and freezes I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1], they will happen. We fix the problems or try to get our de

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
Regarding specifically elements that are failing -- I believe some other IO has used the concept of a "Dead Letter" side-output, where documents that failed to process are side-output so the user can handle them appropriately. On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov wrote: > Hi Tim, >
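
For readers skimming the thread, the "Dead Letter" idea mentioned above can be sketched in plain Java. This is only an illustration of the pattern, not Beam's side-output API and not anything TikaIO implements; the class name DeadLetterSketch, the Result holder and the processAll method are hypothetical names made up for this sketch. Inputs that parse go to one collection, and inputs that throw are set aside with their exception so the caller can decide what to do with them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of a dead-letter output; not a Beam or Tika API.
class DeadLetterSketch {

    /** Successful outputs plus, separately, the inputs that failed and why. */
    static final class Result<I, O> {
        final List<O> parsed = new ArrayList<>();
        final Map<I, Exception> deadLetters = new LinkedHashMap<>();
    }

    /** Applies the parser to every input; a failure never aborts the whole run. */
    static <I, O> Result<I, O> processAll(List<I> inputs, Function<I, O> parser) {
        Result<I, O> result = new Result<>();
        for (I input : inputs) {
            try {
                result.parsed.add(parser.apply(input));
            } catch (RuntimeException e) {
                // Route the failing input to the dead-letter output instead of rethrowing.
                result.deadLetters.put(input, e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Integer parsing stands in for a document parser in this sketch.
        Result<String, Integer> r =
                processAll(Arrays.asList("1", "x", "3"), s -> Integer.parseInt(s));
        System.out.println(r.parsed);               // [1, 3]
        System.out.println(r.deadLetters.keySet()); // [x]
    }
}
```

Note that this only covers failures that surface as exceptions. Hangs and OutOfMemoryErrors, which are also raised in this thread, would not be caught this way.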

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Reuven, Thank you! This suggests to me that it is a good idea to integrate Tika with Beam so that people don't have to 1) (re)discover the need to make their wrappers robust and then 2) have to reinvent these wheels for robustness. For kicks, see William Palmer's post on his toe-stubbing eff

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>> How will it work now, with new Metadata() passed to the AutoDetect parser, >> will this Metadata have a Metadata value per every attachment, possibly >> keyed by a name ? An example of how to call the RecursiveParserWrapper: https://github.com/apache/tika/blob/master/tika-example/src/main/ja

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi Tim Sorry for getting into the RecursiveParserWrapper discussion first, I was certain the time zone difference was on my side :-) How will it work now, with new Metadata() passed to the AutoDetect parser, will this Metadata have a Metadata value per every attachment, possibly keyed by a n

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Eugene: What's the best way to have Beam help us with these issues, or do these come for free with the Beam framework? 1) a process-level timeout (because you can't actually kill a thread in Java) 2) a process-level restart on OOM 3) avoid trying to reprocess a badly behaving document
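
As a rough sketch of item 1 above, a process-level timeout can be built from the standard ProcessBuilder and Process APIs (Java 8 and later). This is not how Tika or Beam implement it; the class name ProcessTimeoutSketch, the method runWithTimeout and the use of the POSIX sleep command as a stand-in for a slow parse are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a process-level timeout; not taken from Tika or Beam.
class ProcessTimeoutSketch {

    /**
     * Runs the command in a child process.
     * Returns true if it exited within the timeout, false if it had to be killed.
     */
    static boolean runWithTimeout(List<String> command, long timeoutMillis) {
        try {
            // inheritIO() avoids a child blocking on a full, unread output pipe.
            Process child = new ProcessBuilder(command).inheritIO().start();
            if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
            // Unlike a thread, a child process can be terminated from outside.
            child.destroyForcibly();
            child.waitFor();
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        // Assumes a POSIX "sleep" on the PATH as a stand-in for a hung parse.
        boolean finished = runWithTimeout(Arrays.asList("sleep", "30"), 500);
        System.out.println(finished ? "finished" : "killed after timeout");
    }
}
```

A caller that gets false back could record the offending file and skip it on later runs, which is one way to approach item 3. Restarting after an OOM (item 2) would need the parent to inspect the child's exit status as well, which this sketch does not do.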

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi, On 22/09/17 00:42, Eugene Kirpichov wrote: Hi, @Sergey: - I already marked TikaIO @Experimental, so we can make changes. OK, thanks - Yes, the String in KV is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion. Sure. If you don't mind then th

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish? Not at the moment, we’d have to do some coding on our end or within Beam. The format is a list of maps/dicts for each file. Each map