Re: Integrating Tika with Apache Beam

Sergey Beryozkin Mon, 11 Sep 2017 07:54:43 -0700

Hi Tim

Thanks, the code, especially the part dealing with adapting the Tika events to the Beam pipeline, will most likely need to be improved :-). I've tried to make sure it all can be configured as much as possible (point to the location of the TikaConfig if needed, etc.), but it's only a start... I already see a typo in the TikaOptions doc for the minimum text length, time to create a new PR :-)


Cheers, Sergey
On 11/09/17 15:33, Allison, Timothy B. wrote:

What great news!  Thank you, Sergey!!!

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Monday, September 11, 2017 9:18 AM
To: Allison, Timothy B. <talli...@mitre.org>; dev@tika.apache.org
Subject: Re: Integrating Tika with Apache Beam

Hi Tim, All

It took some time, but the Beam TikaIO component is finally in its
2.2.0-SNAPSHOT master:

https://github.com/apache/beam/tree/master/sdks/java/io/tika

I've created a basic project which can help with running it quickly:

https://github.com/sberyozkin/beamTikaExample

One can just build it and run it as suggested in Readme.md; simply have some
files at hand (PDFs, for example) and point to one or all of them.

By default, Beam will output the data to /tmp/tika.

main() can be updated to support more options. They can be collected either
from the command line with TikaOptions:

https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java

(all options except "--input" are optional)

or set directly in the code; some variations are shown in the tests:

https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java

By default, TikaReader uses an internal queue to make the SAX events available to the
Beam pipeline, which is why you see options like "queuePollTime", etc.
If it's known that a given parser can only read the whole text in a single operation,
the process can be optimized with 'parseSynchronously'...
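
To illustrate the queue idea in isolation, here is a minimal sketch. It is not the TikaReader code: it uses the JDK SAX parser as a stand-in for a Tika parser, and the class name, the readText() helper, the END marker and the queuePollTimeMillis parameter are all hypothetical names chosen for this example. A background thread pushes the text from SAX events onto a queue, and the caller polls the queue with a timeout:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: a stand-in for the queue-based approach, not TikaIO code.
public class QueueingSaxSketch {
    // Hypothetical end-of-document marker; compared by identity, not by value.
    private static final String END = new String("END");

    /** Parses xml on a background thread and returns the text chunks in arrival order. */
    public static List<String> readText(String xml, long queuePollTimeMillis) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer side: every non-blank characters() event goes onto the queue.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    queue.add(text);
                }
            }
        };

        Thread parserThread = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            } catch (Exception e) {
                // Sketch only: a real reader would report this failure to the consumer.
            } finally {
                queue.add(END); // always signal completion so the consumer can stop
            }
        });
        parserThread.start();

        // Consumer side: wait up to the poll time for the next chunk of text.
        List<String> chunks = new ArrayList<>();
        while (true) {
            String next = queue.poll(queuePollTimeMillis, TimeUnit.MILLISECONDS);
            if (next == null) {
                continue; // nothing arrived within the poll time; try again
            }
            if (next == END) {
                break;
            }
            chunks.add(next);
        }
        parserThread.join();
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = readText("<doc><p>Hello</p><p>Beam</p></doc>", 50);
        System.out.println(String.join(" ", chunks));
    }
}
```

In this sketch the synchronous alternative would simply run the parse on the calling thread and collect the text without a queue, which only makes sense when the parser hands over all of the text in one go.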

One can also try to update main() in the example to do more interesting things
than just print the data :-).

Give it a try please if you get a chance, and help make TikaIO a major part of
Beam :-) with PRs, etc.

Thanks, Sergey

On 25/05/17 17:47, Sergey Beryozkin wrote:

Hi Guys

The link to the initial code is available in JIRA. At this stage the
focus is on preparing a solid initial PR, and then we can all improve
the Tika-related code :-)

Cheers, Sergey
On 24/05/17 11:41, Sergey Beryozkin wrote:

Hi Tim, All,

I thought I'd start a dedicated thread.

I added some initial comments to [1], and I'm quite close now to creating
the initial PR.

Thanks, Sergey

[1] https://issues.apache.org/jira/browse/BEAM-2328
On 23/05/17 17:42, Allison, Timothy B. wrote:

Another idea...if you have any interest, it would be great to get
Apache Beam set up on our Rackspace VM (with Spark?) and use it for
our regression tests?

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, May 19, 2017 4:21 PM
To: u...@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs

Hi Tim

Sure, once I get an initial PR ready I'll send an update and I'll
explain what I did for a start and we will discuss it further
