Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Nicholas DiPiazza Thu, 26 Nov 2020 10:00:53 -0800

I created a tika fork example I want to add to the documentation as well:
https://github.com/nddipiazza/tika-fork-parser-example


When we submit your fixes, we should update this example with
multi-threading.
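
A rough sketch of what a multi-threaded version could look like: a fixed pool of workers draining a shared queue of file paths. The names `ParseWorkers`, `drain`, and `parseStep` are made up for illustration, and the parse step is left as a placeholder rather than a real ForkParser call. This is a sketch under those assumptions, not what the example repository contains.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class ParseWorkers {

    /**
     * Drains the queue with a fixed number of worker threads and returns how
     * many files went through the parse step without throwing.
     * parseStep is a placeholder for the real parser call (hypothetical).
     */
    public static int drain(BlockingQueue<String> filesToParse, int threads,
                            Consumer<String> parseStep) {
        AtomicInteger parsed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String path;
                // poll() returns null once the queue is empty, which ends this worker.
                while ((path = filesToParse.poll()) != null) {
                    try {
                        parseStep.accept(path);
                        parsed.incrementAndGet();
                    } catch (RuntimeException e) {
                        // One bad document should not take the worker down.
                        System.err.println("Failed to parse " + path + ": " + e);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            // Assumed upper bound for the whole batch; tune for real workloads.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return parsed.get();
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(
                List.of("/data/a.pdf", "/data/b.docx", "/data/c.xlsx"));
        int count = drain(queue, 2, path -> System.out.println("parsing " + path));
        System.out.println("parsed " + count + " files");
    }
}
```

Catching the exception inside the loop keeps a single failing document from ending the worker, which matters when the goal is a low failure rate over a very large batch.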

On Thu, Nov 26, 2020 at 11:28 AM Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:

> Hey Luis,
>
> It is related because after your fixes I might be able to gain a
> significant performance advantage by switching to ForkParser.
> I would make great use of an example from someone who has set up a
> multi-threaded ForkParser processing program that can gracefully handle
> the huge onslaught that is my use case.
> But at this point, I doubt I'll switch from Tika Server anyway, because I
> invested some time creating a wrapper around it and it is performing very
> well.
>
> On Wed, Nov 25, 2020 at 8:23 PM Luís Filipe Nassif <lfcnas...@gmail.com>
> wrote:
>
>> Not what you asked but related :)
>>
>> Luis
>>
>> On Wed, Nov 25, 2020 at 11:20 PM, Luís Filipe Nassif <lfcnas...@gmail.com>
>> wrote:
>>
>> > I've made a few improvements to ForkParser performance in an internal
>> > fork. Will try to contribute them upstream...
>> >
>> > On Mon, Nov 23, 2020 at 12:05 PM, Nicholas DiPiazza <
>> > nicholas.dipia...@gmail.com> wrote:
>> >
>> >> I am attempting to parse tens of millions of office documents with
>> >> Tika: PDFs, docs, excels, xmls, etc. A wide assortment of types.
>> >>
>> >> Throughput is very important. I need to be able to parse these files in
>> >> a reasonable amount of time, but at the same time, accuracy is also
>> >> pretty important. I hope to have fewer than 10% of the parsed documents
>> >> fail. (And by fail I mean fail due to Tika stability, like a timeout
>> >> while parsing. I do not mean fail due to the document itself.)
>> >>
>> >> My question - how to configure Tika Server in a containerized
>> >> environment to maximize throughput?
>> >>
>> >> My environment:
>> >>
>> >>    - I am using Openshift.
>> >>    - Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory:
>> >>    *8 GiB to 10 GiB*.
>> >>    - I have 10 tika parsing pod replicas.
>> >>
>> >> On each pod, I run a java program where I have 8 parse threads.
>> >>
>> >> Each thread:
>> >>
>> >>    - Starts a single tika server process (in spawn child mode)
>> >>       - Tika server arguments: -s -spawnChild
>> >>       -maxChildStartupMillis 120000 -pingPulseMillis 500
>> >>       -pingTimeoutMillis 30000 -taskPulseMillis 500
>> >>       -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures
>> >>       -enableFileUrl
>> >>    - The thread will now continuously grab a file from the
>> >>    files-to-fetch queue and will send it to the tika server, stopping
>> >>    when there are no more files to parse.
>> >>
>> >> Each of these files is stored locally on the pod in a buffer, so the
>> >> local file optimization is used.
>> >>
>> >> The Tika web service it is using is:
>> >>
>> >> Endpoint: `/rmeta/text`
>> >> Method: `PUT`
>> >> Headers:
>> >>    - writeLimit = 32000000
>> >>    - maxEmbeddedResources = 0
>> >>    - fileUrl = file:///path/to/file
>> >>
>> >> Files are no larger than 100 MB, and the maximum number of bytes of
>> >> Tika text (writeLimit) is 32 MB.
>> >>
>> >> Each pod is parsing about 370,000 documents per day. I've been messing
>> >> with a ton of different attempts at settings.
>> >>
>> >> I previously tried to use the actual Tika "ForkParser" but the
>> >> performance was far worse than spawning tika servers. So that is why I
>> >> am using Tika Server.
>> >>
>> >> I don't hate the performance results of this... but I feel like I'd
>> >> better reach out and make sure there isn't someone out there who sanity
>> >> checks my numbers and is like "whoa, that's awful performance, you
>> >> should be getting xyz like me!"
>> >>
>> >> Is anyone doing something similar? If so, what settings did you end up
>> >> settling on?
>> >>
>> >> Also, I'm wondering if Apache Http Client would be causing any overhead
>> >> here when I am calling my Tika Server /rmeta/text endpoint. I am using
>> >> a shared connection pool. Would there be any benefit in, say, using a
>> >> unique HttpClients.createDefault() for each thread instead of sharing a
>> >> connection pool between the threads?
>> >>
>> >>
>> >> Cross-posted the question here as well:
>> >>
>> >> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
>> >>
>> >
>>
>
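
The request described in the original question above could be built roughly as follows. This sketch uses the JDK's built-in `java.net.http` client (Java 11+) rather than the Apache Http Client mentioned in the thread. The names `RmetaRequest` and `build`, the `http://localhost:9998` base URL, and the choice to send an empty body when `fileUrl` is set are all assumptions for illustration; the endpoint and header names are the ones listed in the thread.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.file.Path;

public class RmetaRequest {

    /**
     * Builds the PUT request for one local file. Nothing is sent here;
     * the caller would pass the result to HttpClient.send(...).
     */
    public static HttpRequest build(String serverBase, Path localFile) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverBase + "/rmeta/text"))
                // Header names and values as listed in the thread.
                .header("writeLimit", "32000000")
                .header("maxEmbeddedResources", "0")
                .header("fileUrl", localFile.toUri().toString())
                // Assumption: no request body is needed when fileUrl is set.
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Base URL is an assumption; point it at wherever the server listens.
        HttpRequest request = build("http://localhost:9998", Path.of("/path/to/file"));
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

With the JDK client, a single `HttpClient` instance can be shared by all parse threads; whether that behaves differently from a shared Apache Http Client connection pool is not something this sketch measures.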

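For anyone sanity-checking the figures in the thread, the arithmetic can be sketched as below. The 370,000 documents per pod per day, 8 threads, and 10 pods come from the original question; the class and method names are hypothetical, the 50,000,000 total is only a stand-in for "tens of millions", and the per-thread figure assumes all threads are equally busy.

```java
public class ThroughputCheck {

    static final double SECONDS_PER_DAY = 86_400.0;

    /** Average documents per second for one pod. */
    public static double docsPerSecond(long docsPerPodPerDay) {
        return docsPerPodPerDay / SECONDS_PER_DAY;
    }

    /** Average wall-clock seconds each thread spends per document. */
    public static double secondsPerDocPerThread(long docsPerPodPerDay, int threads) {
        return threads / docsPerSecond(docsPerPodPerDay);
    }

    /** Days needed to work through totalDocs at the given rate. */
    public static double daysToFinish(long totalDocs, long docsPerPodPerDay, int pods) {
        return (double) totalDocs / (docsPerPodPerDay * pods);
    }

    public static void main(String[] args) {
        long perPodPerDay = 370_000; // from the thread
        int threads = 8;             // from the thread
        int pods = 10;               // from the thread
        long total = 50_000_000L;    // hypothetical stand-in for "tens of millions"

        System.out.printf("docs/sec per pod: %.2f%n", docsPerSecond(perPodPerDay));
        System.out.printf("sec per doc per thread: %.2f%n",
                secondsPerDocPerThread(perPodPerDay, threads));
        System.out.printf("days for %d docs: %.1f%n",
                total, daysToFinish(total, perPodPerDay, pods));
    }
}
```

With the figures from the thread, this works out to roughly 4.3 documents per second per pod, or about 1.9 seconds per document per thread.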