I created a Tika ForkParser example that I also want to add to the documentation: https://github.com/nddipiazza/tika-fork-parser-example
When we submit your fixes, we should update this example with multi-threading.

On Thu, Nov 26, 2020 at 11:28 AM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> Hey Luis,
>
> It is related, because after your fixes I might be able to gain a
> significant performance advantage by switching to ForkParser.
> I would make great use of an example from someone who has set up a
> multi-threaded ForkParser processing program that can gracefully handle
> the huge onslaught that is my use case.
> But at this point, I doubt I'll switch from Tika Server anyway, because I
> invested some time creating a wrapper around it and it is performing very
> well.
>
> On Wed, Nov 25, 2020 at 8:23 PM Luís Filipe Nassif <lfcnas...@gmail.com>
> wrote:
>
>> Not what you asked but related :)
>>
>> Luis
>>
>> On Wed, Nov 25, 2020 at 23:20, Luís Filipe Nassif <lfcnas...@gmail.com>
>> wrote:
>>
>> > I've made a few improvements to ForkParser performance in an internal
>> > fork. Will try to contribute upstream...
>> >
>> > On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza <
>> > nicholas.dipia...@gmail.com> wrote:
>> >
>> >> I am attempting to parse tens of millions of office documents with
>> >> Tika: PDFs, docs, Excel files, XMLs, etc. A wide assortment of types.
>> >>
>> >> Throughput is very important. I need to be able to parse these files
>> >> in a reasonable amount of time, but at the same time, accuracy is also
>> >> pretty important. I hope to have fewer than 10% of the documents fail
>> >> to parse. (By fail I mean fail due to Tika stability, like a timeout
>> >> while parsing. I do not mean fail due to the document itself.)
>> >>
>> >> My question: how do I configure Tika Server in a containerized
>> >> environment to maximize throughput?
>> >>
>> >> My environment:
>> >>
>> >>    - I am using OpenShift.
>> >>    - Each Tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory:
>> >>    *8 GiB to 10 GiB*.
>> >>    - I have 10 Tika parsing pod replicas.
>> >>
>> >> On each pod, I run a Java program with 8 parse threads.
>> >>
>> >> Each thread:
>> >>
>> >>    - Starts a single Tika Server process (in spawn child mode)
>> >>       - Tika Server arguments: -s -spawnChild -maxChildStartupMillis
>> >>       120000 -pingPulseMillis 500 -pingTimeoutMillis 30000
>> >>       -taskPulseMillis 500 -taskTimeoutMillis 120000 -JXmx512m
>> >>       -enableUnsecureFeatures -enableFileUrl
>> >>    - Continuously grabs a file from the files-to-fetch queue and sends
>> >>    it to the Tika Server, stopping when there are no more files to
>> >>    parse.
>> >>
>> >> Each of these files is stored locally on the pod in a buffer, so the
>> >> local file optimization is used.
>> >>
>> >> The Tika web service it is using is:
>> >>
>> >> Endpoint: `/rmeta/text`
>> >> Method: `PUT`
>> >> Headers:
>> >>    - writeLimit = 32000000
>> >>    - maxEmbeddedResources = 0
>> >>    - fileUrl = file:///path/to/file
>> >>
>> >> Files are no larger than 100 MB, and the Tika text output is capped
>> >> (writeLimit) at 32 MB.
>> >>
>> >> Each pod is parsing about 370,000 documents per day. I've been trying
>> >> a ton of different settings.
>> >>
>> >> I previously tried to use the actual Tika "ForkParser", but the
>> >> performance was far worse than spawning Tika Servers. That is why I am
>> >> using Tika Server.
>> >>
>> >> I don't hate the performance results of this... but I feel like I'd
>> >> better reach out in case someone out there sanity-checks my numbers
>> >> and says "whoa, that's awful performance, you should be getting xyz
>> >> like me!"
>> >>
>> >> Is anyone doing anything similar? If so, what settings did you end up
>> >> settling on?
>> >>
>> >> Also, I'm wondering whether Apache HttpClient could be causing any
>> >> overhead here when I call my Tika Server /rmeta/text endpoint. I am
>> >> using a shared connection pool.
>> >> Would there be any benefit in, say, using a unique
>> >> HttpClients.createDefault() for each thread instead of sharing a
>> >> connection pool between the threads?
>> >>
>> >> Cross-posted question here as well:
>> >>
>> >> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
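
For anyone following the thread, here is a rough sketch of the setup described in the original question: one worker thread per Tika Server, each draining a shared files-to-fetch queue and building a PUT request to /rmeta/text with the three headers listed above. It is not the actual wrapper code. The class and method names (TikaRequestSketch, buildRequest, drain), the server URLs and ports, and the use of the JDK's java.net.http builder in place of Apache HttpClient are all assumptions made for illustration. The sketch only builds the requests and never sends them.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names and URLs are illustrative, not from the thread.
class TikaRequestSketch {

    // Builds the PUT /rmeta/text request with the headers from the thread.
    // The body is empty because the file is referenced through fileUrl.
    static HttpRequest buildRequest(URI serverBase, Path file) {
        return HttpRequest.newBuilder(serverBase.resolve("/rmeta/text"))
                .header("writeLimit", "32000000")
                .header("maxEmbeddedResources", "0")
                .header("fileUrl", file.toUri().toString())
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Starts one worker per server; each worker polls the shared queue until
    // it is empty. Returns how many requests were built in total.
    static int drain(Queue<Path> files, List<URI> servers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(servers.size());
        AtomicInteger built = new AtomicInteger();
        for (URI server : servers) {
            pool.submit(() -> {
                Path next;
                while ((next = files.poll()) != null) {
                    HttpRequest request = buildRequest(server, next);
                    // A real worker would send `request` to its own Tika
                    // Server here and handle timeouts and retries.
                    built.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return built.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<Path> files = new ConcurrentLinkedQueue<>(
                List.of(Paths.get("/path/to/file1"), Paths.get("/path/to/file2")));
        // Assumed ports; each worker thread talks to its own server process.
        List<URI> servers = List.of(
                URI.create("http://localhost:9998"),
                URI.create("http://localhost:9999"));
        System.out.println("requests built: " + drain(files, servers));
    }
}
```

Because each worker only ever talks to its own server, whether the HTTP client is shared or per-thread probably matters less than it would with a single shared endpoint, but that is a guess and would need measuring.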