[
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621192#comment-17621192
]
Tim Allison commented on TIKA-3890:
-----------------------------------
bq. Outside of this, I'm assuming it's okay to send a file stream to tika (like
curl --data-binary <data/file>) instead of uploading the file (like curl -T
<file>) and have it spool the stream to disk based on the spoolToDisk setting.
Is that right?
I'm frankly not sure of the underlying difference. It should all work. :D
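On question 3 in the description below: for OOXML files specifically, one cheap pre-check (outside Tika entirely) is to read the app-reported page count stored in {{docProps/app.xml}} inside the .docx zip. A minimal sketch using only the Python standard library — note this count is whatever the authoring application last saved, so it can be stale or absent, and it says nothing about non-OOXML formats:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OOXML extended (app) properties part, docProps/app.xml.
APP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

def docx_reported_pages(path):
    """Return the page count recorded in docProps/app.xml, or None.

    This is the count saved by the authoring application: cheap to read
    (no text extraction), but it may be stale or missing entirely.
    """
    with zipfile.ZipFile(path) as zf:
        try:
            data = zf.read("docProps/app.xml")
        except KeyError:
            # Package has no extended-properties part.
            return None
    root = ET.fromstring(data)
    pages = root.find(f"{{{APP_NS}}}Pages")
    if pages is None or pages.text is None:
        return None
    return int(pages.text)
```

If the reported count exceeds a threshold, the file can be skipped before it ever reaches /tika or /rmeta.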
> Identifying an efficient approach for getting page count prior to running an
> extraction
> ---------------------------------------------------------------------------------------
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
> Issue Type: Improvement
> Components: app
> Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit
> Reporter: Ethan Wilansky
> Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office
> document with an unreasonably large number of pages with extractable text.
> For example a Word document containing thousands of text pages.
> Unfortunately, we don't have an efficient way to determine page count before
> calling the /tika or /rmeta endpoints: we either get back an array
> allocation error, or we must set byteArrayMaxOverride to a large value so
> the call returns the text or metadata containing the page count. Returning
> a result other than the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type:
> application/vnd.openxmlformats-officedocument.wordprocessingml.document"
> [http://localhost:9998/rmeta/ignore]}}
> with the configuration:
> {quote}{{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
> {{<properties>}}
> {{  <parsers>}}
> {{    <parser class="org.apache.tika.parser.DefaultParser">}}
> {{      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    </parser>}}
> {{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
> {{      <params>}}
> {{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
> {{      </params>}}
> {{    </parser>}}
> {{  </parsers>}}
> {{  <server>}}
> {{    <params>}}
> {{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
> {{      <forkedJvmArgs>}}
> {{        <arg>-Xms2000m</arg>}}
> {{        <arg>-Xmx5000m</arg>}}
> {{      </forkedJvmArgs>}}
> {{    </params>}}
> {{  </server>}}
> {{</properties>}}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I
> don't configure {{byteArrayMaxOverride}}, I get this exception in just over
> a second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length
> for this record type is 100,000,000.}}
> The fast exception is the preferred result. With that in mind, can you
> answer these questions?
> 1. Will other extractable file types that don't use the OfficeParser also
> throw the same array allocation error for very large text extractions?
> 2. Is there any way to correlate the array length returned to the number of
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable
> content in a file before sending it for extraction? It doesn't appear that
> /rmeta with the /ignore path param significantly improves efficiency over
> calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)