[
https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621192#comment-17621192
]
Tim Allison commented on TIKA-3890:
-----------------------------------
bq. Outside of this, I'm assuming it's okay to send a file stream to tika (like
curl --data-binary <data/file>) instead of uploading the file (like curl -T
<file>) and have it spool the stream to disk based on the spoolToDisk setting.
Is that right?
I'm frankly not sure of the underlying difference. It should all work. :D
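On question 3 in the description below: for OOXML files specifically, one cheap pre-check (outside Tika entirely) is to read the app-reported page count stored in {{docProps/app.xml}} inside the .docx zip. A minimal sketch using only the Python standard library — note this count is whatever the authoring application last saved, so it can be stale or absent, and it says nothing about non-OOXML formats:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OOXML extended (app) properties part, docProps/app.xml.
APP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

def docx_reported_pages(path):
    """Return the page count recorded in docProps/app.xml, or None.

    This is the count saved by the authoring application: cheap to read
    (no text extraction), but it may be stale or missing entirely.
    """
    with zipfile.ZipFile(path) as zf:
        try:
            data = zf.read("docProps/app.xml")
        except KeyError:
            # Package has no extended-properties part.
            return None
    root = ET.fromstring(data)
    pages = root.find(f"{{{APP_NS}}}Pages")
    if pages is None or pages.text is None:
        return None
    return int(pages.text)
```

If the reported count exceeds a threshold, the file can be skipped before it ever reaches /tika or /rmeta.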
> Identifying an efficient approach for getting page count prior to running an
> extraction
> ---------------------------------------------------------------------------------------
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
> Issue Type: Improvement
> Components: app
> Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit
> Reporter: Ethan Wilansky
> Priority: Blocker
>
> Tika is doing a great job with text extraction, until we encounter an Office
> document with an unreasonably large number of pages with extractable text.
> For example a Word document containing thousands of text pages.
> Unfortunately, we don't have an efficient way to determine page count before
> calling the /tika or /rmeta endpoints: we either get back an array
> allocation error, or we must set byteArrayMaxOverride to a large value so
> the call returns the text or metadata containing the page count. Returning
> a result other than the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type:
> application/vnd.openxmlformats-officedocument.wordprocessingml.document"
> [http://localhost:9998/rmeta/ignore]}}
> with the configuration:
> {quote}{{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
> {{<properties>}}
> {{  <parsers>}}
> {{    <parser class="org.apache.tika.parser.DefaultParser">}}
> {{      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    </parser>}}
> {{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
> {{      <params>}}
> {{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
> {{      </params>}}
> {{    </parser>}}
> {{  </parsers>}}
> {{  <server>}}
> {{    <params>}}
> {{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
> {{      <forkedJvmArgs>}}
> {{        <arg>-Xms2000m</arg>}}
> {{        <arg>-Xmx5000m</arg>}}
> {{      </forkedJvmArgs>}}
> {{    </params>}}
> {{  </server>}}
> {{</properties>}}
> {quote}
> returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I
> don't configure {{byteArrayMaxOverride}}, I get this exception in just over
> a second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length
> for this record type is 100,000,000.}}
> The fast exception is the preferred result. With that in mind, can you
> answer these questions?
> 1. Will other extractable file types that don't use the OfficeParser also
> throw the same array allocation error for very large text extractions?
> 2. Is there any way to correlate the array length returned to the number of
> lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate lines or pages of extractable
> content in a file before sending it for extraction? It doesn't appear that
> /rmeta with the /ignore path param significantly improves efficiency over
> calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)