Hi Chris - thank you for forwarding the request! Once the team has reviewed
it, I'll give it another try.

Thank you,
Hannah

On Wed, May 26, 2021 at 5:24 PM Chris Mattmann <mattm...@apache.org> wrote:

> Hannah, I am pushing your question upstream to the dev@tika list. I think
> what you need is for them to look at your config file, which I’ve pasted
> below, and see whether it looks OK. Then, in Tika Python, you need to give
> it this config file before your server starts up. Alternatively, start
> your server outside of Python with this config file, and Tika Python will
> pick it up:
>
>
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <!-- Exclude default values -->
>         <parser class="org.apache.tika.parser.DefaultParser">
>             <!-- <property-exclude name = "sortByPosition"/> -->
>             <mime-exclude>application/pdf</mime-exclude>
>         </parser>
>         <!-- Ensure that sorts by position -->
>         <parser class="org.apache.tika.parser.EmptyParser">
>             <mime>application/pdf</mime>
>             <property name="sortByPosition" value="true"/>
>         </parser>
>     </parsers>
> </properties>
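>
> For comparison, here is a minimal sketch of how a per-parser setting is
> commonly written in a Tika 1.x config. It rests on assumptions the
> dev@tika list should confirm: that the PDF parser class is
> org.apache.tika.parser.pdf.PDFParser, that it is excluded from the
> default parser with <parser-exclude>, and that sortByPosition is passed
> in a <params> block rather than as a <property> element. It is not a
> tested replacement for the file above.
>
> ```xml
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <!-- Assumption: keep every default parser except the PDF parser -->
>         <parser class="org.apache.tika.parser.DefaultParser">
>             <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
>         </parser>
>         <!-- Assumption: configure the PDF parser itself, not EmptyParser -->
>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>             <params>
>                 <param name="sortByPosition" type="bool">true</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
> ```
>
> The idea is that the setting is attached to the parser that actually
> handles application/pdf; EmptyParser, as its name suggests, is unlikely
> to produce any text for that type.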
>
>
>
>
>
> Cheers,
>
> Chris
>
>
>
>
>
> *From: *Hannah Eli <elihann...@gmail.com>
> *Date: *Wednesday, May 26, 2021 at 1:47 PM
> *To: *"Mattmann, Chris A (US 1740)" <chris.a.mattm...@jpl.nasa.gov>
> *Subject: *[EXTERNAL] Question on custom tika-python configs for OMB PDF
>
>
>
> Hi Chris,
>
>
>
> Hope you're well. I'm trying to use Tika to parse the table of contents
> for the Office of Management and Budget's A-11 Circular PDF
> <https://www.whitehouse.gov/wp-content/uploads/2018/06/a11_web_toc.pdf> (I
> know it's short enough to parse manually, but we're building a repeatable
> extract). When I do so, the text is parsed out of order. I was trying to
> fix this by creating a custom config file with the sortByPosition property
> (see attached), but I'm not an XML guru and don't believe it's working
> properly. I've also tried changing the Windows environment variables to
> point to this file.
>
>
>
> Any guidance would be much appreciated.
>
>
>
> Thank you!
>
> Hannah
>
>
>
> --
>
> *Hannah Eli*
>


-- 
*Hannah Eli*
(317) 656-1366 | elihann...@gmail.com
