Re: [EXTERNAL] Extracting font information from xml

Chris Mattmann Tue, 15 Oct 2019 15:56:55 -0700

When you do a parse, do this:


from tika import parser

# xmlContent=True returns the content as XHTML markup instead of plain text
parsed = parser.from_file('/path/to/file', xmlContent=True)

xmlContent = parsed["content"]

print(xmlContent)
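
To see which elements and attributes the returned XHTML actually contains (and so whether anything font-related is present), a minimal sketch along these lines may help. It assumes the content is well-formed XML. The helper name summarize_markup and the sample string are hypothetical and only stand in for parsed["content"]. Nothing in this thread confirms which attributes Tika emits for a given PDF.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_markup(xhtml):
    """Count element names and attribute names in an XHTML string."""
    root = ET.fromstring(xhtml)
    tags = Counter()
    attrs = Counter()
    for elem in root.iter():
        # Drop the "{namespace}" prefix ElementTree puts on qualified names.
        tags[elem.tag.rsplit("}", 1)[-1]] += 1
        for name in elem.attrib:
            attrs[name.rsplit("}", 1)[-1]] += 1
    return tags, attrs


# Hypothetical sample standing in for parsed["content"]; real output will differ.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="page"><p>Introduction</p><p>Body text</p></div>'
    '</body></html>'
)
tags, attrs = summarize_markup(sample)
print(dict(tags))
print(dict(attrs))
```

With real output, pass parsed["content"] in place of sample and look through the reported attribute names for anything that describes styling.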

 

G’luck!

 

Cheers
Chris

 

 

 

 

From: Jay Chuk <jaychuk2...@gmail.com>
Date: Tuesday, October 15, 2019 at 3:54 PM
To: Chris Mattmann <mattm...@apache.org>
Cc: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: [EXTERNAL] Extracting font information from xml

 

Thanks for the quick reply Chris. 

Is there a Python code snippet for it, please?

 

Regards,

Jay 

 

On Tue, Oct 15, 2019 at 6:52 PM Chris Mattmann <mattm...@apache.org> wrote:

Hi Jay, yes, I believe so. Tika Python is just a thin client to Tika Server, and
it provides this functionality. CC’ing dev@tika

 

 

 

From: Jay Chuk <jaychuk2...@gmail.com>
Date: Tuesday, October 15, 2019 at 3:47 PM
To: "Mattmann, Chris A (US 1761)" <chris.a.mattm...@jpl.nasa.gov>
Subject: [EXTERNAL] Extracting font information from xml

 

Hi Chris, 

 

Thanks for providing the Python package, Tika, for extracting text from PDFs.

 

I'd like to confirm whether, when converting PDF to XML, it is possible to get
the font style of the text, e.g. the font type and whether the text is bold/solid.

I need this information to identify section headers and titles in the
documents.

 

Please let me know if it is possible or if there is another way to go about
this.

 

Thank you

Jay
