Hi,

Am 07.12.2013 13:39, schrieb Maruan Sahyoun:
I (re-)started working on the new PDFParser. The PDFLexer as a foundation -
together with some tests - is ready so far. It might need some more improvements
moving forward.

Good news :-)

I'm currently working on the first part of the parser implementation,
which is a 'non-caching' parser. It generates PD and COS level
objects but only keeps the necessary minimum, e.g. Xref, Trailer, ...
and doesn't keep pages, resources, ... in memory. On top of that sits a
'caching' parser which keeps what has been parsed. I don't know if
that's doable, but the idea is that applications like merging or
splitting PDFs could benefit from a 'non-caching' parser.

Caching could be done using SoftReference, so the extra level might not be necessary. Nevertheless, I can think of situations where the different behavior would be of benefit, so maybe the parser should be abstracted (interface etc.), allowing different implementations.
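To illustrate the SoftReference idea: a minimal sketch of a cache whose entries the garbage collector may reclaim under memory pressure, so parsed objects stay around opportunistically without a separate caching layer. The class and type parameters here are illustrative, not existing PDFBox API.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Sketch of a SoftReference-based object cache (hypothetical names).
// Values remain cached until the GC reclaims them under memory pressure,
// so the parser itself can stay "non-caching".
class SoftObjectCache<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();

    void put(K key, V value) {
        cache.put(key, new SoftReference<>(value));
    }

    // Returns null if the key was never cached or the value was reclaimed.
    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        return ref == null ? null : ref.get();
    }
}
```

A caller that misses (gets null) would simply re-parse the object, which keeps correctness independent of what the GC decided to keep.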

The pure COS level parsing is done (e.g. generating a COSDictionary
from tokens), but some additional work is needed around
higher level structures, e.g. linearized PDFs. Initially the parser
reuses most of the existing classes where possible. Unfortunately,
the COS level classes, for example, don't have a common set of methods for
instantiating them.

Question:  Can we agree on how objects are instantiated? E.g.
Obj.getInstance(token) or new Obj(token) ...

I don't have a specific preference but the factory mentioned by Guillaume is a good idea.
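To make the factory idea concrete, here is a hypothetical sketch of a single entry point that turns tokens into COS-level objects. The names (Token, COSFactory, fromToken) and the token-to-type logic are illustrative assumptions, not the existing PDFBox API.

```java
// Hypothetical token type produced by the lexer (illustrative only).
final class Token {
    final String text;
    Token(String text) { this.text = text; }
}

// One agreed-upon factory entry point for COS-level objects,
// instead of a mix of constructors and per-class getInstance methods.
final class COSFactory {
    static Object fromToken(Token token) {
        if ("null".equals(token.text)) {
            return null;
        }
        if ("true".equals(token.text) || "false".equals(token.text)) {
            return Boolean.valueOf(token.text);
        }
        try {
            return Long.valueOf(token.text);   // integer object
        } catch (NumberFormatException e) {
            return token.text;                 // fall back to a name/string
        }
    }
}
```

The benefit of a single factory is that the parser core never needs to know which concrete class a token maps to, so new object kinds can be added in one place.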

This only makes sense if the objects themselves, like pages or
resources, can be fully cloned, so that cloned or imported objects
no longer have a dependency on the original object.
This could benefit PDF merging, as one could close a PDF that is
no longer needed. I think this will affect the current PD model.

Question:  Can we already clone, and what needs to be done to support that? Could
we do an importPage() so the imported page is completely independent (and stored
in memory or in a file-based cache)?

I'm not sure, but I think a deep clone is not supported today.

As the parser parses the PDF, I am thinking about firing events, e.g. to
react to malformed PDFs. I consider this a better approach than
overriding methods or putting workarounds into the core code.

I think the way to see what works best would be to take some workaround examples we have (or should have) now - e.g. finding the real object start (looking back/forth), determining the length of a stream, or even using information from scanning the file sequentially for object start points - and see how they could be realized with the event approach or another one. At least to me it seems that these workarounds need to work quite close to the parser, so in the case of events the handlers need access to low-level functionality.
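One way the "handlers need access to low-level functionality" point could look in code: the parser exposes a narrow access interface, and the handler decides how to recover from a bad object offset. All names here (MalformedPdfHandler, ParserAccess, LenientParser) are hypothetical sketches, not existing API.

```java
// Hypothetical handler for malformed-PDF events. It is given low-level
// parser access so workarounds (re-scanning for the real object start)
// can live outside the core parser code.
interface MalformedPdfHandler {
    // Return the corrected offset at which the parser should retry.
    long onBadObjectStart(long declaredOffset, ParserAccess parser);
}

// Minimal low-level surface the handler would need (illustrative).
interface ParserAccess {
    long scanForwardForObjectStart(long fromOffset);
}

// Core parser code stays free of workarounds: when a declared offset
// doesn't point at an object, it delegates to the registered handler.
class LenientParser {
    private final MalformedPdfHandler handler;
    private final ParserAccess access;

    LenientParser(MalformedPdfHandler handler, ParserAccess access) {
        this.handler = handler;
        this.access = access;
    }

    long resolveObjectStart(long declaredOffset, boolean looksValid) {
        return looksValid ? declaredOffset
                          : handler.onBadObjectStart(declaredOffset, access);
    }
}
```

A strict parser would simply register a handler that throws, while a repairing parser registers one that re-scans, so both behaviors share the same core.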

What about setting up a sandbox to share some initial code without cluttering the
current trunk?

A separate branch for developing the parser until it reaches a usable state would be good.


Best,
Timo

--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_____________________________________________________________________

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_____________________________________________________________________
