Hi,

On 07.12.2013 13:39, Maruan Sahyoun wrote:
> I (re-)started working on the new PDFParser. The PDFLexer as its foundation, together with some tests, is ready so far; it may need some more improvements moving forward.
Good news :-)
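Just to check that I picture it correctly - I imagine the lexer exposing something like a token stream. The names below are my guesses, not the actual PDFLexer API:

    import java.io.IOException;

    // My guess at the shape of a lexer token stream; names are
    // illustrative, not the actual PDFLexer API.
    enum TokenKind {
        INTEGER, REAL, NAME, STRING, ARRAY_OPEN, ARRAY_CLOSE,
        DICT_OPEN, DICT_CLOSE, STREAM_DATA, KEYWORD, EOF
    }

    interface PDFToken {
        TokenKind kind();
        long offset();      // byte position in the file, useful for recovery
        byte[] rawBytes();  // the unparsed bytes of the token
    }

    interface PDFLexer {
        PDFToken next() throws IOException;        // EOF token at end of input
        void seek(long offset) throws IOException; // e.g. to follow an xref entry
    }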
> I'm currently working on the first part of the parser implementation, which is a 'non-caching' parser. It generates PD- and COS-level objects but keeps only the necessary minimum in memory, e.g. the xref table and the trailer, but not pages, resources etc. On top of that sits a 'caching' parser which keeps what has been parsed. I don't know if that's doable, but the idea is that applications like merging or splitting PDFs could benefit from a 'non-caching' parser.
Caching could be done using SoftReference - thus it might not be necessary to have the extra level. Nevertheless I can think of situations where the different behavior could be of benefit, so maybe the parser should be abstracted (interface etc.), allowing different implementations.
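As a rough sketch of what I mean (all names here are made up for illustration): caching could be a decorator around the non-caching implementation, behind one common interface, so the extra level stays optional:

    import java.io.IOException;
    import java.lang.ref.SoftReference;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pdfbox.cos.COSBase;

    // Hypothetical common interface for both parser variants.
    interface ObjectSource {
        COSBase getObject(long objNr, int genNr) throws IOException;
    }

    // Caching decorator: wraps the non-caching parser and keeps parsed
    // objects via SoftReference so the GC may reclaim them under memory
    // pressure.
    class SoftCachingSource implements ObjectSource {
        private final ObjectSource delegate;
        private final Map<Long, SoftReference<COSBase>> cache =
                new HashMap<Long, SoftReference<COSBase>>();

        SoftCachingSource(ObjectSource delegate) {
            this.delegate = delegate;
        }

        public COSBase getObject(long objNr, int genNr) throws IOException {
            long key = (objNr << 16) | (genNr & 0xFFFF); // number + generation
            SoftReference<COSBase> ref = cache.get(key);
            COSBase obj = (ref != null) ? ref.get() : null;
            if (obj == null) {
                obj = delegate.getObject(objNr, genNr); // re-parse on miss
                cache.put(key, new SoftReference<COSBase>(obj));
            }
            return obj;
        }
    }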
> The pure COS-level parsing is done (e.g. generating a COSDictionary from tokens), but some additional work is needed around higher-level structures, e.g. linearized PDFs. Initially the parser reuses most of the existing classes where possible. Unfortunately, the COS-level classes don't have a common set of methods for instantiating them. Question: can we agree on how objects are instantiated, e.g. Obj.getInstance(token) or new Obj(token) ...?
I don't have a specific preference but the factory mentioned by Guillaume is a good idea.
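Such a factory could look roughly like this - a sketch only; the class and method names are my assumptions, reusing the hypothetical PDFToken from above:

    import java.nio.charset.StandardCharsets;

    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSInteger;
    import org.apache.pdfbox.cos.COSName;

    // Sketch of a central factory so every COS object is created the same
    // way; class and method names are hypothetical.
    final class COSObjectFactory {
        private COSObjectFactory() {}

        static COSBase fromToken(PDFToken token) {
            // assuming rawBytes() holds the token without delimiters,
            // e.g. a name without its leading '/'
            String text = new String(token.rawBytes(), StandardCharsets.US_ASCII);
            switch (token.kind()) {
                case INTEGER:
                    return COSInteger.get(Long.parseLong(text));
                case NAME:
                    return COSName.getPDFName(text);
                // ... further token kinds (REAL, STRING, ...)
                default:
                    throw new IllegalArgumentException("unexpected token: " + token.kind());
            }
        }
    }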
> This only makes sense if the objects themselves, like pages or resources, can be fully cloned, so that cloned or imported objects no longer have a dependency on the original object. This could benefit PDF merging, as one could close a PDF that is no longer needed. I think this will affect the current PD model. Question: can we already clone? What needs to be done to support that? Could we have an importPage() so the imported page is completely independent (and stored in memory or in a file-based cache)?
I'm not sure but I think a deep clone is not supported today.
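If we add it, a deep clone would have to walk the object graph itself and track visited objects, since PDF structures are cyclic (e.g. /Parent links). A minimal sketch, assuming we only handle dictionaries and arrays and share immutable leaves:

    import java.util.IdentityHashMap;
    import java.util.Map;

    import org.apache.pdfbox.cos.COSArray;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;

    // Sketch of a recursive deep clone. PDF object graphs are cyclic
    // (e.g. /Parent links), so already-visited objects must be tracked.
    final class DeepCloner {
        private final Map<COSBase, COSBase> visited =
                new IdentityHashMap<COSBase, COSBase>();

        COSBase deepClone(COSBase base) {
            COSBase done = visited.get(base);
            if (done != null) {
                return done; // reuse to preserve sharing and cycles
            }
            if (base instanceof COSDictionary) {
                COSDictionary src = (COSDictionary) base;
                COSDictionary dst = new COSDictionary();
                visited.put(src, dst);
                for (COSName key : src.keySet()) {
                    dst.setItem(key, deepClone(src.getItem(key)));
                }
                return dst;
            }
            if (base instanceof COSArray) {
                COSArray src = (COSArray) base;
                COSArray dst = new COSArray();
                visited.put(src, dst);
                for (int i = 0; i < src.size(); i++) {
                    dst.add(deepClone(src.get(i)));
                }
                return dst;
            }
            // immutable leaves (numbers, names, booleans) can be shared as-is;
            // streams would additionally need their raw data copied (omitted)
            return base;
        }
    }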
> As the parser parses the PDF, I'm thinking about firing events, e.g. to react to malformed PDFs. I consider this a better approach than overriding methods or putting workarounds into the core code.
I think the best way to see what works would be to take some of the workaround examples we have (or should have) now - e.g. finding the real start of an object (looking back/forth), determining the length of a stream, or even using information gathered by scanning the file sequentially for object start points - and see how they could be realized with the event approach or another one. At least to me it seems that these workarounds need to work quite close to the parser, so with events the handlers would need access to low-level functionality.
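To make that concrete, the handler could simply be handed a context object with the needed low-level operations; a sketch with entirely hypothetical names:

    import java.io.IOException;

    // Sketch of an event hook for malformed-PDF recovery; the handler gets
    // low-level access so it can reposition the lexer and rescan.
    // All names here are hypothetical.
    interface MalformedPdfHandler {
        // Called when the object does not start where the xref table says.
        // Returns the corrected offset, or -1 to give up.
        long objectNotAtOffset(long objNr, int genNr, long expectedOffset,
                               ParserContext ctx) throws IOException;

        // Called when a stream's /Length does not match the data; returns
        // the recovered length (e.g. found by scanning for 'endstream').
        long invalidStreamLength(long streamStart, long declaredLength,
                                 ParserContext ctx) throws IOException;
    }

    // Low-level services a handler may use for recovery.
    interface ParserContext {
        void seek(long offset) throws IOException;               // reposition the lexer
        long scanForKeyword(String keyword) throws IOException;  // e.g. "obj", "endstream"
    }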
> What about setting up a sandbox to share some initial code without cluttering the current trunk?
A separate branch for developing the parser until it reaches a usable state would be good.
Best,
Timo

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com

_____________________________________________________________________
OntoChem GmbH
Managing Director: Dr. Lutz Weber
Registered office: Halle / Saale
Register court: Stendal
Register number: HRB 215461
_____________________________________________________________________