RE: Is Apache PDFBox based on the Arlington PDF Model? ...

Peter Wyatt Sat, 08 Oct 2022 22:35:04 -0700

The Arlington PDF Model is all about the model data, NOT the code. The code 
artifacts are merely PoC hacks used for prototyping and assessing the 
capabilities and expressiveness of the model itself. The model is a standalone 
machine-readable definition of every PDF object defined in the ISO PDF 2.0 
specification including many data integrity relationships. The model itself is 
also continuing to grow and expand in terms of scope and expressiveness.


For an existing and mature project like Apache PDFBox, the value of the data 
model will most likely lie in checking implementation details (e.g. 
required-ness of keys, valid data ranges, etc.), test case identification and/or 
generation, or debugging. Undoubtedly Apache PDFBox will also have grown some 
level of "permissiveness" to account for real-world malformations found in PDFs 
- the Arlington PDF Model also provides a means by which such permissiveness 
can be defined and documented as extensions to the official ISO baseline (the 
nominal "ground truth" for PDF).
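
As a rough illustration of the "required-ness of keys" check mentioned above, here is a minimal Python sketch. It assumes the model is consumed as tab-separated text with "Key", "Type" and "Required" columns. The sample rows and the helper names (required_keys, missing_required) are hypothetical and made up for this example, not taken from the model or from PDFBox. It also treats only a literal TRUE as required, so any conditional expressions in that column are ignored.

```python
import csv
import io

# Hypothetical excerpt in the style of a model TSV file (illustrative rows only).
SAMPLE_TSV = (
    "Key\tType\tRequired\n"
    "Type\tname\tTRUE\n"
    "Parent\tdictionary\tTRUE\n"
    "Rotate\tinteger\tFALSE\n"
)


def required_keys(tsv_text):
    """Return the set of keys whose Required column is the literal TRUE."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Key"]
        for row in reader
        if row["Required"].strip().upper() == "TRUE"
    }


def missing_required(pdf_dict, tsv_text):
    """List required keys that are absent from a parsed PDF dictionary."""
    return sorted(required_keys(tsv_text) - set(pdf_dict))


# A stand-in for a parsed page dictionary that lacks its Parent entry.
page = {"Type": "Page", "Rotate": 90}
print(missing_required(page, SAMPLE_TSV))
```

An implementation could run a check like this over the dictionaries it parses or writes and compare the result against its own hard-coded assumptions.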

A significant update is also about to occur to the Arlington PDF Model master 
branch as a result of use and adoption by others... see the "Extensions" branch.

> -----Original Message-----
> From: Jason Pyeron <jpye...@pdinc.us>
> Sent: Sunday, 9 October 2022 1:57 AM
> To: users@pdfbox.apache.org
> Subject: RE: Is Apache PDFBox based on the Arlington PDF Model? ...
> 
> > -----Original Message-----
> > From: Albretch Mueller
> > Sent: Saturday, October 8, 2022 9:24 AM
> >
> >  https://github.com/pdf-association/arlington-pdf-model/
> 
> Interesting project from the PDF Association.
> 
> >
> >  For whatever reason I (wrongly?) thought that to be the case:
> >
> >  https://en.wikipedia.org/wiki/Apache_PDFBox
> >
> >  https://en.wikipedia.org/wiki/COCOMO
> >
> 
> What does COCOMO have to do with the topic of the question?
> 
> >  But I am not sure if it makes any functional sense anyway.
> >
> >  I think it should be relatively easy and easily maintainable to code
> > around that model, which makes me wonder why hasn't a project been
> > started based on such baseline ideas.
> 
> To start with, looking at their initial commit to understand their point(s) 
> of view and development vector:
> 
> commit a512182b24419a8b71895e262135f937ed22f1f9
> Author: Roman Toda <t...@digitaldocuments.org>
> Date:   Tue Feb 4 13:52:49 2020 +0100
> 
>     initial commit
> 
> There are several DLL files and mostly C code: very Windows-centric 
> development, and not about reading/writing PDFs in Java.
> 
> Also, this started many years after PDFBox. So, to answer the question in the 
> subject: no. Apache PDFBox cannot be based on the Arlington PDF Model, since 
> PDFBox v1 was started in 2008 and v2 was first released in 2015.
> 
> A quick search of every commit on or before r1904460 (12a38bf88) for 
> 'Arlington' has no results.
> 
> Next let's look at their java code:
> 
> $ find -name '*.java'
> ./gcxml/src/gcxml/Gcxml.java
> ./gcxml/src/gcxml/TSVHandler.java
> ./gcxml/src/gcxml/XMLCreator.java
> ./gcxml/src/gcxml/XMLQuery.java
> 
> Not very much, just seems to be their GC XML program. From the readme.md:
> 
> GCXML - Java PoC utility
> Java-based proof of concept CLI utility that can:
> 
> convert an Arlington TSV file set into PDF version specific subsets (also as 
> TSV)
> 
> In summary, not sure how or why any of this would be applicable to PDFBox.
> 
> -Jason
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org


