Hi Tim,

Here is a quick code sample that iterates over the pages of a PDF and
reports whether a page, or any of its image resources, contains metadata.

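// display(...) is a placeholder for your own reporting code
for (PDPage page : document.getPages())
{
    // metadata attached directly to the page dictionary
    COSBase metaObj = page.getCOSObject().getDictionaryObject(COSName.METADATA);
    if (metaObj instanceof COSStream)
    {
        PDMetadata meta = new PDMetadata((COSStream) metaObj);
        display("found page with metadata", meta);
    }

    // a page does not necessarily have a resources dictionary
    PDResources resources = page.getResources();
    if (resources == null)
    {
        continue;
    }

    // metadata attached to the XObjects (e.g. images) used by the page
    for (COSName resName : resources.getXObjectNames())
    {
        PDXObject xObject = resources.getXObject(resName);
        metaObj = xObject.getCOSObject().getDictionaryObject(COSName.METADATA);
        if (metaObj instanceof COSStream)
        {
            PDMetadata meta = new PDMetadata((COSStream) metaObj);
            display("found image with metadata", meta);
        }
    }
}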

This could be extended to report metadata for other resources such as
fonts.

A different approach would be to go low level: get the page's
COSDictionary, check getDictionaryObject(COSName.METADATA), then
iterate over all dictionary keys looking for values which are
themselves dictionaries, check getDictionaryObject(COSName.METADATA)
on those, and so on.

One caveat with that approach is that you need to keep track of the
dictionaries you have already visited, as PDF objects can contain back
references (for example from a child to its parent).
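
To illustrate the visited tracking in the low-level approach, here is a
minimal sketch that deliberately does not use PDFBox at all. Plain
java.util.Map instances stand in for COSDictionary, the string "Metadata"
stands in for COSName.METADATA, and the class and method names
(MetadataWalk, findMetadata) are made up for this example. With PDFBox the
same idea would presumably be applied via COSDictionary.keySet() and
getDictionaryObject(...), but treat this as a sketch of the traversal, not
as tested PDFBox code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MetadataWalk
{
    // Hypothetical stand-in for COSName.METADATA.
    static final String METADATA = "Metadata";

    /** Returns the path of every dictionary reachable from root that has a Metadata entry. */
    public static List<String> findMetadata(Map<String, Object> root)
    {
        List<String> hits = new ArrayList<>();
        // Identity-based set: objects are compared by reference, so cyclic
        // structures never trigger a recursive equals()/hashCode().
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        walk(root, "/", visited, hits);
        return hits;
    }

    private static void walk(Map<String, Object> dict, String path,
                             Set<Object> visited, List<String> hits)
    {
        if (!visited.add(dict))
        {
            return; // already visited, e.g. reached again through a back reference
        }
        if (dict.containsKey(METADATA))
        {
            hits.add(path);
        }
        for (Map.Entry<String, Object> entry : dict.entrySet())
        {
            Object value = entry.getValue();
            // Do not descend into the metadata entry itself.
            if (value instanceof Map && !METADATA.equals(entry.getKey()))
            {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                walk(child, path + entry.getKey() + "/", visited, hits);
            }
        }
    }

    public static void main(String[] args)
    {
        Map<String, Object> page = new LinkedHashMap<>();
        Map<String, Object> image = new LinkedHashMap<>();
        page.put("Metadata", "page XMP");
        page.put("Image1", image);
        image.put("Metadata", "image XMP");
        image.put("Parent", page); // back reference, would loop forever without the visited set
        System.out.println(findMetadata(page)); // prints [/, /Image1/]
    }
}
```

The identity-based set matters here: a regular HashSet would call
hashCode() on the maps, which recurses endlessly once the structure
contains a cycle.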

I could extend the ExtractMetadata example in the PDFBox example code
if that helps you get started. Otherwise please drop a quick note if I
can be of any help.

BR
Maruan


On Friday, 19.03.2021 at 11:42 -0400, Tim Allison wrote:
> All,
> 
>     The processes finished: https://corpora.tika.apache.org/base/xmps/
> 
>     Now has two subdirectories, one for the original raw byte scraping
> (1.2 million files with some junk
> https://corpora.tika.apache.org/base/xmps/scraped-xmps/) and one for
> the logical XMPs extracted by ExifTool (450k files
> https://corpora.tika.apache.org/base/xmps/exiftool-xmps/).
> 
>      I plan to write some lightweight code to traverse the DOM and
> look for all /Metadata objects and what they're attached to.
> 
>      If the XMP files are of any use or if they'd be of more use to
> you if we did further processing or packaging, please let me know.
> 
>     Cheers,
> 
>               Tim
> 
> On Wed, Mar 17, 2021 at 4:21 PM Tim Allison <talli...@apache.org>
> wrote:
> > 
> > > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> > > that'd be of any interest.
> > 
> > If there were a commandline or a Java SDK, I could run that next if
> > that'd be of any interest. :D
> > 
> > On Wed, Mar 17, 2021 at 3:28 PM Tim Allison <talli...@apache.org>
> > wrote:
> > > 
> > > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> > > that'd be of any interest.
> > > 
> > > I kicked off a process to run `exifTool -xmp -b` against the files.
> > > The output will go here:
> > > https://corpora.tika.apache.org/base/exiftool-xmps/
> > > 
> > > On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
> > > <lrose...@adobe.com.invalid> wrote:
> > > > 
> > > > Very interesting - thanks.
> > > > 
> > > > FWIW: The XMPToolkit itself has a module called "XMPFiles"
> > > > (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it
> > > > is to read & write/update XMP (and other related metadata such as
> > > > EXIF) from various file formats.  It's what all the Adobe apps
> > > > use to handle XMP in any file format that we encounter.
> > > > 
> > > > Leonard
> > > > 
> > > > On 3/17/21, 2:48 PM, "Tim Allison" <talli...@apache.org> wrote:
> > > > 
> > > >     Wait...I'm sorry...I'm wrong on the first point.
> > > > 
> > > >     1) in Tika generally, we use Jempbox (currently) to parse XMP
> > > > when the
> > > >     parsers come across it and after they select the right one
> > > > and do any
> > > >     joining or other modifications...e.g. the "right" xmp.  We
> > > > use xmpcore
> > > >     for converting other metadata to XMP in our tika-xmp module,
> > > > and
> > > >     xmpcore is a dependency of Drew Noakes' metadata-extractor
> > > > which is
> > > >     critical.
> > > > 
> > > >     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison
> > > > <talli...@apache.org> wrote:
> > > >     >
> > > >     > >Isn't that why are you using the XMP Toolkit???
> > > >     >
> > > >     > Sorry, we may be talking about two different things.
> > > >     >
> > > >     > 1) In Tika generally, we use xmpcore to parse XMP after the
> > > > parsers
> > > >     > extract it and process it (correctly!) from various file
> > > > formats.
> > > >     >
> > > >     > 2) For this exercise, I wanted a quick and dirty byte
> > > > scanner to
> > > >     > extract the raw xmp packets...as much as we could find in
> > > > any file
> > > >     > format without relying on file-format specific parsers.
> > > >     >
> > > >     > I can do a second run where I modify Tika to extract the
> > > > XMP from the
> > > >     > various parsers after they do their processing (determining
> > > > most
> > > >     > recent/joining, etc) to extract the correct XMP.
> > > >     >
> > > >     > And I can do a third run where I modify Tika to extract XMP
> > > > associated
> > > >     > with embedded images in PDFs, for example.
> > > >     >
> > > >     > I hope this clarifies things.  Please let me know what
> > > > would be most
> > > >     > useful for you.
> > > >     >
> > > >     > Cheers,
> > > >     >
> > > >     >        Tim
> > > >     >
> > > >     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> > > >     > <lrose...@adobe.com.invalid> wrote:
> > > >     > >
> > > >     > > >    The other thing is that I wanted to scrape xmp out
> > > > of files beyond PDFs.
> > > >     > > >
> > > >     > > Isn't that why are you using the XMP Toolkit???
> > > >     > >
> > > >     > > Leonard
> > > >     > >
> > > >     > > On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org>
> > > > wrote:
> > > >     > >
> > > >     > >     > ARGH!!!!   Please don't do this - it will get you
> > > > the wrong results in almost all cases.     Remember that in a PDF
> > > > with updates, there can/will be a new XMP block with each update.
> > > >     > >
> > > >     > >     Ha, right.  I completely understand (perhaps _only_
> > > > this small point
> > > >     > >     on PDFs).  On this pass, my goal was to see what was
> > > > in the file at
> > > >     > >     all, not what was the correct XMP. Part of my
> > > > interest is in what's
> > > >     > >     available in the file, but not available readily to
> > > > the user.
> > > >     > >
> > > >     > >     The other thing is that I wanted to scrape xmp out of
> > > > files beyond PDFs.
> > > >     > >
> > > >     > >     So, I can definitely take a second run where I let a
> > > > PDF tool extract
> > > >     > >     the correct XMP if there's interest in that.
> > > >     > >
> > > >     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> > > >     > >     <lrose...@adobe.com.invalid> wrote:
> > > >     > >     >
> > > >     > >     > >      I'm literally just scraping bytes out of
> > > > files for now without any parsing
> > > >     > >     > >
> > > >     > >     > ARGH!!!!   Please don't do this - it will get you
> > > > the wrong results in almost all cases.     Remember that in a PDF
> > > > with updates, there can/will be a new XMP block with each update.
> > > >     > >     >
> > > >     > >     >
> > > >     > >     > > if I traverse the COSDocument's objects and
> > > > look     for /Metadata and grab the stream, will that be what
> > > > you're looking     for?
> > > >     > >     > >
> > > >     > >     > Just getting those elements would be a great
> > > > start.  If you could also include the rest of the dictionary in
> > > > which it was found (or at least the /Type and /Subtype keys, if
> > > > present) would be great!
> > > >     > >     >
> > > >     > >     > Leonard
> > > >     > >     >
> > > >     > >     > On 3/17/21, 1:39 PM, "Tim Allison"
> > > > <talli...@apache.org> wrote:
> > > >     > >     >
> > > >     > >     >     Hi Leonard,
> > > >     > >     >       I'm literally just scraping bytes out of
> > > > files for now without any
> > > >     > >     >     parsing...so if the XMP is concealed in a
> > > > compressed stream or
> > > >     > >     >     something more interesting, I'm not grabbing
> > > > it.  I'm also not
> > > >     > >     >     tracking which XMP is associated with which
> > > > object.
> > > >     > >     >       Please forgive me...if I traverse the
> > > > COSDocument's objects and look
> > > >     > >     >     for /Metadata and grab the stream, will that be
> > > > what you're looking
> > > >     > >     >     for?  Or, is there a commandline tool I can run
> > > > to get what you're
> > > >     > >     >     interested in?
> > > >     > >     >       Thank you.
> > > >     > >     >
> > > >     > >     >       Cheers,
> > > >     > >     >
> > > >     > >     >                   Tim
> > > >     > >     >
> > > >     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard
> > > > Rosenthol
> > > >     > >     >     <lrose...@adobe.com.invalid> wrote:
> > > >     > >     >     >
> > > >     > >     >     > Are you only pulling document-level XMP?  If
> > > > so, could you extend it to support object-level metadata as
> > > > well?   I, for one, would love to get insight into the use of
> > > > object-level metadata - what objects are they attached to, what
> > > > are they being used for, etc.
> > > >     > >     >     >
> > > >     > >     >     > Leonard
> > > >     > >     >     >
> > > >     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison"
> > > > <talli...@apache.org> wrote:
> > > >     > >     >     >
> > > >     > >     >     >     All,
> > > >     > >     >     >
> > > >     > >     >     >       I'm scraping XMPs out of our corpus and
> > > > placing them here as standalone files:
> > > >     > >     >     >
> > > >     > >     >     >    
> > > > https://corpora.tika.apache.org/base/xmps/
> > > >     > >     >     >
> > > >     > >     >     >       I've binned the files roughly based on
> > > > the container file's mime
> > > >     > >     >     >     type, e.g.
> > > > https://corpora.tika.apache.org/base/xmps/pdf/
> > > >     > >     >     >
> > > >     > >     >     >       The process is still running, and I
> > > > view this as a first draft.
> > > >     > >     >     >     Please let me know if there's anything I
> > > > can do to make these data
> > > >     > >     >     >     easier to use/more useful or if you see
> > > > any problems.
> > > >     > >     >     >
> > > >     > >     >     >       Cheers,
> > > >     > >     >     >
> > > >     > >     >     >                  Tim
> > > >     > >     >     >
> > > >     > >     >
> > > >     > >
> > > > 

-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827
