Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
that'd be of any interest.

I kicked off a process to run `exiftool -xmp -b` against the files.
The output will go here:
https://corpora.tika.apache.org/base/exiftool-xmps/
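
In case anyone wants to reproduce something like this locally, here is a minimal sketch of such a run. It assumes `exiftool` is on the PATH; the helper names, the directory layout, and the `.xmp` suffix are hypothetical choices for illustration, not the script actually used for the corpus.

```python
import subprocess
from pathlib import Path


def exiftool_xmp_command(src: Path) -> list[str]:
    """Build the argv for dumping a file's raw XMP block to stdout."""
    # -xmp selects the XMP block; -b emits it as raw binary data.
    return ["exiftool", "-xmp", "-b", str(src)]


def xmp_output_path(src: Path, in_root: Path, out_root: Path) -> Path:
    """Mirror the input layout under out_root, appending '.xmp' (hypothetical naming)."""
    rel = src.relative_to(in_root)
    return out_root / rel.parent / (rel.name + ".xmp")


def extract_all(in_root: Path, out_root: Path) -> int:
    """Run exiftool over every file under in_root; return the number of XMP files written."""
    written = 0
    for src in sorted(p for p in in_root.rglob("*") if p.is_file()):
        result = subprocess.run(exiftool_xmp_command(src), capture_output=True, check=False)
        if result.returncode != 0 or not result.stdout:
            continue  # nothing extracted, or exiftool could not read the file
        dest = xmp_output_path(src, in_root, out_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(result.stdout)
        written += 1
    return written
```

Something like `extract_all(Path("corpus"), Path("exiftool-xmps"))` would then walk the input directory; both directory names are placeholders.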

On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
<lrose...@adobe.com.invalid> wrote:
>
> Very interesting - thanks.
>
> FWIW: The XMPToolkit itself has a module called "XMPFiles" 
> (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to read & 
> write/update XMP (and other related metadata such as EXIF) from various file 
> formats.  It's what all the Adobe apps use to handle XMP in any file format 
> that we encounter.
>
> Leonard
>
> On 3/17/21, 2:48 PM, "Tim Allison" <talli...@apache.org> wrote:
>
>     Wait...I'm sorry...I'm wrong on the first point.
>
>     1) in Tika generally, we (currently) use Jempbox to parse XMP once the
>     parsers have come across it, selected the right packet, and done any
>     joining or other modifications...i.e. the "right" XMP.  We use xmpcore
>     for converting other metadata to XMP in our tika-xmp module, and
>     xmpcore is a dependency of Drew Noakes' metadata-extractor, which is
>     critical.
>
>     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <talli...@apache.org> wrote:
>     >
>     > >Isn't that why you are using the XMP Toolkit???
>     >
>     > Sorry, we may be talking about two different things.
>     >
>     > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
>     > extract it and process it (correctly!) from various file formats.
>     >
>     > 2) For this exercise, I wanted a quick and dirty byte scanner to
>     > extract the raw xmp packets...as much as we could find in any file
>     > format without relying on file-format specific parsers.
>     >
>     > I can do a second run where I modify Tika to extract the XMP from the
>     > various parsers after they do their processing (determining most
>     > recent/joining, etc) to extract the correct XMP.
>     >
>     > And I can do a third run where I modify Tika to extract XMP associated
>     > with embedded images in PDFs, for example.
>     >
>     > I hope this clarifies things.  Please let me know what would be most
>     > useful for you.
>     >
>     > Cheers,
>     >
>     >        Tim
>     >
>     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
>     > <lrose...@adobe.com.invalid> wrote:
>     > >
>     > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
>     > > >
>     > > Isn't that why you are using the XMP Toolkit???
>     > >
>     > > Leonard
>     > >
>     > > On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:
>     > >
>     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results
>     > >     > in almost all cases.  Remember that in a PDF with updates, there
>     > >     > can/will be a new XMP block with each update.
>     > >
>     > >     Ha, right.  I completely understand (perhaps _only_ this small point
>     > >     on PDFs).  On this pass, my goal was to see what was in the file at
>     > >     all, not what was the correct XMP. Part of my interest is in what's
>     > >     available in the file, but not available readily to the user.
>     > >
>     > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
>     > >
>     > >     So, I can definitely take a second run where I let a PDF tool extract
>     > >     the correct XMP if there's interest in that.
>     > >
>     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
>     > >     <lrose...@adobe.com.invalid> wrote:
>     > >     >
>     > >     > >      I'm literally just scraping bytes out of files for now without any parsing
>     > >     > >
>     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results
>     > >     > in almost all cases.  Remember that in a PDF with updates, there
>     > >     > can/will be a new XMP block with each update.
>     > >     >
>     > >     >
>     > >     > > if I traverse the COSDocument's objects and look for /Metadata and grab the stream, will that be what you're looking for?
>     > >     > >
>     > >     > Just getting those elements would be a great start.  If you could
>     > >     > also include the rest of the dictionary in which it was found (or at
>     > >     > least the /Type and /Subtype keys, if present), that would be great!
>     > >     >
>     > >     > Leonard
>     > >     >
>     > >     > On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
>     > >     >
>     > >     >     Hi Leonard,
>     > >     >       I'm literally just scraping bytes out of files for now without any
>     > >     >     parsing...so if the XMP is concealed in a compressed stream or
>     > >     >     something more interesting, I'm not grabbing it.  I'm also not
>     > >     >     tracking which XMP is associated with which object.
>     > >     >       Please forgive me...if I traverse the COSDocument's objects and look
>     > >     >     for /Metadata and grab the stream, will that be what you're looking
>     > >     >     for?  Or, is there a commandline tool I can run to get what you're
>     > >     >     interested in?
>     > >     >       Thank you.
>     > >     >
>     > >     >       Cheers,
>     > >     >
>     > >     >                   Tim
>     > >     >
>     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
>     > >     >     <lrose...@adobe.com.invalid> wrote:
>     > >     >     >
>     > >     >     > Are you only pulling document-level XMP?  If so, could you extend
>     > >     >     > it to support object-level metadata as well?  I, for one, would
>     > >     >     > love to get insight into the use of object-level metadata - what
>     > >     >     > objects are they attached to, what are they being used for, etc.
>     > >     >     >
>     > >     >     > Leonard
>     > >     >     >
>     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
>     > >     >     >
>     > >     >     >     All,
>     > >     >     >
>     > >     >     >       I'm scraping XMPs out of our corpus and placing them here as standalone files:
>     > >     >     >
>     > >     >     >     https://corpora.tika.apache.org/base/xmps/
>     > >     >     >
>     > >     >     >       I've binned the files roughly based on the container file's mime
>     > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
>     > >     >     >
>     > >     >     >       The process is still running, and I view this as a first draft.
>     > >     >     >     Please let me know if there's anything I can do to make these data
>     > >     >     >     easier to use/more useful or if you see any problems.
>     > >     >     >
>     > >     >     >       Cheers,
>     > >     >     >
>     > >     >     >                  Tim
>     > >     >     >
>     > >     >
>     > >
>
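
For reference, the kind of "quick and dirty" byte scanner discussed in the thread above can be sketched roughly as follows. This is an illustrative approximation, not the code that produced the corpus: it assumes packets are delimited by the standard `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, it only matches 8-bit (UTF-8) packets, and it will not see anything inside a compressed stream. The function name is hypothetical.

```python
import re

# Header and trailer processing instructions that wrap a raw XMP packet.
# Only 8-bit (UTF-8) packets are matched; UTF-16/32 packets are missed.
_PACKET = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>"               # packet header
    rb".*?"                                      # serialized XMP plus padding
    rb"<\?xpacket\s+end=[\"'][rw][\"']\s*\?>",  # packet trailer
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, packet_bytes) for every raw XMP packet found in data."""
    return [(m.start(), m.group(0)) for m in _PACKET.finditer(data)]
```

Because the scanner returns every packet it finds, a PDF with incremental updates can yield several of them, and nothing here says which one is current. That is the limitation raised in the thread: this shows what is present in the file, not what the "correct" XMP is.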
