Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd be of any interest.
I kicked off a process to run `exifTool -xmp -b` against the files. The
output will go here: https://corpora.tika.apache.org/base/exiftool-xmps/

On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
<lrose...@adobe.com.invalid> wrote:
>
> Very interesting - thanks.
>
> FWIW: The XMPToolkit itself has a module called "XMPFiles"
> (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to
> read & write/update XMP (and other related metadata such as EXIF) from
> various file formats. It's what all the Adobe apps use to handle XMP in
> any file format that we encounter.
>
> Leonard
>
> On 3/17/21, 2:48 PM, "Tim Allison" <talli...@apache.org> wrote:
>
> Wait...I'm sorry...I'm wrong on the first point.
>
> 1) in Tika generally, we use Jempbox (currently) to parse XMP when the
> parsers come across it and after they select the right one and do any
> joining or other modifications...e.g. the "right" xmp. We use xmpcore
> for converting other metadata to XMP in our tika-xmp module, and
> xmpcore is a dependency of Drew Noakes' metadata-extractor which is
> critical.
>
> On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <talli...@apache.org> wrote:
> >
> > >Isn't that why are you using the XMP Toolkit???
> >
> > Sorry, we may be talking about two different things.
> >
> > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
> > extract it and process it (correctly!) from various file formats.
> >
> > 2) For this exercise, I wanted a quick and dirty byte scanner to
> > extract the raw xmp packets...as much as we could find in any file
> > format without relying on file-format specific parsers.
> >
> > I can do a second run where I modify Tika to extract the XMP from the
> > various parsers after they do their processing (determining most
> > recent/joining, etc) to extract the correct XMP.
> >
> > And I can do a third run where I modify Tika to extract XMP associated
> > with embedded images in PDFs, for example.
> >
> > I hope this clarifies things. Please let me know what would be most
> > useful for you.
> >
> > Cheers,
> >
> > Tim
> >
> > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> > <lrose...@adobe.com.invalid> wrote:
> > >
> > > > The other thing is that I wanted to scrape xmp out of files
> > > > beyond PDFs.
> > >
> > > Isn't that why are you using the XMP Toolkit???
> > >
> > > Leonard
> > >
> > > On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:
> > >
> > > > ARGH!!!! Please don't do this - it will get you the wrong results
> > > > in almost all cases. Remember that in a PDF with updates, there
> > > > can/will be a new XMP block with each update.
> > >
> > > Ha, right. I completely understand (perhaps _only_ this small point
> > > on PDFs). On this pass, my goal was to see what was in the file at
> > > all, not what was the correct XMP. Part of my interest is in what's
> > > available in the file, but not available readily to the user.
> > >
> > > The other thing is that I wanted to scrape xmp out of files beyond
> > > PDFs.
> > >
> > > So, I can definitely take a second run where I let a PDF tool
> > > extract the correct XMP if there's interest in that.
> > >
> > > On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> > > <lrose...@adobe.com.invalid> wrote:
> > > >
> > > > > I'm literally just scraping bytes out of files for now without
> > > > > any parsing
> > > >
> > > > ARGH!!!! Please don't do this - it will get you the wrong results
> > > > in almost all cases. Remember that in a PDF with updates, there
> > > > can/will be a new XMP block with each update.
> > > >
> > > > > if I traverse the COSDocument's objects and look for /Metadata
> > > > > and grab the stream, will that be what you're looking for?
> > > >
> > > > Just getting those elements would be a great start. If you could
> > > > also include the rest of the dictionary in which it was found (or
> > > > at least the /Type and /Subtype keys, if present) would be great!
> > > >
> > > > Leonard
> > > >
> > > > On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
> > > >
> > > > Hi Leonard,
> > > > I'm literally just scraping bytes out of files for now without any
> > > > parsing...so if the XMP is concealed in a compressed stream or
> > > > something more interesting, I'm not grabbing it. I'm also not
> > > > tracking which XMP is associated with which object.
> > > > Please forgive me...if I traverse the COSDocument's objects and
> > > > look for /Metadata and grab the stream, will that be what you're
> > > > looking for? Or, is there a commandline tool I can run to get what
> > > > you're interested in?
> > > > Thank you.
> > > >
> > > > Cheers,
> > > >
> > > > Tim
> > > >
> > > > On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> > > > <lrose...@adobe.com.invalid> wrote:
> > > > >
> > > > > Are you only pulling document-level XMP? If so, could you extend
> > > > > it to support object-level metadata as well? I, for one, would
> > > > > love to get insight into the use of object-level metadata - what
> > > > > objects are they attached to, what are they being used for, etc.
> > > > >
> > > > > Leonard
> > > > >
> > > > > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> > > > >
> > > > > All,
> > > > >
> > > > > I'm scraping XMPs out of our corpus and placing them here as
> > > > > standalone files:
> > > > >
> > > > > https://corpora.tika.apache.org/base/xmps/
> > > > >
> > > > > I've binned the files roughly based on the container file's mime
> > > > > type, e.g.
> > > > > https://corpora.tika.apache.org/base/xmps/pdf/
> > > > >
> > > > > The process is still running, and I view this as a first draft.
> > > > > Please let me know if there's anything I can do to make these
> > > > > data easier to use/more useful or if you see any problems.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Tim
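
For readers following the thread, the "quick and dirty byte scanner" idea could be sketched roughly as below. This is not the code that was actually run against the corpus; the function name `scan_xmp_packets` is hypothetical, and the sketch assumes the packets are stored uncompressed, in a single-byte-compatible encoding such as UTF-8, and wrapped in the usual `<?xpacket begin=...?>` / `<?xpacket end=...?>` processing instructions. As noted in the thread, it returns every packet it finds, including superseded ones in an incrementally updated PDF, and says nothing about which object a packet belongs to.

```python
import re
from typing import List

# Matches from an opening <?xpacket begin=...?> marker to the nearest
# closing <?xpacket end="r"?> or <?xpacket end="w"?> marker.
# Assumption: the packet bytes are uncompressed and ASCII/UTF-8 compatible.
XPACKET = re.compile(
    rb"<\?xpacket\s+begin=.*?<\?xpacket\s+end=[\"'][rw][\"']\s*\?>",
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> List[bytes]:
    """Return every raw XMP packet found in the byte string, in file order.

    This does no format-specific parsing, so it cannot tell which packet
    is the current one or which object it is attached to.
    """
    return [match.group(0) for match in XPACKET.finditer(data)]


if __name__ == "__main__":
    import sys

    # Usage: python scan_xmp.py some_file.pdf
    with open(sys.argv[1], "rb") as handle:
        packets = scan_xmp_packets(handle.read())
    print(f"found {len(packets)} packet(s)")
```

Anything inside a compressed stream is invisible to a scan like this, which is why a format-aware pass (for example, walking the PDF objects for /Metadata entries) would be needed to get the correct XMP.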