> The other thing is that I wanted to scrape xmp out of files beyond PDFs.
>

Isn't that why you are using the XMP Toolkit???

Leonard

On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:

> ARGH!!!! Please don't do this - it will get you the wrong results in almost all cases. Remember that in a PDF with updates, there can/will be a new XMP block with each update.

Ha, right. I completely understand (perhaps _only_ this small point on PDFs). On this pass, my goal was to see what was in the file at all, not what was the correct XMP. Part of my interest is in what's available in the file, but not available readily to the user. The other thing is that I wanted to scrape xmp out of files beyond PDFs. So, I can definitely take a second run where I let a PDF tool extract the correct XMP if there's interest in that.

On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol <lrose...@adobe.com.invalid> wrote:
>
> > I'm literally just scraping bytes out of files for now without any parsing
>
> ARGH!!!! Please don't do this - it will get you the wrong results in almost all cases. Remember that in a PDF with updates, there can/will be a new XMP block with each update.
>
> > if I traverse the COSDocument's objects and look for /Metadata and grab the stream, will that be what you're looking for?
>
> Just getting those elements would be a great start. If you could also include the rest of the dictionary in which it was found (or at least the /Type and /Subtype keys, if present), that would be great!
>
> Leonard
>
> On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
>
> Hi Leonard,
> I'm literally just scraping bytes out of files for now without any parsing...so if the XMP is concealed in a compressed stream or something more interesting, I'm not grabbing it. I'm also not tracking which XMP is associated with which object.
> Please forgive me...if I traverse the COSDocument's objects and look for /Metadata and grab the stream, will that be what you're looking for? Or, is there a commandline tool I can run to get what you're interested in?
> Thank you.
>
> Cheers,
>
> Tim
>
> On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol <lrose...@adobe.com.invalid> wrote:
> >
> > Are you only pulling document-level XMP? If so, could you extend it to support object-level metadata as well? I, for one, would love to get insight into the use of object-level metadata - what objects are they attached to, what are they being used for, etc.
> >
> > Leonard
> >
> > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> >
> > All,
> >
> > I'm scraping XMPs out of our corpus and placing them here as standalone files:
> >
> > https://corpora.tika.apache.org/base/xmps/
> >
> > I've binned the files roughly based on the container file's mime type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
> >
> > The process is still running, and I view this as a first draft. Please let me know if there's anything I can do to make these data easier to use/more useful or if you see any problems.
> >
> > Cheers,
> >
> > Tim
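
For readers following the thread, the byte-level scrape Tim describes ("literally just scraping bytes out of files for now without any parsing") might look roughly like the sketch below. It is a minimal Python illustration and not the code used to build the corpus; the function name scrape_xmp_packets and the regular expression are assumptions made for this example. It only finds raw, uncompressed `<?xpacket begin=... ?>` to `<?xpacket end=... ?>` wrappers, so it misses XMP held in compressed streams, does not record which object a packet belongs to, and in a PDF with incremental updates returns every packet it sees instead of the current one, which is the problem Leonard points out.

```python
import re
from typing import List, Tuple

# Assumption for this sketch: an XMP packet is delimited by the
# <?xpacket begin=...?> and <?xpacket end=...?> processing instructions
# and appears uncompressed in the file bytes.
XPACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>.*?<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)


def scrape_xmp_packets(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (byte offset, raw packet) for every xpacket wrapper found.

    No file-format parsing is done, so the result says nothing about
    which packet (if any) is the document's current metadata.
    """
    return [(m.start(), m.group(0)) for m in XPACKET_RE.finditer(data)]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python scrape_xmp.py some_file
    with open(sys.argv[1], "rb") as fh:
        for offset, packet in scrape_xmp_packets(fh.read()):
            print(offset, len(packet))
```

Seeing more than one offset for a single PDF is a hint that the file carries several packets (for example one per update, or object-level metadata), and that a PDF-aware pass over the /Metadata entries is needed to tell them apart.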