Sounds like we might be extracting that info in the following line in Tika?
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java#L302

On Wed, Mar 17, 2021 at 2:03 PM sahy...@fileaffairs.de <sahy...@fileaffairs.de> wrote:
>
> Hi Leonard,
>
> attachments won't work on the mailing list - could you upload it to a
> public location or send it to me in person?
>
> BR
> Maruan
>
> On Wednesday, 17.03.2021 at 17:57 +0000, Leonard Rosenthol wrote:
> > Here is one that I have handy where there is XMP on the image...
> >
> > On 3/17/21, 1:44 PM, "sahy...@fileaffairs.de"
> > <sahy...@fileaffairs.de> wrote:
> >
> > Hi Leonard,
> >
> > if you could provide a sample document with XMPs attached to the
> > various PDF objects you're interested in, I could come up with a
> > quick sample for Tim.
> >
> > BR
> > Maruan
> >
> > On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote:
> > > Hi Leonard,
> > > I'm literally just scraping bytes out of files for now without any
> > > parsing...so if the XMP is concealed in a compressed stream or
> > > something more interesting, I'm not grabbing it. I'm also not
> > > tracking which XMP is associated with which object.
> > > Please forgive me...if I traverse the COSDocument's objects and
> > > look for /Metadata and grab the stream, will that be what you're
> > > looking for? Or, is there a commandline tool I can run to get what
> > > you're interested in?
> > > Thank you.
> > >
> > > Cheers,
> > >
> > > Tim
> > >
> > > On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> > > <lrose...@adobe.com.invalid> wrote:
> > > >
> > > > Are you only pulling document-level XMP? If so, could you extend
> > > > it to support object-level metadata as well? I, for one, would
> > > > love to get insight into the use of object-level metadata - what
> > > > objects are they attached to, what are they being used for, etc.
> > > >
> > > > Leonard
> > > >
> > > > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> > > >
> > > > All,
> > > >
> > > > I'm scraping XMPs out of our corpus and placing them here as
> > > > standalone files:
> > > >
> > > > https://corpora.tika.apache.org/base/xmps/
> > > >
> > > > I've binned the files roughly based on the container file's mime
> > > > type, e.g.
> > > > https://corpora.tika.apache.org/base/xmps/pdf/
> > > >
> > > > The process is still running, and I view this as a first draft.
> > > > Please let me know if there's anything I can do to make these
> > > > data easier to use/more useful or if you see any problems.
> > > >
> > > > Cheers,
> > > >
> > > > Tim
> > > >
> >
> > --
> > Maruan Sahyoun
> >
> > FileAffairs GmbH
> > Josef-Schappe-Straße 21
> > 40882 Ratingen
> >
> > Tel: +49 (2102) 89497 88
> > Fax: +49 (2102) 89497 91
> > sahy...@fileaffairs.de
> > http://www.fileaffairs.de/
> >
> > Managing Director: Maruan Sahyoun
> > Commercial Register: AG Düsseldorf, HRB 53837
> > VAT ID: DE248275827
>
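
For anyone following along, here is a rough Python sketch of the "scraping bytes out of files without any parsing" approach Tim describes in the quoted thread. It is not Tim's actual code: the function name `scrape_xmp_packets` and the regular expression are my own assumptions, and it relies only on the standard XMP packet wrapper (`<?xpacket begin=... ?>` through `<?xpacket end="w"?>` or `end="r"`). As noted in the thread, it will miss any XMP stored inside a compressed stream, and it does not record which object a packet belongs to.

```python
import re

# Matches one XMP packet: header processing instruction, body, trailer.
# Assumption: packets are stored uncompressed in a single-byte encoding
# such as UTF-8; UTF-16/32 packets and compressed streams are not found.
XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>.*?<\?xpacket end=[\"'][rw][\"']\?>",
    re.DOTALL,
)


def scrape_xmp_packets(data: bytes) -> list[bytes]:
    """Return every raw XMP packet found in data, in file order."""
    return [match.group(0) for match in XMP_PACKET.finditer(data)]


if __name__ == "__main__":
    import sys

    # Usage: python scrape_xmp.py some_file.pdf
    with open(sys.argv[1], "rb") as handle:
        for index, packet in enumerate(scrape_xmp_packets(handle.read())):
            print(f"packet {index}: {len(packet)} bytes")
```

Because this only sees the raw bytes, a file with both document-level and image-level XMP would simply yield two packets with no indication of which is which; answering Leonard's question about object-level metadata would need a real PDF parser on top.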