Sounds like we might be extracting that info in the following line in Tika?

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java#L302

On Wed, Mar 17, 2021 at 2:03 PM sahy...@fileaffairs.de
<sahy...@fileaffairs.de> wrote:
>
> Hi Leonard,
>
> Attachments don't work on the mailing list. Could you upload it to a
> public location or send it to me directly?
>
> BR
> Maruan
>
> On Wednesday, 17.03.2021 at 17:57 +0000, Leonard Rosenthol wrote:
> > Here is one that I have handy where there is XMP on the image...
> >
> > On 3/17/21, 1:44 PM, "sahy...@fileaffairs.de"
> > <sahy...@fileaffairs.de> wrote:
> >
> >     Hi Leonard,
> >
> >     If you could provide a sample document with XMPs attached to the
> >     various PDF objects you're interested in, I could come up with a
> >     quick sample for Tim.
> >
> >     BR
> >     Maruan
> >
> >     On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote:
> >     > Hi Leonard,
> >     >   I'm literally just scraping bytes out of files for now without
> >     > any parsing...so if the XMP is concealed in a compressed stream or
> >     > something more interesting, I'm not grabbing it.  I'm also not
> >     > tracking which XMP is associated with which object.
> >     >   Please forgive me...if I traverse the COSDocument's objects and
> >     > look for /Metadata and grab the stream, will that be what you're
> >     > looking for?  Or, is there a commandline tool I can run to get
> >     > what you're interested in?
> >     >   Thank you.
> >     >
> >     >   Cheers,
> >     >
> >     >               Tim
> >     >
> >     > On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> >     > <lrose...@adobe.com.invalid> wrote:
> >     > >
> >     > > Are you only pulling document-level XMP?  If so, could you extend
> >     > > it to support object-level metadata as well?  I, for one, would
> >     > > love to get insight into the use of object-level metadata - what
> >     > > objects are they attached to, what are they being used for, etc.
> >     > >
> >     > > Leonard
> >     > >
> >     > > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> >     > >
> >     > >     All,
> >     > >
> >     > >       I'm scraping XMPs out of our corpus and placing them here as
> >     > >     standalone files:
> >     > >
> >     > >     https://corpora.tika.apache.org/base/xmps/
> >     > >
> >     > >       I've binned the files roughly based on the container file's
> >     > >     mime type, e.g.
> >     > >     https://corpora.tika.apache.org/base/xmps/pdf/
> >     > >
> >     > >       The process is still running, and I view this as a first
> >     > >     draft.  Please let me know if there's anything I can do to make
> >     > >     these data easier to use/more useful or if you see any problems.
> >     > >
> >     > >       Cheers,
> >     > >
> >     > >                  Tim
> >     > >
> >
> >     --
> >     --
> >     Maruan Sahyoun
> >
> >     FileAffairs GmbH
> >     Josef-Schappe-Straße 21
> >     40882 Ratingen
> >
> >     Tel: +49 (2102) 89497 88
> >     Fax: +49 (2102) 89497 91
> >     sahy...@fileaffairs.de
> >
> >     http://www.fileaffairs.de/
> >
> >     Managing Director: Maruan Sahyoun
> >     Commercial Register: AG Düsseldorf, HRB 53837
> >     VAT ID: DE248275827
> >
> >
>
>
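
For anyone following along, the byte-scraping approach Tim describes in the quoted thread (no parsing, just pulling XMP packets out of the raw file) could look roughly like the sketch below. This is only an illustration, not the code actually used for the corpus: the class name XmpScraper and its methods are hypothetical, and it assumes the packet markers appear as plain ASCII/UTF-8 bytes. As Tim notes, anything inside a compressed stream is missed, and nothing here records which object a packet belongs to.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: scans raw bytes for XMP packet wrappers without parsing the container.
final class XmpScraper {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    /** Returns a copy of every byte range that looks like a complete XMP packet. */
    static List<byte[]> scrape(byte[] data) {
        List<byte[]> packets = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, BEGIN, pos);
            if (start < 0) break;
            // The trailer is "<?xpacket end=...?>"; find it after the header.
            int end = indexOf(data, END, start + BEGIN.length);
            if (end < 0) break;
            int close = indexOf(data, CLOSE, end + END.length);
            if (close < 0) break;
            int stop = close + CLOSE.length;
            packets.add(Arrays.copyOfRange(data, start, stop));
            pos = stop; // continue after this packet to pick up further ones
        }
        return packets;
    }

    /** Naive byte-array search; returns the first index >= from, or -1 if absent. */
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) j++;
            if (j == needle.length) return i;
        }
        return -1;
    }
}
```

Each returned array could then be written out as a standalone file. Getting at object-level metadata, as Leonard asks, would need a real PDF parser rather than this kind of scan.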
