> I'm literally just scraping bytes out of files for now without any
> parsing
>
ARGH!!!! Please don't do this - it will get you the wrong results in almost
all cases. Remember that in a PDF with updates, there can/will be a new XMP
block with each update.
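
To illustrate the point, here is a minimal sketch of what a raw byte scan sees in a file with one update. Everything in it is an assumption made for illustration: the byte string is a made-up stand-in rather than a valid PDF, and scrape_xmp_packets and fake_metadata_object are hypothetical helper names, not part of any tool mentioned in this thread.

```python
import re

# One XMP packet: from the "<?xpacket begin=" header through the
# "<?xpacket end=...?>" trailer.
XPACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def scrape_xmp_packets(data):
    """Return every XMP packet found by a raw byte scan, in file order."""
    return XPACKET.findall(data)


def fake_metadata_object(title):
    """Build a made-up, uncompressed metadata stream object (not valid PDF)."""
    packet = (
        b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
        b"<x:xmpmeta xmlns:x='adobe:ns:meta/'>" + title + b"</x:xmpmeta>"
        b'<?xpacket end="w"?>'
    )
    return (
        b"1 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
        + packet
        + b"\nendstream\nendobj\n"
    )


# Stand-in for an updated file: the original body, then an appended update
# that redefines the same metadata object with new content.
data = (
    b"%PDF-1.7\n"
    + fake_metadata_object(b"Draft")
    + b"%%EOF\n"
    + fake_metadata_object(b"Final")
    + b"%%EOF\n"
)

packets = scrape_xmp_packets(data)
for index, packet in enumerate(packets):
    print(index, packet)
```

The scan returns both packets with nothing to say that the first one was superseded, so taking the first hit yields the stale "Draft" metadata. It also finds nothing at all when the metadata stream is compressed.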


> if I traverse the COSDocument's objects and look for /Metadata and grab
> the stream, will that be what you're looking for?
>
Just getting those elements would be a great start. If you could also include
the rest of the dictionary in which it was found (or at least the /Type and
/Subtype keys, if present), that would be great!

Leonard

On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:

    Hi Leonard,
      I'm literally just scraping bytes out of files for now without any
    parsing...so if the XMP is concealed in a compressed stream or
    something more interesting, I'm not grabbing it.  I'm also not
    tracking which XMP is associated with which object.
      Please forgive me...if I traverse the COSDocument's objects and look
    for /Metadata and grab the stream, will that be what you're looking
    for?  Or, is there a commandline tool I can run to get what you're
    interested in?
      Thank you.

      Cheers,

                  Tim

    On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
    <lrose...@adobe.com.invalid> wrote:
    >
    > Are you only pulling document-level XMP?  If so, could you extend it to
    > support object-level metadata as well?  I, for one, would love to get
    > insight into the use of object-level metadata - what objects are they
    > attached to, what are they being used for, etc.
    >
    > Leonard
    >
    > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
    >
    >     All,
    >
    >       I'm scraping XMPs out of our corpus and placing them here as
    >     standalone files:
    >
    >     https://corpora.tika.apache.org/base/xmps/
    >
    >       I've binned the files roughly based on the container file's mime
    >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
    >
    >       The process is still running, and I view this as a first draft.
    >     Please let me know if there's anything I can do to make these data
    >     easier to use/more useful or if you see any problems.
    >
    >       Cheers,
    >
    >                  Tim
    >
