>    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
>
Isn't that why you are using the XMP Toolkit???

Leonard
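
For readers following the byte-scraping point quoted below, here is a minimal Python sketch of what scanning raw bytes for XMP packets looks like, and why it returns more than one block for a PDF with updates. It assumes packets sit uncompressed in the file and are wrapped in the usual <?xpacket begin=...?> / <?xpacket end=...?> markers. The names scrape_xmp_packets and fake_packet and the sample bytes are hypothetical; this is not the code used to build the corpus.

```python
import re
from typing import List

# An XMP packet is conventionally wrapped in a pair of xpacket
# processing instructions. This only matches packets stored as plain
# bytes; anything inside a compressed stream is invisible to it.
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>.*?<\?xpacket end=.*?\?>", re.DOTALL
)


def scrape_xmp_packets(data: bytes) -> List[bytes]:
    """Return every plain-text XMP packet in the raw bytes, in file order."""
    return XPACKET.findall(data)


def fake_packet(label: bytes) -> bytes:
    """Build a tiny stand-in packet (hypothetical test data, not real XMP)."""
    return (
        b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
        b'<x:xmpmeta xmlns:x="adobe:ns:meta/">' + label + b"</x:xmpmeta>"
        b'<?xpacket end="w"?>'
    )


# Stand-in for a PDF that was saved once and then incrementally updated:
# the old packet is still in the file, followed by the newer one.
sample = (
    b"%PDF-1.7\n" + fake_packet(b"original") + b"\n%%EOF\n"
    + fake_packet(b"updated") + b"\n%%EOF\n"
)

packets = scrape_xmp_packets(sample)
print(len(packets), "packets found")
```

Nothing in the scraped output says which packet is current or which object it belongs to; that information has to come from a PDF parser that resolves the /Metadata entries.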

On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:

    > ARGH!!!!   Please don't do this - it will get you the wrong results
    > in almost all cases.  Remember that in a PDF with updates, there
    > can/will be a new XMP block with each update.

    Ha, right.  I completely understand (perhaps _only_ this small point
    on PDFs).  On this pass, my goal was to see what was in the file at
    all, not what was the correct XMP. Part of my interest is in what's
    available in the file, but not available readily to the user.

    The other thing is that I wanted to scrape xmp out of files beyond PDFs.

    So, I can definitely take a second run where I let a PDF tool extract
    the correct XMP if there's interest in that.

    On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
    <lrose...@adobe.com.invalid> wrote:
    >
    > >      I'm literally just scraping bytes out of files for now without any parsing
    > >
    > ARGH!!!!   Please don't do this - it will get you the wrong results
    > in almost all cases.  Remember that in a PDF with updates, there
    > can/will be a new XMP block with each update.
    >
    >
    > > if I traverse the COSDocument's objects and look for /Metadata and
    > > grab the stream, will that be what you're looking for?
    > >
    > Just getting those elements would be a great start.  If you could also
    > include the rest of the dictionary in which it was found (or at least
    > the /Type and /Subtype keys, if present), that would be great!
    >
    > Leonard
    >
    > On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
    >
    >     Hi Leonard,
    >       I'm literally just scraping bytes out of files for now without any
    >     parsing...so if the XMP is concealed in a compressed stream or
    >     something more interesting, I'm not grabbing it.  I'm also not
    >     tracking which XMP is associated with which object.
    >       Please forgive me...if I traverse the COSDocument's objects and look
    >     for /Metadata and grab the stream, will that be what you're looking
    >     for?  Or, is there a commandline tool I can run to get what you're
    >     interested in?
    >       Thank you.
    >
    >       Cheers,
    >
    >                   Tim
    >
    >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
    >     <lrose...@adobe.com.invalid> wrote:
    >     >
    >     > Are you only pulling document-level XMP?  If so, could you extend
    >     > it to support object-level metadata as well?   I, for one, would
    >     > love to get insight into the use of object-level metadata - what
    >     > objects are they attached to, what are they being used for, etc.
    >     >
    >     > Leonard
    >     >
    >     > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
    >     >
    >     >     All,
    >     >
    >     >       I'm scraping XMPs out of our corpus and placing them here as
    >     >     standalone files:
    >     >
    >     >     https://corpora.tika.apache.org/base/xmps/
    >     >
    >     >       I've binned the files roughly based on the container file's mime
    >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
    >     >
    >     >       The process is still running, and I view this as a first draft.
    >     >     Please let me know if there's anything I can do to make these data
    >     >     easier to use/more useful or if you see any problems.
    >     >
    >     >       Cheers,
    >     >
    >     >                  Tim
    >     >
    >
