RE: Extract thumbnail from openxml office files

Hong-Thai Nguyen Thu, 09 Jan 2014 05:37:50 -0800

Hi Nick,
You're beginning a very interesting topic about the foundation of our metadata 
concept :)
I agree with you that metadata is not the best place to store the thumbnail 
result. Until now, our metadata has been a simple map of key:value pairs. This 
structure is not really flexible in some cases. For example, we might want to 
store authors' information, where each author has a first name and a last name.
Ideally, we could have something like a struct:
Person:
        FirstName
        LastName

Another example is our future thumbnail. We could have a 'thumbnail' metadata 
entry with a hierarchical structure like:
Thumbnail:
        Dimension
                Width
                Length
        MimeType
        Extension
        Pages
        Description
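
As a rough illustration of such a structure, a minimal Java sketch might look 
like the following. All class and field names here are hypothetical and not 
part of Tika, and the dotted key names in flatten() are only an assumed 
convention, used to show what the same data looks like once it is forced into 
a flat key:value map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Tika.
public class StructuredMetadataSketch {

    // Nested value holding the two sizes named in the outline above.
    static final class Dimension {
        final int width;
        final int length;

        Dimension(int width, int length) {
            this.width = width;
            this.length = length;
        }
    }

    // Mirrors the Thumbnail outline field by field.
    static final class Thumbnail {
        final Dimension dimension;
        final String mimeType;
        final String extension;
        final int pages;
        final String description;

        Thumbnail(Dimension dimension, String mimeType, String extension,
                  int pages, String description) {
            this.dimension = dimension;
            this.mimeType = mimeType;
            this.extension = extension;
            this.pages = pages;
            this.description = description;
        }
    }

    // The same data in a flat key:value map; dotted keys (an assumed
    // convention) are the only way left to express the nesting.
    static Map<String, String> flatten(Thumbnail t) {
        Map<String, String> flat = new LinkedHashMap<>();
        flat.put("Thumbnail.Dimension.Width", String.valueOf(t.dimension.width));
        flat.put("Thumbnail.Dimension.Length", String.valueOf(t.dimension.length));
        flat.put("Thumbnail.MimeType", t.mimeType);
        flat.put("Thumbnail.Extension", t.extension);
        flat.put("Thumbnail.Pages", String.valueOf(t.pages));
        flat.put("Thumbnail.Description", t.description);
        return flat;
    }

    public static void main(String[] args) {
        Thumbnail t = new Thumbnail(new Dimension(256, 192),
                "image/png", "png", 1, "First page");
        flatten(t).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```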

That would need a huge refactoring of our core model. Another solution is to 
keep the thumbnail result as a list, List<byte[]>, instead of a single value. 
Each element is the thumbnail of one page. If the list has only 1 element, it 
means there is only a thumbnail of the first page.
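
The list-based alternative could be sketched roughly as below. PageThumbnails 
and its methods are hypothetical names chosen for illustration, not an 
existing Tika class; the only part taken from the proposal is the List<byte[]> 
with one element per page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for per-page thumbnails; not an existing Tika class.
public class PageThumbnails {

    // Element i holds the encoded image bytes for page i + 1.
    private final List<byte[]> pages = new ArrayList<>();

    // Appends the thumbnail of the next page (defensive copy of the bytes).
    public void add(byte[] imageBytes) {
        pages.add(imageBytes.clone());
    }

    // Read-only view of the per-page thumbnails.
    public List<byte[]> asList() {
        return Collections.unmodifiableList(pages);
    }

    // A single element means only the first page has a thumbnail.
    public boolean isFirstPageOnly() {
        return pages.size() == 1;
    }

    public static void main(String[] args) {
        PageThumbnails thumbs = new PageThumbnails();
        thumbs.add(new byte[] {1, 2, 3}); // placeholder bytes, not a real image
        System.out.println("first page only: " + thumbs.isFirstPageOnly());
    }
}
```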

Hong-Thai

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org] 
Sent: Thursday, 9 January 2014 12:11
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> By searching on issues, I found the issue already created: 
> https://issues.apache.org/jira/browse/TIKA-90

I'm not sure if the metadata is the right place to return this. Some formats 
offer a small thumbnail, others can offer a small thumbnail for every page, and 
at least one can include a full-size image of the first page.

Would we not be better off exposing these embedded renderings via the existing 
embedded resources handling, with some sort of handy way to identify what 
something is (eg this is a full-size PNG of page 1, this is a jpg thumbnail of 
page 3)?
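
One way to picture the "handy way to identify what something is" is a small 
descriptor attached to each embedded rendering. The sketch below is purely 
illustrative: Kind, Rendering and describe() are hypothetical names and do not 
correspond to any existing Tika API.

```java
// Hypothetical descriptor for an embedded rendering; not a Tika API.
public class EmbeddedRenderingSketch {

    enum Kind { THUMBNAIL, FULL_SIZE }

    static final class Rendering {
        final Kind kind;
        final String mimeType;
        final int page; // 1-based page number the image depicts

        Rendering(Kind kind, String mimeType, int page) {
            this.kind = kind;
            this.mimeType = mimeType;
            this.page = page;
        }

        // Human-readable label, e.g. for logging what a resource is.
        String describe() {
            String label = (kind == Kind.FULL_SIZE) ? "full-size" : "thumbnail";
            return label + " " + mimeType + " of page " + page;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Rendering(Kind.FULL_SIZE, "image/png", 1).describe());
        System.out.println(new Rendering(Kind.THUMBNAIL, "image/jpeg", 3).describe());
    }
}
```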

Nick
