Grigorii Ioffe created TIKA-4466:
------------------------------------

             Summary: OPFParser extracts DublinCore fields partially
                 Key: TIKA-4466
                 URL: https://issues.apache.org/jira/browse/TIKA-4466
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.2.2
            Reporter: Grigorii Ioffe


I have an ePub file whose metadata is stored in an OPF file with multiple 
dc:identifier fields, but OPFParser extracts only the last one when parsing 
it. 

For example, if the OPF file inside the ePub contains these dc:identifier 
entries:


{code:xml}
    <dc:identifier>isbn:9780765350381</dc:identifier>
    <dc:identifier>mobi-asin:JD4PTHPBGIAQYZUBFUU3VFPVEUKY7S3U</dc:identifier>
    <dc:identifier>amazon:0765350386</dc:identifier>
    <dc:identifier>goodreads:243272</dc:identifier>
    <dc:identifier>calibre:55</dc:identifier>
    <dc:identifier>uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
    <dc:identifier id="uuid_id">uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
{code}
then only uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595 ends up in the parsed 
metadata.

According to the EPUB 3.3 spec's section on the Dublin Core identifier 
element, this is a valid situation, because dc:identifier is marked as 
repeatable:
[https://www.w3.org/TR/epub-33/#sec-opf-dcidentifier]
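
To illustrate the expected result, here is a minimal JDK-only sketch that collects every dc:identifier value from an OPF metadata block. It is not Tika code: the class name `OpfIdentifiers`, the method `extractIdentifiers` and the trimmed-down `SAMPLE_OPF` string (four of the identifiers listed above) are hypothetical names introduced only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: reads all dc:identifier values.
class OpfIdentifiers {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Trimmed-down OPF document with four of the identifiers from the report.
    static final String SAMPLE_OPF =
        "<package xmlns=\"http://www.idpf.org/2007/opf\" version=\"3.0\""
      + " unique-identifier=\"uuid_id\">"
      + "<metadata xmlns:dc=\"" + DC_NS + "\">"
      + "<dc:identifier>isbn:9780765350381</dc:identifier>"
      + "<dc:identifier>amazon:0765350386</dc:identifier>"
      + "<dc:identifier>calibre:55</dc:identifier>"
      + "<dc:identifier id=\"uuid_id\">"
      + "uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>"
      + "</metadata></package>";

    // Returns the text of every dc:identifier element, in document order.
    static List<String> extractIdentifiers(String opfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness is required for getElementsByTagNameNS.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(opfXml)));
            NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "identifier");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse OPF metadata", e);
        }
    }

    public static void main(String[] args) {
        // Prints one identifier per line; every entry is kept, not only the last.
        extractIdentifiers(SAMPLE_OPF).forEach(System.out::println);
    }
}
```

The point of the sketch is only that a repeatable element naturally maps to a list of values, which is what I would expect to find in the parsed metadata.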

My investigation showed that the field is created with PropertyType.SIMPLE here:
`org.apache.tika.metadata/DublinCore.class:60`
As a result, the check at
`org.apache.tika.metadata/Property.class:272`
returns false, so each entry overwrites the previously stored value instead 
of being appended to an array.
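
The effect can be modelled with a small self-contained sketch. This is a simplified stand-in and not Tika's actual `Metadata` or `Property` implementation: the class `MetadataModel` and its `multiValuePermitted` flag are assumptions made for illustration, with the flag playing the role of the check that returns false for a SIMPLE property.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a metadata store; NOT Tika's Metadata class.
class MetadataModel {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // multiValuePermitted models the check that is false for SIMPLE properties.
    void put(String name, boolean multiValuePermitted, String value) {
        if (multiValuePermitted) {
            // Multi-valued property: append to the existing list of values.
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        } else {
            // Single-valued property: replace whatever was stored before.
            List<String> single = new ArrayList<>();
            single.add(value);
            values.put(name, single);
        }
    }

    List<String> get(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        String[] ids = {"isbn:9780765350381", "calibre:55",
                        "uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595"};

        MetadataModel simple = new MetadataModel();
        MetadataModel multi = new MetadataModel();
        for (String id : ids) {
            simple.put("dc:identifier", false, id); // behaviour described in this report
            multi.put("dc:identifier", true, id);   // behaviour for a repeatable field
        }
        System.out.println("single-valued: " + simple.get("dc:identifier"));
        System.out.println("multi-valued:  " + multi.get("dc:identifier"));
    }
}
```

In this model the single-valued store keeps only the last identifier, which matches what I see in the parsed metadata, while the multi-valued store keeps all of them.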

 

This is also not the only field with an incorrect type definition. It looks 
like title, language, description and some other fields are also defined 
incorrectly (or at least parsed incorrectly in OPFParser and DcXMLParser).

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
