Grigorii Ioffe created TIKA-4466:
------------------------------------

             Summary: OPFParser extracts DublinCore fields partially
                 Key: TIKA-4466
                 URL: https://issues.apache.org/jira/browse/TIKA-4466
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.2.2
            Reporter: Grigorii Ioffe

I have an ePub file whose metadata is stored in an OPF file with multiple dc:identifier fields, but OPFParser extracts only the last one when parsing it.

For example, if the OPF file inside the ePub contains these dc:identifier entries:

{code:xml}
<dc:identifier>isbn:9780765350381</dc:identifier>
<dc:identifier>mobi-asin:JD4PTHPBGIAQYZUBFUU3VFPVEUKY7S3U</dc:identifier>
<dc:identifier>amazon:0765350386</dc:identifier>
<dc:identifier>goodreads:243272</dc:identifier>
<dc:identifier>calibre:55</dc:identifier>
<dc:identifier>uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
<dc:identifier id="uuid_id">uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
{code}

only uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595 ends up in the parsed metadata.

According to the Dublin Core spec this is a valid situation, because identifier is marked as repeatable: [https://www.w3.org/TR/epub-33/#sec-opf-dcidentifier]

My investigation showed that the field is created with PropertyType.SIMPLE here: `org.apache.tika.metadata/DublinCore.class:60`. As a result, `org.apache.tika.metadata/Property.class:272` returns false, and therefore each entry overwrites the previously stored value instead of being added to an array.

Also, this is not the only field with an incorrect type definition. It looks like title, language, description and some other fields are also defined incorrectly (or at least parsed incorrectly in OPFParser and DcXMLParser).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
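
To make the overwrite-versus-append behaviour in this report easier to see, here is a minimal stand-alone sketch. It is not Tika code: the class `IdentifierSketch`, the method `collect` and the `multiValued` flag are hypothetical names, the wrapping `<metadata>` element and its namespace declaration are assumed (the snippet in the report omits them), and the identifier list is shortened. It uses only the JDK's DOM parser and mimics a single-valued field by dropping earlier values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; none of these names exist in Tika.
class IdentifierSketch {

    // Dublin Core elements namespace used by the dc: prefix.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Shortened metadata block; the <metadata> wrapper is an assumption.
    static final String OPF =
        "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
        + "<dc:identifier>isbn:9780765350381</dc:identifier>"
        + "<dc:identifier>amazon:0765350386</dc:identifier>"
        + "<dc:identifier id=\"uuid_id\">uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>"
        + "</metadata>";

    // Collects dc:identifier values in document order.
    // multiValued == false models a single-valued field: each new value
    // replaces the previous one, so only the last entry survives.
    static List<String> collect(String xml, boolean multiValued) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "identifier");

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (!multiValued) {
                values.clear(); // overwrite instead of append
            }
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("single-valued: " + collect(OPF, false));
        System.out.println("multi-valued:  " + collect(OPF, true));
    }
}
```

With the single-valued setting only the final uuid entry is left, which matches what I see in the parsed metadata. With the multi-valued setting all entries are kept, which is what I would expect for a repeatable field.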