[ https://issues.apache.org/jira/browse/TIKA-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463822#comment-17463822 ]
Tim Allison commented on TIKA-3629:
-----------------------------------

In looking more carefully at [https://www.dublincore.org/specifications/dublin-core/usageguide/elements/], it looks like I erred in trying to isolate keywords from subject. Because some file formats distinguish between keywords and subject, I incorrectly removed the shared Dublin Core subject key from the keywords key.

I'll fix this in the next version (> 2.2.1) so that Dublin Core's subject contains both keywords and subject, but users can still get subject or keywords separately via the file-format-specific keys.

I'll also document this more clearly so that future devs don't repeat my mistake.

Sorry about this... :(

> Keywords are not extracted anymore from PDF documents
> -----------------------------------------------------
>
>                 Key: TIKA-3629
>                 URL: https://issues.apache.org/jira/browse/TIKA-3629
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.2.0
>            Reporter: David Pilato
>            Priority: Major
>
> Hey
>
> I'm seeing some changes (regressions?) in [Tika 2.2.0 (from
> 2.1.0)|https://github.com/dadoonet/fscrawler/pull/1330].
> When extracting content from Office files (docs, doc, rtf), {{cp:subject}} is
> not generated anymore. I'm not using this value anyway, so that may not be
> an issue at all but a feature ;)
>
> But for PDF documents, I'm no longer able to get the keywords for the
> document.
> I was reading the keywords with {{Office.KEYWORDS}}, but it's now null and I
> don't see this change documented in the wiki.
>
> Is that expected or a bug?
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)