[ 
https://issues.apache.org/jira/browse/TIKA-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

james closed TIKA-4395.
-----------------------
    Resolution: Not A Bug

It turns out that when parsing from an InputStream, Tika may not recognize
some OOXML files and will instead parse them as generic zip files, which
yields useless content. Writing the stream to a temp file first and parsing
that file lets Tika detect and parse it correctly as an OOXML file.
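
A minimal sketch of that workaround, using only the JDK so it stays independent of any particular Tika version. The class name `SpoolToTemp`, the method `parseViaTempFile`, and the `PathParser` callback are hypothetical names made up for illustration; they are not part of Tika.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SpoolToTemp {

    /** Hypothetical callback: whatever parses a file on disk (e.g. a Tika call). */
    @FunctionalInterface
    interface PathParser<T> {
        T parse(Path file) throws Exception;
    }

    /**
     * Copies the stream to a temp file, hands that file to the parser,
     * and deletes the temp file afterwards, even if parsing fails.
     */
    static <T> T parseViaTempFile(InputStream in, String suffix, PathParser<T> parser)
            throws Exception {
        Path tmp = Files.createTempFile("tika-spool-", suffix);
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return parser.parse(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Assuming the Tika facade class is on the classpath, usage could look like `SpoolToTemp.parseViaTempFile(in, ".pptx", p -> new org.apache.tika.Tika().parseToString(p))`. The caller is still responsible for closing the original stream.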

> cannot get any slide content for pptx file
> ------------------------------------------
>
>                 Key: TIKA-4395
>                 URL: https://issues.apache.org/jira/browse/TIKA-4395
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.3, 3.1.0
>            Reporter: james
>            Priority: Major
>
> I have a reasonably large pptx file from which I don't get any slide
> content. I get slide notes and some OCR from embedded images, but not the
> slide content itself. Unfortunately, I cannot share the file, but I can
> answer questions about it if necessary (and can probably share some of the
> files related to its internal structure).
>
> Using POI 5.4.0, not in streaming mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
