[ 
https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594074#comment-13594074
 ] 

Tejas Patil commented on TIKA-245:
----------------------------------

I am working on NUTCH-1454 and I am observing that Tika is not able to extract 
content from CHM documents. (I tried several CHM files, but it worked for 
none of them.) A CHM viewer, however, could show the entire contents of each 
file. I am not the only one facing this issue (see 
[here|http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-tp3999735p4001245.html]).

> Support of CHM Format
> ---------------------
>
>                 Key: TIKA-245
>                 URL: https://issues.apache.org/jira/browse/TIKA-245
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.10
>
>         Attachments: TIKA-245.oleg.20110806.PATCH, 
> TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt, 
> TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
>
>
> It might be a good idea to support the Windows CHM file format. Some 
> information about extracting it is available at 
> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML 
> The CHM format contains HTML files, which can be parsed by Tika. So the "only" 
> problem is to extract the data from the CHM file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
