Re: [jira] [Commented] (TIKA-245) Support of CHM Format

Oleg Tikhonov Sat, 22 Oct 2011 00:56:08 -0700

Hi Tran Nam Quang,
Currently our CHM extractor skips all entries that are not HTML.
It would be great if you could list the entry types you would like
extracted. In addition, if you can, please attach the CHM files you're
working with.


BR,
Oleg



On Sat, Oct 22, 2011 at 8:08 AM, Tran Nam Quang (Commented) (JIRA) <
[email protected]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133260#comment-13133260]
>
> Tran Nam Quang commented on TIKA-245:
> -------------------------------------
>
> @ Oleg
> I tested the CHM parser from Tika 0.10 on a few sample CHM files and found
> that many valid CHM entries are skipped. For comparison, I ran the same test
> with the chm4j library, which does _not_ skip these entries. Do you know
> about this problem?
>
> > Support of CHM Format
> > ---------------------
> >
> >                 Key: TIKA-245
> >                 URL: https://issues.apache.org/jira/browse/TIKA-245
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: parser
> >         Environment: All
> >            Reporter: Karl Heinz Marbaise
> >            Assignee: Chris A. Mattmann
> >            Priority: Minor
> >             Fix For: 0.10
> >
> >         Attachments: TIKA-245.oleg.20110806.PATCH,
> TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt,
> TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
> >
> >
> > It might be a good idea to support the CHM file format of Windows. Some
> information is available at
> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML.
> The CHM format contains HTML files which can be parsed by Tika. So the
> "only" problem is to extract the data from the CHM file.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
