[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144865#comment-13144865 ] Joseph Vychtrle commented on TIKA-772: Funny thing Jukka, I will talk to Cedric Beust

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144862#comment-13144862 ] Jukka Zitting commented on TIKA-772: The metacharacters you mention do sound suspicious.

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144855#comment-13144855 ] Joseph Vychtrle commented on TIKA-772: Attached... I'm on Linux, using UTF-8 encoding

[jira] [Updated] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Vychtrle updated TIKA-772: Attachment: it.html > media type detection fails for html documents, results in text/plain inst

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144854#comment-13144854 ] Jukka Zitting commented on TIKA-772: The test case you added prints out "text/html" for

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144853#comment-13144853 ] Joseph Vychtrle commented on TIKA-772: But to be honest, it makes sense. Tika doesn't

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144851#comment-13144851 ] Joseph Vychtrle commented on TIKA-772: Weird, {noformat} java -jar tika-app-0.10.jar -

Re: [VOTE] Apache Tika 1.0 release rc #1

2011-11-05 Thread Dave Meikle
Hi Chris, On 4 November 2011 15:42, Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov> wrote: > Please vote on releasing this package as Apache Tika 1.0. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [X] +1 Releas

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144849#comment-13144849 ] Jukka Zitting commented on TIKA-772: The latter method also makes the .html suffix avail

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144840#comment-13144840 ] Joseph Vychtrle commented on TIKA-772: Got it, if I do {code}tika.detect(TikaInputStr

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144836#comment-13144836 ] Jukka Zitting commented on TIKA-772: I piped the files to tika-app to prevent it from se

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144828#comment-13144828 ] Joseph Vychtrle commented on TIKA-772: MimeType detector doesn't find it, name of the

[jira] [Updated] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Vychtrle updated TIKA-772: Attachment: tika.png I don't know then. Take a look at my results with Tika v0.10

[jira] [Resolved] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-772. Resolution: Cannot Reproduce Assignee: Jukka Zitting Works for me: {code} $ for f in *.html; d

[jira] [Updated] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Vychtrle updated TIKA-772: Attachment: html.zip > media type detection fails for html documents, results in text/plain ins

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Joseph Vychtrle (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144772#comment-13144772 ] Joseph Vychtrle commented on TIKA-772: Hey Jukka, I found it happened only for html

[jira] [Commented] (TIKA-772) media type detection fails for html documents, results in text/plain instead of text/html

2011-11-05 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144763#comment-13144763 ] Jukka Zitting commented on TIKA-772: Can you attach an example document that illustrates

Re: [VOTE] Apache Tika 1.0 release rc #1

2011-11-05 Thread Christian Goeller
+1 BR Christian From: Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov> To: dev@tika.apache.org Cc: u...@tika.apache.org Sent: Fri, 04 Nov 2011 16:42:29 +0100 Subject: [VOTE] Apache Tika 1.0 release rc #1 Hi Folk

Re: Multilingual Tika

2011-11-05 Thread Michael McCandless
I would love to see better integration w/ dynamic languages! I can help on the Python side. Can we simply wrap Tika's APIs using jcc, to expose in Python? Ooh, it's already been done: http://redmine.djity.net/projects/pythontika/wiki Mike McCandless http://blog.mikemccandless.com 2011/11/5 Jé

Re: Multilingual Tika

2011-11-05 Thread Jérôme Charron
> > I totally am. I've got some PHP skillz and Python skillz > that I would be willing to throw into the mix here. > Yes, I have some basic skillz on Python, and some advanced skillz on PHP, so I can help you! > One other thing along these lines I've had in mind for a while: > how cool would it b

[jira] [Commented] (TIKA-529) IBM420 charset detection's isLamAlef is allocation-happy

2011-11-05 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144652#comment-13144652 ] Michael McCandless commented on TIKA-529: This patch looks safe, and avoids crazy a