[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914482#comment-16914482 ] Ken Krugler commented on TIKA-1599: --- From TIKA-2928, an example of text that fails with

[jira] [Updated] (TIKA-1599) Switch from TagSoup to JSoup

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-1599: -- Priority: Major (was: Minor) > Switch from TagSoup to JSoup > > >

[jira] [Commented] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914481#comment-16914481 ] Ken Krugler commented on TIKA-2928: --- Hi [~Sargent_D] - thanks for trying this out! I'm g

[jira] [Updated] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-22 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2928: -- Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) > Less than sign within tag boun

[jira] [Commented] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-22 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913382#comment-16913382 ] Ken Krugler commented on TIKA-2928: --- The issue isn't that this is "somewhat non-standard

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869004#comment-16869004 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - I finally got around to

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856107#comment-16856107 ] Ken Krugler commented on TIKA-2790: --- [~talli...@apache.org] - I'd have to look at the co

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856052#comment-16856052 ] Ken Krugler commented on TIKA-2790: --- Yalder processes the entire string. I thought Optim

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-05-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836738#comment-16836738 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - thanks for running the

[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812492#comment-16812492 ] Ken Krugler commented on TIKA-2849: --- Hi [~boris-petrov] - two things here. First, do you

[jira] [Commented] (TIKA-2794) Tika extracts text from pdf on MacBook, but not windows server.,

2018-12-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710767#comment-16710767 ] Ken Krugler commented on TIKA-2794: --- Hi [~phallett] - it's better if you first post some

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707822#comment-16707822 ] Ken Krugler commented on TIKA-2790: --- [~talli...@apache.org] - I've compared Yalder to Op

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707521#comment-16707521 ] Ken Krugler commented on TIKA-2790: --- Yalder is about 2-2.5x faster than language-detecto

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707343#comment-16707343 ] Ken Krugler commented on TIKA-2790: --- My concern with OpenNLP is that during a web crawl,

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707292#comment-16707292 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - Is there an issue with

[jira] [Commented] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658028#comment-16658028 ] Ken Krugler commented on TIKA-2758: --- [~markus17] - My comment above was about the previo

[jira] [Comment Edited] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657976#comment-16657976 ] Ken Krugler edited comment on TIKA-2758 at 10/20/18 7:51 PM: -

[jira] [Commented] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657976#comment-16657976 ] Ken Krugler commented on TIKA-2758: --- At least for the "detroidnews.html" file, I believe

[jira] [Resolved] (TIKA-2683) Missing space and inappropriate new-line in Boilerpipe extracted text

2018-07-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-2683. --- Resolution: Fixed Fixed via [PR #243|https://github.com/apache/tika/commit/8851d511c4768a3200eafa0623

[jira] [Assigned] (TIKA-2683) Missing space and inappropriate new-line in Boilerpipe extracted text

2018-07-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-2683: - Assignee: Ken Krugler > Missing space and inappropriate new-line in Boilerpipe extracted text > -

[jira] [Commented] (TIKA-2648) mime detection based on resource name detects resources as "text/x-php" instead of "text/html"

2018-07-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536396#comment-16536396 ] Ken Krugler commented on TIKA-2648: --- [~wastl-nagel] - you mentioned that you thought thi

[jira] [Updated] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2671: -- Description: org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's metadata. So when

[jira] [Updated] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2671: -- Component/s: detector > HtmlEncodingDetector doesnt take provided metadata into account > --

[jira] [Commented] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516644#comment-16516644 ] Ken Krugler commented on TIKA-2671: --- Hi [~gbouchar] - I'm curious how much testing you d

[jira] [Commented] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514355#comment-16514355 ] Ken Krugler commented on TIKA-2671: --- Unfortunately there's no great solution here. Ideal

[jira] [Commented] (TIKA-2654) Installation issue

2018-05-29 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493927#comment-16493927 ] Ken Krugler commented on TIKA-2654: --- Hi Ankit - for problems encountered while building/

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482586#comment-16482586 ] Ken Krugler commented on TIKA-2643: --- When you've got conflicting jars on the classpath, y

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481791#comment-16481791 ] Ken Krugler commented on TIKA-2643: --- Looking at the crash log, I see the following duplic

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481786#comment-16481786 ] Ken Krugler commented on TIKA-2643: --- Hi [~fyemaple] - how do you know that Tika 1.5 (or a

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479468#comment-16479468 ] Ken Krugler commented on TIKA-2643: --- [~fyemaple] - yes, but note that {{kill -QUIT doesn

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477811#comment-16477811 ] Ken Krugler commented on TIKA-2643: --- [~talli...@apache.org] - different versions of frame

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477513#comment-16477513 ] Ken Krugler commented on TIKA-2643: --- If I was going to guess, it's that your Cloudera ins

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384242#comment-16384242 ] Ken Krugler commented on TIKA-2592: --- [~AndreasMeier] - I assume when you said: {quote}I d

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Attachment: IANA Charset names.txt > HTML with charset unicode handled as utf-16 instead utf-8 >

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Priority: Minor (was: Major) > HTML with charset unicode handled as utf-16 instead utf-8 > -

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Issue Type: Improvement (was: Bug) > HTML with charset unicode handled as utf-16 instead utf-8 > ---

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-01 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382330#comment-16382330 ] Ken Krugler commented on TIKA-2592: --- Before making this kind of change (default "unicode"

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-02-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380874#comment-16380874 ] Ken Krugler commented on TIKA-2592: --- Hi [~AndreasMeier] - actually "unicode" is a support

[jira] [Commented] (TIKA-2576) Add application/zstd detection and parser

2018-02-27 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379747#comment-16379747 ] Ken Krugler commented on TIKA-2576: --- [~talli...@mitre.org] - After some grepping, I found

[jira] [Commented] (TIKA-2576) Add application/zstd detection and parser

2018-02-26 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377744#comment-16377744 ] Ken Krugler commented on TIKA-2576: --- Is this going to trigger more warnings in the logs?

[jira] [Resolved] (TIKA-2539) TagSoup HTML parser is project EOL

2018-01-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-2539. --- Resolution: Duplicate > TagSoup HTML parser is project EOL > -- > >

[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215838#comment-16215838 ] Ken Krugler commented on TIKA-2478: --- Hi [~talli...@apache.org] - I've attached two mixed

[jira] [Updated] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2478: -- Attachment: mixed-simple mixed-with-pdf-inline > MBOX import includes redundant copies of

[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-22 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214491#comment-16214491 ] Ken Krugler commented on TIKA-2478: --- I recently had to dig into extracting text from emai

[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213150#comment-16213150 ] Ken Krugler commented on TIKA-2471: --- Hi [~talli...@apache.org] - I don't think using MBox

[jira] [Commented] (TIKA-2482) java.lang.NoSuchMethodError at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)

2017-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212870#comment-16212870 ] Ken Krugler commented on TIKA-2482: --- Hi [~cermar] - in general it's best to first post th

[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195386#comment-16195386 ] Ken Krugler commented on TIKA-2472: --- I had to deal with this before in another project -

[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors

2016-08-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423280#comment-15423280 ] Ken Krugler commented on TIKA-2056: --- Hi [~chrismattmann] - I haven't actually dealt with

[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2038: -- Description: Currently, Tika uses icu4j for detecting charset encoding of HTML documents as well as the

[jira] [Commented] (TIKA-2033) Value attributes of input elements not extracted from HTML

2016-07-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378434#comment-15378434 ] Ken Krugler commented on TIKA-2033: --- Yes, of course...I was thinking of whether we'd want

[jira] [Commented] (TIKA-2033) Value attributes of input elements not extracted from HTML

2016-07-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378358#comment-15378358 ] Ken Krugler commented on TIKA-2033: --- Do you have a suggestion for how the text should app

[jira] [Commented] (TIKA-2010) Unable to get value when header is incorrect

2016-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332124#comment-15332124 ] Ken Krugler commented on TIKA-2010: --- OK - I think then we'll want to escalate [TIKA-1599]

[jira] [Updated] (TIKA-2010) Unable to get value when header is incorrect

2016-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2010: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Unable to get value when heade

[jira] [Commented] (TIKA-2010) Unable to get value when header is incorrect

2016-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331829#comment-15331829 ] Ken Krugler commented on TIKA-2010: --- Would it be possible for you to try this broken HTML

[jira] [Closed] (TIKA-1938) HtmlParser drops

2016-05-10 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler closed TIKA-1938. - Resolution: Fixed Fix with commit da5bbbe..46d5775. Thanks Joseph! > HtmlParser drops elements found ins

[jira] [Assigned] (TIKA-1938) HtmlParser drops

2016-05-10 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-1938: - Assignee: Ken Krugler > HtmlParser drops elements found inside > -

[jira] [Commented] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-04-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227078#comment-15227078 ] Ken Krugler commented on TIKA-1835: --- I’d rolled in Markus’s patch directly to support the

[jira] [Updated] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

2016-03-30 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-1896: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Invalid closing script tag not

[jira] [Commented] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

2016-03-30 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218412#comment-15218412 ] Ken Krugler commented on TIKA-1896: --- Hi Tim - hmm, changing the type of the script tag fr

[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-03-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202149#comment-15202149 ] Ken Krugler commented on TIKA-1855: --- In general I'd still prefer to keep test data with i

[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167891#comment-15167891 ] Ken Krugler commented on TIKA-1855: --- The things I don't like about this approach are that

[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15165642#comment-15165642 ] Ken Krugler commented on TIKA-1855: --- I'm ok with having some duplicated test files - thou

[jira] [Commented] (TIKA-1858) Unable to extract content from chunked portion of large file

2016-02-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150618#comment-15150618 ] Ken Krugler commented on TIKA-1858: --- Hi Raghu, This is a great question for the user mai

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-12 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145135#comment-15145135 ] Ken Krugler commented on TIKA-1851: --- +1 for the proposal. Let me know if you want me to t

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-10 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141632#comment-15141632 ] Ken Krugler commented on TIKA-1851: --- Hi [~talli...@apache.org] - thanks for generating th

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136079#comment-15136079 ] Ken Krugler commented on TIKA-1851: --- After poking around a bit, my vote would be to (a) m

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136077#comment-15136077 ] Ken Krugler commented on TIKA-1723: --- OK, I've committed this code to a new tika-langdetec

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136003#comment-15136003 ] Ken Krugler commented on TIKA-1851: --- I got a clean build w/o any pre-installed modules, s

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135342#comment-15135342 ] Ken Krugler commented on TIKA-1851: --- Hmm, now the top-level build fails on the tika parse

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135336#comment-15135336 ] Ken Krugler commented on TIKA-1851: --- I did a top-level "mvn clean install", which failed

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133629#comment-15133629 ] Ken Krugler commented on TIKA-1851: --- I'm also curious why we have Groovy code and shell s

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133624#comment-15133624 ] Ken Krugler commented on TIKA-1851: --- Hi [~talli...@apache.org] - I'm also getting a local

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132961#comment-15132961 ] Ken Krugler commented on TIKA-1723: --- Good idea re gathering input - I just emailed the de

[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131749#comment-15131749 ] Ken Krugler commented on TIKA-1824: --- As someone who regularly deals with 100s of jars in

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130676#comment-15130676 ] Ken Krugler commented on TIKA-1723: --- [~talli...@apache.org] I must admit, focusing on thi

[jira] [Commented] (TIKA-1848) Address issues with Tika 1.12rc#1

2016-02-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130666#comment-15130666 ] Ken Krugler commented on TIKA-1848: --- Unless I'm not understanding the issues properly, I

[jira] [Comment Edited] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1558#comment-1558 ] Ken Krugler edited comment on TIKA-1835 at 1/21/16 7:36 PM: Git

[jira] [Resolved] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-1835. --- Resolution: Fixed Git commit 489ab93..fe841bc > LinkContentHandler skips iframe and rel tags > ---

[jira] [Assigned] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-1835: - Assignee: Ken Krugler > LinkContentHandler skips iframe and rel tags > ---

[jira] [Commented] (TIKA-1838) Just a quick question regarding compatibility

2016-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109054#comment-15109054 ] Ken Krugler commented on TIKA-1838: --- Hi Raymond - this is a question that you should post

[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106908#comment-15106908 ] Ken Krugler commented on TIKA-1836: --- This seems to be an issue for POI, as per the messag

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048819#comment-15048819 ] Ken Krugler commented on TIKA-1599: --- I think we'd be wanting to parse the raw crawl resul

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048806#comment-15048806 ] Ken Krugler commented on TIKA-1599: --- Hi [~markus.jel...@openindex.io] - I was actually ta

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048773#comment-15048773 ] Ken Krugler commented on TIKA-1599: --- I'm hoping we could use one or the other, as I don't

[jira] [Commented] (TIKA-1808) Head section closed too eager

2015-12-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047029#comment-15047029 ] Ken Krugler commented on TIKA-1808: --- Hi Markus - I don't think this is actually a bug. I

[jira] [Commented] (TIKA-1794) TXTParser removes form feed characters

2015-11-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006797#comment-15006797 ] Ken Krugler commented on TIKA-1794: --- Tika uses XHTML 1.0, which doesn't allow the form-fe

[jira] [Commented] (TIKA-1794) TXTParser removes form feed characters

2015-11-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006743#comment-15006743 ] Ken Krugler commented on TIKA-1794: --- The output of the Tika parse process is XHTML, and I

[jira] [Commented] (TIKA-1443) Add a junk text detector to Tika

2015-10-31 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984111#comment-14984111 ] Ken Krugler commented on TIKA-1443: --- Hi [~talli...@apache.org] - I did look at it, and re

[jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901434#comment-14901434 ] Ken Krugler commented on TIKA-1726: --- [~talli...@apache.org] had asked for input on this -

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729595#comment-14729595 ] Ken Krugler commented on TIKA-1723: --- Biggest remaining issue before I commit is how to de

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729588#comment-14729588 ] Ken Krugler commented on TIKA-1723: --- Hi Tim, 1. Not sure about "Make language detection

[jira] [Commented] (TIKA-491) Add language identification support for Norwegian Bokmål and Norwegian Nynorsk

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729432#comment-14729432 ] Ken Krugler commented on TIKA-491: -- Currently the language-detector library I'm integrating

[jira] [Assigned] (TIKA-491) Add language identification support for Norwegian Bokmål and Norwegian Nynorsk

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-491: Assignee: Ken Krugler > Add language identification support for Norwegian Bokmål and Norwegian Nynors

[jira] [Commented] (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729427#comment-14729427 ] Ken Krugler commented on TIKA-492: -- Currently the language-detector library I'm integrating

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729416#comment-14729416 ] Ken Krugler commented on TIKA-856: -- The language-detector project has support for Japanese,

[jira] [Commented] (TIKA-568) Language Detection isReasonablyCertain() hides valuable information

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729414#comment-14729414 ] Ken Krugler commented on TIKA-568: -- The new LanguageDetector API has a getRawScore() call o

[jira] [Assigned] (TIKA-568) Language Detection isReasonablyCertain() hides valuable information

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-568: Assignee: Ken Krugler > Language Detection isReasonablyCertain() hides valuable information > ---

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729250#comment-14729250 ] Ken Krugler commented on TIKA-1723: --- Regarding the current detection code... I'm going t

[jira] [Updated] (TIKA-1723) Integrate language-detector into Tika

2015-09-01 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-1723: -- Attachment: TIKA-1723-3.patch New patch which uses Locale to handle language names (language tags). > In

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-09-01 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726266#comment-14726266 ] Ken Krugler commented on TIKA-1723: --- Hi Tim - I just attached a new version of my patch,
