OCR dataset

2024-04-03 Thread Ken Krugler
Hi devs, I saw this dataset on Hugging Face, seems useful for evaluating Tika OCR… — Ken https://huggingface.co/datasets/pixparse/idl-wds

Magika file type detection

2024-02-19 Thread Ken Krugler
Hi Tika devs, Check out Magika at https://github.com/google/magika Wondering if we could leverage Deeplearning4j to run the model from that project. — Ken -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink & Pinot

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Ken Krugler
) Keep Java 11 in "main"/3.x now and set the EOL for Tika 2.x/Java 8 in say > 6 months or fewer? > > Thank you, all, for your feedback! > > Best, > > Tim > > -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Ken Krugler
e, Sep 12, 2023 at 10:49 AM Tim Allison <mailto:talli...@apache.org>> wrote: >> >If Tika users will be happy to move on and drop Java 8 and/or javax. Please >> >drop them :))) >> >> Fellow devs and broader Tika community, are we ok with EOL'ing Tika

Re: Support page?

2023-05-12 Thread Ken Krugler
with > ASF projects. I'd want to copy the header pretty much literally about > no endorsements, etc. > What would you think of adding something similar to our wiki or our website? > >Best, > > Tim -- Ken Krugler http://www

Re: [VOTE] Release Apache Tika 2.2.0 Candidate #1

2021-12-13 Thread Ken Krugler
s successfully. >> >>> >>> [X] +1 Release this package as Apache Tika 2.2.0 >> >> I did notice that the tika DL's module(s) are pulling in the entire Hadoop >> dependency chain. I wonder if we can cut down on this... that is however a >> concern outside of this release candidate review. >> >> Thanks for the quick turnaround. >> lewismc >> -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Proposed topics for next Tika meetups?

2021-11-09 Thread Ken Krugler
> a) tika-pipes hands-on workshop > b) get to know the users -- 5 minute go-around the room "this is how > we use it; these are our pain points" > c) ??? > > Again, thank you! > > Best, > > Tim -- Ken K

Re: surefire and system.exit

2021-07-28 Thread Ken Krugler
est this with more > recent versions of the surefire plugin, or is there a recommended > workaround? > > Thank you. > >Best, > > Tim > > [0] > http://maven.apache.org/surefire/maven-surefire-plugin/faq.html#vm-termination --

Re: 1.27?

2021-06-30 Thread Ken Krugler
M Nicholas DiPiazza >>> wrote: >>>> >>>> +1 on 1.27 release. >>>> >>>> On Mon, Jun 28, 2021, 10:57 AM Tim Allison wrote: >>>>> >>>>> All, >>>>> The recent release of PDFBox fixed 2 DoS CVEs

Re: high level parser module names in 2.x

2021-03-09 Thread Ken Krugler
dency, etc. > > Some options for classic-> basic, base, ...what else? > > Any other recommendations for these names? Thank you! > > Best, > > Tim -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [jira] [Commented] (TIKA-3292) Remove GSON where possible in 2.x

2021-02-04 Thread Ken Krugler
Allison >> Priority: Minor >>Fix For: 2.0.0 >> >> >> We or our dependencies use 4? json parsers last time I looked. It feels like >> a majority of our dependencies use jackson. I used to have a preference for >> GSON, which is why we h

Re: [VOTE] Release Apache Tika 1.25 Candidate #2

2020-11-25 Thread Ken Krugler
a> > > Please vote on releasing this package as Apache Tika 1.25. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.25 > [ ] -1 Do not release this package bec

Re: More issues with top-level build for Tika 1.25 rc1 - Waited more than 5 minutes for a SAXParser

2020-11-23 Thread Ken Krugler
rent SAXParser which is not handled correctly in > XMLReaderUtils? What OS, what version of java? > > Thank you, again. > > Best, > > Tim > > On Mon, Nov 23, 2020 at 1:40 PM Ken Krugler > wrote: > >> Hi

Re: More issues with top-level build for Tika 1.25 rc1 - Waited more than 5 minutes for a SAXParser

2020-11-23 Thread Ken Krugler
131-b11, mixed mode) — Ken > On Mon, Nov 23, 2020 at 1:40 PM Ken Krugler > wrote: > >> Hi all, >> >> I got past the JCE issue, but now some tests are failing with timeouts. >> >> For this test: >> >> [INFO] Running org.apache.tika.parser.micr

More issues with top-level build for Tika 1.25 rc1 - Waited more than 5 minutes for a SAXParser

2020-11-23 Thread Ken Krugler
asing the XMLReaderUtils.POOL_SIZE Nov 21, 2020 10:39:07 PM org.apache.tika.utils.XMLReaderUtils acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE … and so on… Any suggestions? Thanks! — Ken -- Ken Krugler http://www.scaleunlimite

Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

2020-11-21 Thread Ken Krugler
e/lib/security/ > sudo cp ~/Downloads/UnlimitedJCEPolicyJDK8/local_policy.jar > $JAVA_HOME/jre/lib/security/ — Ken > > On Fri, Nov 20, 2020 at 1:43 PM Ken Krugler > wrote: > >> Hi all, >> >> I was trying to build the 1.25-rc1 branch, and ran into this same issue >&

Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

2020-11-20 Thread Ken Krugler
ache.org/jira/browse/TIKA-2917 >>>Project: Tika >>> Issue Type: Improvement >>> Reporter: Tim Allison >>> Assignee: Tim Allison >>> Priority: Minor >>> >>> Inline images may have XMP associated with them. We are not currently >>> extracting this metadata. >> >> >> >> -- >> This message was sent by Atlassian JIRA >> (v7.6.14#76016) -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Ken Krugler
ime to start working on integrating Bob's >> >> work on the current main branch. I'll have to ignore most of the incoming >> >> issues for a bit...unlike the last 4 years...this time I mean it. :) >> >> Let me know if there are any objections to heading down this path now. >> >> >> >> Cheers, >> >> >> >> Tim -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Grant write access to our wiki to Eric Pugh

2019-10-30 Thread Ken Krugler
a wiki page, making a whole sale set of > edits, getting review of those edits from the community, and assuming it > passes muster, then bringing the edits back to the original page? > > > > Eric > >> On Oct 29, 2019, at 7:00 PM, Ken Krugler wrote: >> &g

Re: Grant write access to our wiki to Eric Pugh

2019-10-29 Thread Ken Krugler
ve for > change notifications double-check!) > > Thanks > Nick -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: HTML to PDF conversion

2019-10-16 Thread Ken Krugler
ent(someImage); > creator.complete(); > > It would be consistent with the Tika approach on the read side. > > Cheers, Sergey > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler wrote: > >> If you’re suggesting ways to make it easier to use something like >> YaHPConver

Re: HTML to PDF conversion

2019-10-14 Thread Ken Krugler
o the text to PDF > (for a start, something on top of that transformer), and then may be even > for other formats ? > > Sergey -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [ANNOUNCE] Welcome Tilman Hausherr as Tika PMC member and committer

2019-10-04 Thread Ken Krugler
Cheers, > > Tim ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914482#comment-16914482 ] Ken Krugler commented on TIKA-1599: --- From TIKA-2928, an example of text tha

[jira] [Updated] (TIKA-1599) Switch from TagSoup to JSoup

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-1599: -- Priority: Major (was: Minor) > Switch from TagSoup to JS

[jira] [Commented] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914481#comment-16914481 ] Ken Krugler commented on TIKA-2928: --- Hi [~Sargent_D] - thanks for trying this out!

[jira] [Updated] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-22 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2928: -- Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) > Less than sign within

[jira] [Commented] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-22 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913382#comment-16913382 ] Ken Krugler commented on TIKA-2928: --- The issue isn't that this is "

Re: [ANNOUNCE] Apache Tika 1.22 released

2019-08-02 Thread Ken Krugler
ika.apache.org/ > > -- Tim Allison, on behalf of the Apache Tika community -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: 1.22?

2019-07-17 Thread Ken Krugler
+1 — Ken > On Jul 15, 2019, at 2:37 PM, Tim Allison wrote: > > Anyone have anything they want to get into 1.22? If not, I’ll kick off the > regression tests shortly. > > Cheers, > Tim ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimite

Re: Detection of plain text files

2019-06-25 Thread Ken Krugler
failed. > > In short, this is an area for improvement. I suspect our current > mechanism would also be pretty awful on UTF-16. > > On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler > wrote: >> >> Hi devs, >> >> I’m trying to remember the history of how Tika’s cu

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869004#comment-16869004 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - I finally got ar

Detection of plain text files

2019-06-18 Thread Ken Krugler
-- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856107#comment-16856107 ] Ken Krugler commented on TIKA-2790: --- [~talli...@apache.org] - I'd have to lo

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856052#comment-16856052 ] Ken Krugler commented on TIKA-2790: --- Yalder processes the entire string. I tho

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-05-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836738#comment-16836738 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - thanks for running

Re: Wiki migration

2019-04-17 Thread Ken Krugler
ing wiki migration (from moin to >> confluence)? >>> >>> I can try it via selfservice.a.o if you consent but I'm not sure if I >> have >>> enough access to do so. Maybe only Tim as PMC Chair can. >>> >>> -- >>> Best regards, >>> Konstantin Gribov. >> -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812492#comment-16812492 ] Ken Krugler commented on TIKA-2849: --- Hi [~boris-petrov] - two things here. First

Re: Wiki migration

2019-03-21 Thread Ken Krugler
s to do so. Maybe only Tim as PMC Chair can. > > -- > Best regards, > Konstantin Gribov. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Chinese and Korea being detected as Lithuanian by LanguageDetector

2019-01-17 Thread Ken Krugler
at it then. Regards, — Ken > On Jan 17, 2019, at 1:48 PM, Mike Thomsen wrote: > > Ken, > > Here's a Gist version of it: > > https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b > > Thanks, > > Mike > > On Thu, Jan 17, 2019 at

Re: Chinese and Korea being detected as Lithuanian by LanguageDetector

2019-01-17 Thread Ken Krugler
nl > ruru > zhlt > > Is there something that needs to be done to enable the detection of Asian > languages or should I file this as a bug report? > > Thanks, > > Mike -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.co

Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-21 Thread Ken Krugler
for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.20 > [ ] -1 Do not release this package because... > > Here's my +1. > > Cheers, > > Tim -

[jira] [Commented] (TIKA-2794) Tika extracts text from pdf on MacBook, but not windows server.,

2018-12-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710767#comment-16710767 ] Ken Krugler commented on TIKA-2794: --- Hi [~phallett] - it's better if you f

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707822#comment-16707822 ] Ken Krugler commented on TIKA-2790: --- [~talli...@apache.org] - I've compared

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707521#comment-16707521 ] Ken Krugler commented on TIKA-2790: --- Yalder is about 2-2.5x faster than lang

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707343#comment-16707343 ] Ken Krugler commented on TIKA-2790: --- My concern with OpenNLP is that during a web c

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707292#comment-16707292 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - Is there an issue

[jira] [Commented] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658028#comment-16658028 ] Ken Krugler commented on TIKA-2758: --- [~markus17] - My comment above was about

[jira] [Comment Edited] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657976#comment-16657976 ] Ken Krugler edited comment on TIKA-2758 at 10/20/18 7:5

[jira] [Commented] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657976#comment-16657976 ] Ken Krugler commented on TIKA-2758: --- At least for the "detroidnews.html

[jira] [Resolved] (TIKA-2683) Missing space and inappropriate new-line in Boilerpipe extracted text

2018-07-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-2683. --- Resolution: Fixed Fixed via [PR #243|https://github.com/apache/tika/commit

[jira] [Assigned] (TIKA-2683) Missing space and inappropriate new-line in Boilerpipe extracted text

2018-07-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-2683: - Assignee: Ken Krugler > Missing space and inappropriate new-line in Boilerpipe extracted t

[jira] [Commented] (TIKA-2648) mime detection based on resource name detects resources as "text/x-php" instead of "text/html"

2018-07-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536396#comment-16536396 ] Ken Krugler commented on TIKA-2648: --- [~wastl-nagel] - you mentioned that you tho

Re: Build with Java 10, but target 8 in Tika 2.0?

2018-06-19 Thread Ken Krugler
e target? This would allow us to bake modularity in now. > Given that I haven't actually tried modularizing/jigsawizing Tika yet, this > could be a complete disaster, of course. :) > > Cheers, > > Tim -- Ken Krugler +1 530-

[jira] [Updated] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2671: -- Description: org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's metadata. So

[jira] [Updated] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2671: -- Component/s: detector > HtmlEncodingDetector doesnt take provided metadata into acco

[jira] [Commented] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516644#comment-16516644 ] Ken Krugler commented on TIKA-2671: --- Hi [~gbouchar] - I'm curious how much te

[jira] [Commented] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514355#comment-16514355 ] Ken Krugler commented on TIKA-2671: --- Unfortunately there's no great solu

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Ken Krugler
> 2018-05-29 16:11 GMT-03:00 Ken Krugler : > >> Thanks for the ref, Tim. >> >> I’m curious why SolrCell doesn’t fire up threads when parsing docs with >> Tika (or use the fork parser), to mitigate issues with hangs & crashes? >> >> — Ken

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Ken Krugler
Thanks for the ref, Tim. I’m curious why SolrCell doesn’t fire up threads when parsing docs with Tika (or use the fork parser), to mitigate issues with hangs & crashes? — Ken > On May 29, 2018, at 11:54 AM, Tim Allison wrote: > > All, > > Over the weekend, Shawn Heisey very kindly drafted a

[jira] [Commented] (TIKA-2654) Installation issue

2018-05-29 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493927#comment-16493927 ] Ken Krugler commented on TIKA-2654: --- Hi Ankit - for problems encountered while buil

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482586#comment-16482586 ] Ken Krugler commented on TIKA-2643: --- When you've got conflicting jars on the

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481791#comment-16481791 ] Ken Krugler commented on TIKA-2643: --- Looking at the crash log, I see the follo

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481786#comment-16481786 ] Ken Krugler commented on TIKA-2643: --- Hi [~fyemaple] - how do you know that Tika 1.5

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479468#comment-16479468 ] Ken Krugler commented on TIKA-2643: --- [~fyemaple] - yes, but note that {{kill -

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477811#comment-16477811 ] Ken Krugler commented on TIKA-2643: --- [~talli...@apache.org] - different version

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477513#comment-16477513 ] Ken Krugler commented on TIKA-2643: --- If I was going to guess, it's that your

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384242#comment-16384242 ] Ken Krugler commented on TIKA-2592: --- [~AndreasMeier] - I assume when you said: {quo

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Attachment: IANA Charset names.txt > HTML with charset unicode handled as utf-16 instead ut

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Priority: Minor (was: Major) > HTML with charset unicode handled as utf-16 instead ut

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Issue Type: Improvement (was: Bug) > HTML with charset unicode handled as utf-16 instead ut

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-01 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382330#comment-16382330 ] Ken Krugler commented on TIKA-2592: --- Before making this kind of change (default "

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-02-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380874#comment-16380874 ] Ken Krugler commented on TIKA-2592: --- Hi [~AndreasMeier] - actually "unic

[jira] [Commented] (TIKA-2576) Add application/zstd detection and parser

2018-02-27 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379747#comment-16379747 ] Ken Krugler commented on TIKA-2576: --- [~talli...@mitre.org] - After some greppin

[jira] [Commented] (TIKA-2576) Add application/zstd detection and parser

2018-02-26 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377744#comment-16377744 ] Ken Krugler commented on TIKA-2576: --- Is this going to trigger more warnings in the

Use of java.util.logging in Tika

2018-01-31 Thread Ken Krugler
Hi devs, I’m curious about the occasional use of java.util.logging in Tika: > ./tika-core/src/main/java/org/apache/tika/config/InitializableProblemHandler.java:import > java.util.logging.Logger; > ./tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java:import > java.util.logging.

[jira] [Resolved] (TIKA-2539) TagSoup HTML parser is project EOL

2018-01-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-2539. --- Resolution: Duplicate > TagSoup HTML parser is project

[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215838#comment-16215838 ] Ken Krugler commented on TIKA-2478: --- Hi [~talli...@apache.org] - I've attached

[jira] [Updated] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2478: -- Attachment: mixed-simple mixed-with-pdf-inline > MBOX import includes redundant cop

[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-22 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214491#comment-16214491 ] Ken Krugler commented on TIKA-2478: --- I recently had to dig into extracting text

[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213150#comment-16213150 ] Ken Krugler commented on TIKA-2471: --- Hi [~talli...@apache.org] - I don't th

[jira] [Commented] (TIKA-2482) java.lang.NoSuchMethodError at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)

2017-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212870#comment-16212870 ] Ken Krugler commented on TIKA-2482: --- Hi [~cermar] - in general it's best to f

[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195386#comment-16195386 ] Ken Krugler commented on TIKA-2472: --- I had to deal with this before in another pro

Re: 1.15 in https://mvnrepository.com ?

2017-06-05 Thread Ken Krugler
a step in the release? No, I don’t believe so. > Does it take a few weeks for the sync? Here’s what I’ve heard (from a forum post): > Also FYI, mvnrepository.com is unaffiliated with Maven Central, and lags it > by anywhere from a few hours to a few days So potentially a few days.

Re: 2.x and tika-core dependencies

2017-03-30 Thread Ken Krugler
> (Maven/Ant+Ivy/Gradle/SBT/whatever) in their projects, > so it shouldn't be something bothersome for end user. > > What do you think, folks? > > [1]: https://issues.apache.org/jira/browse/TIKA-2314 > > -- > > Best regards, > Konstantin Gribov

Re: [DISCUSS] Contribution guide & style enforcement

2017-03-29 Thread Ken Krugler
tracking issue > [5]: http://checkstyle.sourceforge.net/ > [6]: https://maven.apache.org/plugins/maven-checkstyle-plugin/ > > > > -- > > Best regards, > Konstantin Gribov -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: TIKA build error

2017-03-15 Thread Ken Krugler
but was:<[ > > ]> > Tests in error: > ODFParserTest.testNullStylesInODTFooter:367 » WriteLimitReached Your > document ... > > ODFParserTest.testParagraphLevelFontStyles:388->TikaTest.getXML:191->TikaTest.getXML:205 > » SAX -- K

Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-11-01 Thread Ken Krugler
[ ] +1 Release this package as Apache Tika 1.14 > [ ] -1 Do not release this package because.. > > Cheers, > Chris > > P.S. Of course here is my +1. > > > > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-24 Thread Ken Krugler
-1 Do not release this package because.. > > Cheers, > Chris > > P.S. Of course here is my +1. > > > > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: https://issues.apache.org/jira/browse/INFRA-12186

2016-09-21 Thread Ken Krugler
Hi Lewis, > On Sep 21, 2016, at 2:32pm, lewis john mcgibbney wrote: > > Hi Ken, > Good question. Answer below > > On Wed, Sep 21, 2016 at 2:16 PM, wrote: > >> >> From: Ken Krugler >> To: dev@tika.apache.org >> Cc: >> Date: Tu

Re: https://issues.apache.org/jira/browse/INFRA-12186

2016-09-20 Thread Ken Krugler
s.apache.org/jira/browse/INFRA-12186, it will help > us to reduce major bugs in Tika over time. > Thanks > Lewis > > -- > http://home.apache.org/~lewismc/ > @hectorMcSpector > http://www.linkedin.com/in/lmcgibbney -- Ken Krugler +1 530-210-6378 http

Re: [DISCUSS] Unecessary deps exclusion in `tika-parsers`

2016-08-24 Thread Ken Krugler
his issue doesn't > affect me directly. > > [1]: http://proguard.sourceforge.net/index.html#manual/usage.html > [2]: http://www.oracle.com/technetwork/java/javase/clopts-139448.html#gbmtm > > > ср, 24 авг. 2016 г. в 21:16, Ken Krugler : > >> I think excluding mor

Re: [DISCUSS] Unecessary deps exclusion in `tika-parsers`

2016-08-24 Thread Ken Krugler
est coverage to ensure common usecases won't be broken, of course. > > [1]: > https://issues.apache.org/jira/browse/TIKA-2007?focusedCommentId=15435206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15435206 > -- > > Be

[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors

2016-08-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423280#comment-15423280 ] Ken Krugler commented on TIKA-2056: --- Hi [~chrismattmann] - I haven't actually d

[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2038: -- Description: Currently, Tika uses icu4j for detecting charset encoding of HTML documents as well as the

Re: xmpcore in Maven Central?

2016-07-15 Thread Ken Krugler
org/browse/OSSRH-22250, looks like it’s https://in.linkedin.com/in/meetabhishekjindal <https://in.linkedin.com/in/meetabhishekjindal> — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

[jira] [Commented] (TIKA-2033) Value attributes of input elements not extracted from HTML

2016-07-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378434#comment-15378434 ] Ken Krugler commented on TIKA-2033: --- Yes, of course...I was thinking of whether

[jira] [Commented] (TIKA-2033) Value attributes of input elements not extracted from HTML

2016-07-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378358#comment-15378358 ] Ken Krugler commented on TIKA-2033: --- Do you have a suggestion for how the text sh

[jira] [Commented] (TIKA-2010) Unable to get value when header is incorrect

2016-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332124#comment-15332124 ] Ken Krugler commented on TIKA-2010: --- OK - I think then we'll want to escalate [

[jira] [Updated] (TIKA-2010) Unable to get value when header is incorrect

2016-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2010: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Unable to get value w
