Re: Wrong parsing of XML

2014-07-11 Thread Ken Krugler
On Jul 11, 2014, at 8:01am, Avi Hayun wrote: > Hi, > > Scenario: > 1. I use tika-core in my app > 2. I use the following to detect the stream's media type: > > byte[] bytes = IOUtils.toByteArray(new URL("http://www.amazon.com/sitemap_ > video.xml")); > String contentType = new Tika().detect(by

RE: NPE on all *.odt, odp, .ods documents

2014-09-11 Thread Ken Krugler
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) > ... 25 more > > -- > -- > Hong-Thai -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: NPE on all *.odt, odp, .ods documents

2014-09-11 Thread Ken Krugler
ppt (14) > - xls (9) > - dwg (4) > - odp (495) > - odt (839) > - pps (2) > - ods (1) > > 1.7-SNAPSHOT: > - pdf (7) - pptx (10) - doc (6) - ppt (14) - xls (9) - dwg (4) - odp (2) - > pps (2) > > > On Thu, Sep 11, 2014 at 8:55 PM, Ken Krugler > wrote: >

RE: Parse Html with Tika

2014-11-03 Thread Ken Krugler
> Can you tell me what i can do to parse all tag of html. > > Thanks advance! > > Regards, > Tang Thi Phuong Linh. > -- > P.Linh -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascadin

RE: Move definitively from SVN to Git ?

2014-11-19 Thread Ken Krugler
effort SVN-using committers would have to expend? > > I don't mean to incite a VCS war. ;) git v. svn is more like a brushfire that flares up every few months, at least on the @members list :) -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: SVN access issue

2014-12-12 Thread Ken Krugler
://svn.apache.org/repos/asf/tika/trunk' > svn: E000111: Error running context: Connection refuse > > Can it be related to the recent infra-related issue or is it just a temp > problem ? Working for me, just tried. -- Ken -- Ken Krugler +1 530-210-

RE: Licensing Question

2015-03-20 Thread Ken Krugler
33a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java>. > Is this correct and OK to use? > > Thanks, > Tyler -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassa

RE: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Ken Krugler
yone have any last minute issues they'd like to finish and see in > Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and > TIKA-1586). Any others? > > Have a good weekend, > Tyler -- Ken Krugler +1 530-210-6378 http://www.s

RE: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-09 Thread Ken Krugler
> gpg: There is no indication that the signature belongs to the owner. > Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4 183E 8810 BB19 D4F1 0117 > > Not sure if Chris, Lewis et al are near you and do this quickly? > > Cheers, > Dave --

RE: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-15 Thread Ken Krugler
.8 > [ ] ±0 I don't object to this release, but I haven't checked it > [ ] -1 Do not release this package because... > > Thanks, > Tyler -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-19 Thread Ken Krugler
ae2d7fdd31. >> >> In addition, a staged maven repository is available here: >> https://repository.apache.org/content/repositories/orgapachetika-1009 >> >> Please vote on releasing this package as Apache Tika 1.8. The vote is > open for the next 72 hours and pass

RE: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Ken Krugler
> https://dist.apache.org/repos/dist/dev/tika/ >>> >>> The release candidate is a zip archive of the sources in: >>> http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/ >>> >>> The SHA1 checksum of the archive is >>> 5e22fee9079370398472e59082

RE: Detection problem: Parsing scientific source codes for geoscientists

2015-04-22 Thread Ken Krugler
e: text/x-java-source > LoC: 70 > X-Parsed-By: org.apache.tika.parser.DefaultParser > X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser > resourceName: UrlParser.java > > Should I build a parser for each file format to get an exact content-type, as > Java has Sour

RE: comparing Tika's file detect with other tools?

2015-04-22 Thread Ken Krugler
vely reverse engineering (when we > find that Tika is wrong) from a non-Apache project? > > Any other sensitivities I should be aware of? > > Best, > > Tim -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Detection problem: Parsing scientific source codes for geoscientists

2015-04-22 Thread Ken Krugler
ind of previous project are you looking into? It's the Krugle code search product. Being sold as enterprise software, but they might be willing to open source the parsing code. -- Ken > ____________ > From: Ken Krugler [kkrugler_li...@transpac.com] >

DOAP questions

2015-05-08 Thread Ken Krugler
dit that, but I don't know where in the sequence it makes sense. I assume it should be in step 13, "Update Tika site" Thanks, -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-09 Thread Ken Krugler
++ > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++ -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Bayesian N-Gram Language Detection

2015-07-28 Thread Ken Krugler
il: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++++++++++ > > -

RE: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-04 Thread Ken Krugler
ority of at least > three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.10 > > [ ] -1 Do not release this package because... > > Here is my +1! > > Cheers, > Dave -- Ken Krugler +1 530-210

RE: Adding API support for Java 7's java.nio.file.Path

2015-08-29 Thread Ken Krugler
> > > This email communication (including any attachments) contains information > from Answers Corporation or its affiliates that is confidential and may be > privileged. The information contained herein is intended only for the use > of the addressee(s) named above. If you

Remove support for building language identifier profiles?

2015-08-29 Thread Ken Krugler
Hi all, As part of integrating language-detector into Tika (see TIKA-1723), I noticed TIKA-546 ("Add ability to create language profiles to tika-app") If we switch over to language-detector, then this code no longer makes sense. Also note that many language detectors require the full set of lan

RE: more modular parser bundles

2015-11-30 Thread Ken Krugler
k >> Components: parser >>Reporter: Madhav Sharan >> >> >> As of now tika uses lucene-geo-gazetteer CLI to extract co-ordinates of a >> location. CLI requires jvm and lucene to instantiate for every request. >> With all new REST api

RE: Tika 2.0 Source in Modules or tika-parser

2015-12-14 Thread Ken Krugler
flicting dependencies managed by maven. I don't have any experience with moving classes around to create modules, so my natural inclination is to move the sources. As far as shared code, I think moving something like commons-codec into core (100K) is fine. -- Ken -

RE: [VOTE] Moving SCM to Git

2016-01-02 Thread Ken Krugler
ection (398) > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++++++++++ > Adjunct Associate Professor,

RE: [VOTE] Moving SCM to Git

2016-01-04 Thread Ken Krugler
Systems Section (398) > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Depar

RE: Are we on git?

2016-01-22 Thread Ken Krugler
VN", and > http://tika.apache.org/contribute.html still talks about SVN being our master. > > What's the status? Have we switched? Still in progress? Where should we > commit to? Is it time to delete our SVN checkouts and re-checkout from git? > > Cheers >

RE: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)

2016-01-25 Thread Ken Krugler
t >>> University of Southern California, Los Angeles, CA 90089 USA >>> ++ >>> >>> -Original Message- >>> From: Markus Jelsma >>> Reply-To: "u...@tika.apache.org"

RE: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-28 Thread Ken Krugler
on releasing this package as Apache Tika 1.12. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.12 > [ ] -1 Do not release this package because… > > Cheers, >

Tika 2.0 and language detection

2016-02-04 Thread Ken Krugler
. Thanks, -- Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

scm info in pom.xml

2016-02-06 Thread Ken Krugler
/tika/trunk/tika-langdetect scm:svn:https://svn.apache.org/repos/asf/tika/trunk/tika-langdetect What's the plan (if any) for switching to git details in poms? Thanks, -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions

project.build.sourceEncoding

2016-02-06 Thread Ken Krugler
- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Tracking 2.x migration changes

2016-02-06 Thread Ken Krugler
Is there a document where we're tracking what (breaking) API changes are occurring in the 2.x branch, and the migration path from 1.x for Tika users? If not, should this be a wiki page that we all edit iteratively? Thanks, -- Ken ------ Ken Krugler +1 530-210-6378

Use of interface vs. abstract class

2016-02-09 Thread Ken Krugler
ServiceLoader require that these be interfaces? I assume not, as isAssignableFrom() should work with either interfaces or abstract classes, right? Asking because I'm looking at the language detector API for 2.x. Thanks, -- Ken ------ Ken Krugler +1 530-210-6378
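On the isAssignableFrom() question above, here is a minimal sketch using only the JDK. The class names (LanguageDetectorSketch, AlwaysEnglishDetector, ServiceTypeCheck) are hypothetical and are not Tika's actual 2.x API. It only shows that Class.isAssignableFrom() treats an abstract base class the same way it treats an interface. As far as I know, the java.util.ServiceLoader documentation also describes the service type as "an interface or abstract class", so the assumption in the message looks sound.

```java
// Hypothetical service type, declared as an abstract class instead of an interface.
abstract class LanguageDetectorSketch {
    abstract String detect(String text);
}

// Hypothetical provider that extends the abstract service type.
class AlwaysEnglishDetector extends LanguageDetectorSketch {
    @Override
    String detect(String text) {
        return "en";
    }
}

class ServiceTypeCheck {
    // True when 'candidate' can be used wherever 'serviceType' is expected.
    // Works identically for interfaces and abstract classes.
    static boolean isProviderOf(Class<?> serviceType, Class<?> candidate) {
        return serviceType.isAssignableFrom(candidate);
    }

    public static void main(String[] args) {
        System.out.println(isProviderOf(LanguageDetectorSketch.class, AlwaysEnglishDetector.class));
        System.out.println(isProviderOf(LanguageDetectorSketch.class, String.class));
    }
}
```

One practical difference worth keeping in mind: an abstract class lets the project add non-abstract methods later without breaking existing providers, at the cost of using up the provider's single superclass slot.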

RE: Use of interface vs. abstract class

2016-02-09 Thread Ken Krugler
nset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++ >

RE: Integrating Tika with MITLL Text.jl library for language detection

2016-02-23 Thread Ken Krugler
ith Tika 1.11 language detector. > https://docs.google.com/spreadsheets/d/1cW6S2WpiN08pZ3UMVGMyQkO-fotUiUyGRemCrbC1miY/edit?usp=sharing > > I was also looking at the work done by Ken Krugler on Tika's 2.x branch > language detection and I was planning to fork that project and add the > Text

RE: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

2016-03-28 Thread Ken Krugler
community to add our method there, wait for a new release and use > that! See https://issues.apache.org/jira/browse/TIKA-1706 for the issue - and seems like 2.0 is a fine place to make the clean switch to just using Commons IOUtils. -- Ken -- Ken Krugler +1 530-2

Who's going to Apache: Big Data in May?

2016-03-29 Thread Ken Krugler
th-america/program/schedule ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Squashing GitHub pull requests while merging

2016-05-06 Thread Ken Krugler
evelopers/github.html> Isn't this something we’d want to do as well? Thanks, — Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: My "What's new with Apache Tika 2.0" talk slides

2016-05-11 Thread Ken Krugler
't make it to Vancouver this week, the slides from my > "What's new with Apache Tika 2.0" talk are now available online: > http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20 > > The audio was recorded, hopefully that will be available to go with the >

Re: xmpcore in Maven Central?

2016-07-15 Thread Ken Krugler
org/browse/OSSRH-22250, looks like it’s https://in.linkedin.com/in/meetabhishekjindal <https://in.linkedin.com/in/meetabhishekjindal> — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [DISCUSS] Unecessary deps exclusion in `tika-parsers`

2016-08-24 Thread Ken Krugler
est coverage to ensure common usecases won't be broken, of course. > > [1]: > https://issues.apache.org/jira/browse/TIKA-2007?focusedCommentId=15435206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15435206 > -- > > Be

Re: [DISCUSS] Unecessary deps exclusion in `tika-parsers`

2016-08-24 Thread Ken Krugler
his issue doesn't > affect me directly. > > [1]: http://proguard.sourceforge.net/index.html#manual/usage.html > [2]: http://www.oracle.com/technetwork/java/javase/clopts-139448.html#gbmtm > > > Wed, Aug 24, 2016 at 21:16, Ken Krugler : > >> I think excluding mor

Re: https://issues.apache.org/jira/browse/INFRA-12186

2016-09-20 Thread Ken Krugler
s.apache.org/jira/browse/INFRA-12186, it will help > us to reduce major bugs in Tika over time. > Thanks > Lewis > > -- > http://home.apache.org/~lewismc/ > @hectorMcSpector > http://www.linkedin.com/in/lmcgibbney -- Ken Krugler +1 530-210-6378 http

Re: https://issues.apache.org/jira/browse/INFRA-12186

2016-09-21 Thread Ken Krugler
Hi Lewis, > On Sep 21, 2016, at 2:32pm, lewis john mcgibbney wrote: > > Hi Ken, > Good question. Answer below > > On Wed, Sep 21, 2016 at 2:16 PM, wrote: > >> >> From: Ken Krugler >> To: dev@tika.apache.org >> Cc: >> Date: Tu

Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-24 Thread Ken Krugler
-1 Do not release this package because.. > > Cheers, > Chris > > P.S. Of course here is my +1. > > > > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-11-01 Thread Ken Krugler
[ ] +1 Release this package as Apache Tika 1.14 > [ ] -1 Do not release this package because.. > > Cheers, > Chris > > P.S. Of course here is my +1. > > > > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-21 Thread Ken Krugler
for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.20 > [ ] -1 Do not release this package because... > > Here's my +1. > > Cheers, > > Tim -

Re: Chinese and Korea being detected as Lithuanian by LanguageDetector

2019-01-17 Thread Ken Krugler
nl > ruru > zhlt > > Is there something that needs to be done to enable the detection of Asian > languages or should I file this as a bug report? > > Thanks, > > Mike -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.co

Re: Chinese and Korea being detected as Lithuanian by LanguageDetector

2019-01-17 Thread Ken Krugler
at it then. Regards, — Ken > On Jan 17, 2019, at 1:48 PM, Mike Thomsen wrote: > > Ken, > > Here's a Gist version of it: > > https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b > > Thanks, > > Mike > > On Thu, Jan 17, 2019 at

Re: Wiki migration

2019-03-21 Thread Ken Krugler
s to do so. Maybe only Tim as PMC Chair can. > > -- > Best regards, > Konstantin Gribov. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Wiki migration

2019-04-17 Thread Ken Krugler
ing wiki migration (from moin to >> confluence)? >>> >>> I can try it via selfservice.a.o if you consent but I'm not sure if I >> have >>> enough access to do so. Maybe only Tim as PMC Chair can. >>> >>> -- >>> Best regards, >>> Konstantin Gribov. >> -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Detection of plain text files

2019-06-18 Thread Ken Krugler
-- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Detection of plain text files

2019-06-25 Thread Ken Krugler
failed. > > In short, this is an area for improvement. I suspect our current > mechanism would also be pretty awful on UTF-16. > > On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler > wrote: >> >> Hi devs, >> >> I’m trying to remember the history of how Tika’s cu

Re: 1.22?

2019-07-17 Thread Ken Krugler
+1 — Ken > On Jul 15, 2019, at 2:37 PM, Tim Allison wrote: > > Anyone have anything they want to get into 1.22? If not, I’ll kick off the > regression tests shortly. > > Cheers, > Tim ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimite

Re: [ANNOUNCE] Apache Tika 1.22 released

2019-08-02 Thread Ken Krugler
ika.apache.org/ > > -- Tim Allison, on behalf of the Apache Tika community -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: [ANNOUNCE] Welcome Tilman Hausherr as Tika PMC member and committer

2019-10-04 Thread Ken Krugler
Cheers, > > Tim ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: HTML to PDF conversion

2019-10-14 Thread Ken Krugler
o the text to PDF > (for a start, something on top of that transformer), and then may be even > for other formats ? > > Sergey -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: HTML to PDF conversion

2019-10-16 Thread Ken Krugler
ent(someImage); > creator.complete(); > > It would be consistent with the Tika approach on the read side. > > Cheers, Sergey > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler wrote: > >> If you’re suggesting ways to make it easier to use something like >> YaHPConver

Re: Grant write access to our wiki to Eric Pugh

2019-10-29 Thread Ken Krugler
ve for > change notifications double-check!) > > Thanks > Nick -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Grant write access to our wiki to Eric Pugh

2019-10-30 Thread Ken Krugler
a wiki page, making a wholesale set of > edits, getting review of those edits from the community, and assuming it > passes muster, then bringing the edits back to the original page? > > > > Eric > >> On Oct 29, 2019, at 7:00 PM, Ken Krugler wrote: >> >

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Ken Krugler
ime to start working on integrating Bob's >> >> work on the current main branch. I'll have to ignore most of the incoming >> >> issues for a bit...unlike the last 4 years...this time I mean it. :) >> >> Let me know if there are any objections to heading down this path now. >> >> >> >> Cheers, >> >> >> >> Tim -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

2020-11-20 Thread Ken Krugler
ache.org/jira/browse/TIKA-2917 >>>Project: Tika >>> Issue Type: Improvement >>> Reporter: Tim Allison >>> Assignee: Tim Allison >>> Priority: Minor >>> >>> Inline images may have XMP associated with them. We are not currently >>> extracting this metadata. >> >> >> >> -- >> This message was sent by Atlassian JIRA >> (v7.6.14#76016) -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [jira] [Commented] (TIKA-2917) Extract metadata from inline images in PDFs

2020-11-21 Thread Ken Krugler
e/lib/security/ > sudo cp ~/Downloads/UnlimitedJCEPolicyJDK8/local_policy.jar > $JAVA_HOME/jre/lib/security/ — Ken > > On Fri, Nov 20, 2020 at 1:43 PM Ken Krugler > wrote: > >> Hi all, >> >> I was trying to build the 1.25-rc1 branch, and ran into this same issue >>

More issues with top-level build for Tika 1.25 rc1 - Waited more than 5 minutes for a SAXParser

2020-11-23 Thread Ken Krugler
asing the XMLReaderUtils.POOL_SIZE Nov 21, 2020 10:39:07 PM org.apache.tika.utils.XMLReaderUtils acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE … and so on… Any suggestions? Thanks! — Ken -- Ken Krugler http://www.scaleunlimite

Re: More issues with top-level build for Tika 1.25 rc1 - Waited more than 5 minutes for a SAXParser

2020-11-23 Thread Ken Krugler
131-b11, mixed mode) — Ken > On Mon, Nov 23, 2020 at 1:40 PM Ken Krugler > wrote: > >> Hi all, >> >> I got past the JCE issue, but now some tests are failing with timeouts. >> >> For this test: >> >> [INFO] Running org.apache.tika.parser.micr

Re: More issues with top-level build for Tika 1.25 rc1 - Waited more than 5 minutes for a SAXParser

2020-11-23 Thread Ken Krugler
rent SAXParser which is not handled correctly in > XMLReaderUtils? What OS, what version of java? > > Thank you, again. > > Best, > > Tim > > On Mon, Nov 23, 2020 at 1:40 PM Ken Krugler > wrote: > >> Hi

Re: [VOTE] Release Apache Tika 1.25 Candidate #2

2020-11-25 Thread Ken Krugler
a> > > Please vote on releasing this package as Apache Tika 1.25. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.25 > [ ] -1 Do not release this package bec

Re: [jira] [Commented] (TIKA-3292) Remove GSON where possible in 2.x

2021-02-04 Thread Ken Krugler
Allison >> Priority: Minor >>Fix For: 2.0.0 >> >> >> We or our dependencies use 4? json parsers last time I looked. It feels like >> a majority of our dependencies use jackson. I used to have a preference for >> GSON, which is why we h

Re: high level parser module names in 2.x

2021-03-09 Thread Ken Krugler
dency, etc. > > Some options for classic-> basic, base, ...what else? > > Any other recommendations for these names? Thank you! > > Best, > > Tim -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: TIKA build error

2017-03-15 Thread Ken Krugler
but was:<[ > > ]> > Tests in error: > ODFParserTest.testNullStylesInODTFooter:367 » WriteLimitReached Your > document ... > > ODFParserTest.testParagraphLevelFontStyles:388->TikaTest.getXML:191->TikaTest.getXML:205 > » SAX -- K

Re: [DISCUSS] Contribution guide & style enforcement

2017-03-29 Thread Ken Krugler
tracking issue > [5]: http://checkstyle.sourceforge.net/ > [6]: https://maven.apache.org/plugins/maven-checkstyle-plugin/ > > > > -- > > Best regards, > Konstantin Gribov -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: 2.x and tika-core dependencies

2017-03-30 Thread Ken Krugler
> (Maven/Ant+Ivy/Gradle/SBT/whatever) in their projects, > so it shouldn't be something bothersome for end user. > > What do you think, folks? > > [1]: https://issues.apache.org/jira/browse/TIKA-2314 > > -- > > Best regards, > Konstantin Gribov

Re: 1.15 in https://mvnrepository.com ?

2017-06-05 Thread Ken Krugler
a step in the release? No, I don’t believe so. > Does it take a few weeks for the sync? Here’s what I’ve heard (from a forum post): > Also FYI, mvnrepository.com is unaffiliated with Maven Central, and lags it > by anywhere from a few hours to a few days So potentially a few days.

Use of java.util.logging in Tika

2018-01-31 Thread Ken Krugler
Hi devs, I’m curious about the occasional use of java.util.logging in Tika: > ./tika-core/src/main/java/org/apache/tika/config/InitializableProblemHandler.java:import > java.util.logging.Logger; > ./tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java:import > java.util.logging.

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Ken Krugler
Thanks for the ref, Tim. I’m curious why SolrCell doesn’t fire up threads when parsing docs with Tika (or use the fork parser), to mitigate issues with hangs & crashes? — Ken > On May 29, 2018, at 11:54 AM, Tim Allison wrote: > > All, > > Over the weekend, Shawn Heisey very kindly drafted a
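To make the "fire up threads" idea above concrete, here is a rough sketch using only java.util.concurrent. It is not how SolrCell or Tika actually do it; TimedParse and parseWithTimeout are hypothetical names, and the Callable stands in for whatever parse call is being guarded. It bounds the time a caller waits on a parse and converts parser exceptions into a fallback value.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedParse {
    // Runs the (hypothetical) parse task on a worker thread and returns 'fallback'
    // if the task throws or does not finish within 'timeoutMillis'.
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMillis, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a hung worker should not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // only interrupts; a truly stuck parser may ignore it
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the parse task threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithTimeout(() -> "parsed text", 1000, "fallback"));
    }
}
```

A thread only protects the caller from waiting forever. It does not help with OutOfMemoryError or a native crash, which is presumably why the message also mentions the fork parser: running the parse in a separate process is the stronger form of isolation.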

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Ken Krugler
> 2018-05-29 16:11 GMT-03:00 Ken Krugler : > >> Thanks for the ref, Tim. >> >> I’m curious why SolrCell doesn’t fire up threads when parsing docs with >> Tika (or use the fork parser), to mitigate issues with hangs & crashes? >> >> — Ken

Re: Build with Java 10, but target 8 in Tika 2.0?

2018-06-19 Thread Ken Krugler
e target? This would allow us to bake modularity in now. > Given that I haven't actually tried modularizing/jigsawizing Tika yet, this > could be a complete disaster, of course. :) > > Cheers, > > Tim -- Ken Krugler +1 530-

Re: 1.27?

2021-06-30 Thread Ken Krugler
M Nicholas DiPiazza >>> wrote: >>>> >>>> +1 on 1.27 release. >>>> >>>> On Mon, Jun 28, 2021, 10:57 AM Tim Allison wrote: >>>>> >>>>> All, >>>>> The recent release of PDFBox fixed 2 DoS CVEs

Re: surefire and system.exit

2021-07-28 Thread Ken Krugler
est this with more > recent versions of the surefire plugin, or is there a recommended > workaround? > > Thank you. > >Best, > > Tim > > [0] > http://maven.apache.org/surefire/maven-surefire-plugin/faq.html#vm-termination --

Re: Proposed topics for next Tika meetups?

2021-11-09 Thread Ken Krugler
> a) tika-pipes hands-on workshop > b) get to know the users -- 5 minute go-around the room "this is how > we use it; these are our pain points" > c) ??? > > Again, thank you! > > Best, > > Tim -- Ken K

Re: [VOTE] Release Apache Tika 2.2.0 Candidate #1

2021-12-13 Thread Ken Krugler
s successfully. >> >>> >>> [X] +1 Release this package as Apache Tika 2.2.0 >> >> I did notice that the tika DL's module(s) are pulling in the entire Hadoop >> dependency chain. I wonder if we can cut down on this... that is however a >> concern outside of this release candidate review. >> >> Thanks for the quick turnaround. >> lewismc >> -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Support page?

2023-05-12 Thread Ken Krugler
with > ASF projects. I'd want to copy the header pretty much literally about > no endorsements, etc. > What would you think of adding something similar to our wiki or our website? > >Best, > > Tim -- Ken Krugler http://www

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Ken Krugler
e, Sep 12, 2023 at 10:49 AM Tim Allison <mailto:talli...@apache.org>> wrote: >> >If Tika users will be happy to move on and drop Java 8 and/or javax. Please >> >drop them :))) >> >> Fellow devs and broader Tika community, are we ok with EOL'ing Tika

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Ken Krugler
) Keep Java 11 in "main"/3.x now and set the EOL for Tika 2.x/Java 8 in say > 6 months or fewer? > > Thank you, all, for your feedback! > > Best, > > Tim > > -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Magika file type detection

2024-02-19 Thread Ken Krugler
Hi Tika devs, Check out Magika at https://github.com/google/magika Wondering if we could leverage Deeplearning4j to run the model from that project. — Ken -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink & Pinot

OCR dataset

2024-04-03 Thread Ken Krugler
Hi devs, I saw this dataset on Hugging Face, seems useful for evaluating Tika OCR… — Ken https://huggingface.co/datasets/pixparse/idl-wds

Normalizing meta tag names

2011-08-19 Thread Ken Krugler
n that's found in most web pages, from what I see. -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions

Request for patch review - TIKA-431

2011-09-16 Thread Ken Krugler
ined. Regards, -- Ken ---------- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr

Support for Open Graph meta tags

2011-09-22 Thread Ken Krugler
at would take a tag like: <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> and put it into the metadata map as "og:url" => "http://www.imdb.com/title/tt0117500/" Thoughts on this? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
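The mapping proposed above could look roughly like the following sketch. It assumes the tag in question is a standard Open Graph element of the form meta property="og:..." content="...", and it uses a plain regular expression instead of Tika's HTML parser, so OpenGraphSketch and extract are hypothetical names and not part of any Tika API. A real implementation would hook into the existing HTML handler and cope with attribute order, single quotes, and entities.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OpenGraphSketch {
    // Matches <meta property="og:..." content="..."> with the attributes in this order only.
    private static final Pattern OG_META = Pattern.compile(
            "<meta\\s+property=\"(og:[^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns a map from Open Graph property name (e.g. "og:url") to its content value.
    static Map<String, String> extract(String html) {
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher m = OG_META.matcher(html);
        while (m.find()) {
            metadata.put(m.group(1), m.group(2));
        }
        return metadata;
    }

    public static void main(String[] args) {
        String html = "<meta property=\"og:url\" content=\"http://www.imdb.com/title/tt0117500/\" />";
        System.out.println(extract(html));
    }
}
```

Keeping the "og:" prefix in the key, as the message suggests, avoids collisions with ordinary meta names such as "title" or "description".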

Re: Support for Open Graph meta tags

2011-09-23 Thread Ken Krugler
On Sep 23, 2011, at 3:24am, Jukka Zitting wrote: > Hi, > > On Fri, Sep 23, 2011 at 2:23 AM, Ken Krugler > wrote: >> The reason why is that Open Graph uses RDFa > > Instead of mapping the RDFa tags to Tika's Metadata and then > back to normal XHTML tags, we

Re: Support for Open Graph meta tags

2011-09-23 Thread Ken Krugler
say, me, where the end result is likely to be horribly wrong. For better or worse, RDF has never been an itch that I've needed to scratch. -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr

Re: Newb: IDE + Maven?

2011-10-03 Thread Ken Krugler
pom.xml > Path: /tika-bundle-it > Location: line 87 > Type: Maven Project Build Lifecycle Mapping Problem > > I looked up the problem and came up with this link: > http://wiki.eclipse.org/M2E_plugin_execution_not_covered > > However, I don't understand what is actually g

Re: Google's Compact Language Detector

2011-10-24 Thread Ken Krugler
hrome/trunk/src/third_party/cld/ > > Best regards > > Jérôme > > -- > @jcharron > http://motre.ch/ > http://jcharron.posterous.com/ > http://www.shopreflex.fr/ > http://www.staragora.com/ > > <http://feeds.feedburner.com/~r/Bligblagblog/~6/1> -

Re: Google's Compact Language Detector

2011-10-24 Thread Ken Krugler
va language detect library > (http://code.google.com/p/language-detection)... hoping to finish that > soon and do a followon blog post. > > Mike McCandless > > http://blog.mikemccandless.com > > On Mon, Oct 24, 2011 at 9:45 AM, Ken Krugler > wrote: >> I took a qui

Re: Google's Compact Language Detector

2011-10-25 Thread Ken Krugler
py, and every three characters triggers a new String() -- Ken > http://blog.mikemccandless.com > > On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless > wrote: >> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler >> wrote: >> >>> Sounds like a great idea - see the recent comment thr

Re: Tika 1.0 RC?

2011-10-25 Thread Ken Krugler
> Adjunct Assistant Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++ > -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr

Re: A problem in the right-to-left languages

2011-11-01 Thread Ken Krugler
hem, I fear there may > not be anyone left in their project who's interested in charset detectors any > more. I'd love to be proved wrong though, if anyone has any personal contacts > on the project they could prod about it? > > Nick -- Ken Krugl

Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-07 Thread Ken Krugler
> Office: 171-266B, Mailstop: 171-246 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Assistant Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > +++

Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-07 Thread Ken Krugler
66B, Mailstop: 171-246 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Assistant Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA >

Re: Pluggable language detection

2012-03-21 Thread Ken Krugler
logspot.com/ > http://www.digitalpebble.com > http://twitter.com/digitalpebble -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
