Thanks Tim!
Compiled StormCrawler with Tika 2.9.0 and ran a crawl without noticing any
issues.
+1 (non binding) to release
Julien
On Wed, 23 Aug 2023 at 15:50, Tim Allison wrote:
> A candidate for the Tika 2.9.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.9.0
>
>
Thanks Tim,
I have tried with the RC2 and it is now working fine.
+1 from me
J
On Thu, 11 May 2023 at 21:08, Tim Allison wrote:
> A candidate for the Tika 2.8.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.8.0
>
> The release candidate is a zip archive of the sourc
Thanks Tim,
I am testing 2.8.0 with StormCrawler
Apart from a lot of warnings about missing classes like
*Caused by: java.lang.ClassNotFoundException:
org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream *
I am also getting a failed test when trying to extract text from an
embedd
Hi Tim,
Thanks for the release. I ran Tika 2.7.0 with StormCrawler and did not
notice any problems.
Cheers
Julien
On Tue, 31 Jan 2023 at 19:13, Tim Allison wrote:
> A candidate for the Tika 2.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.7.0
>
> The release cand
[
https://issues.apache.org/jira/browse/TIKA-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche closed TIKA-2269.
---
thanks for committing [~talli...@mitre.org]
> NPE with FeedPar
Julien Nioche created TIKA-2269:
---
Summary: NPE with FeedParser
Key: TIKA-2269
URL: https://issues.apache.org/jira/browse/TIKA-2269
Project: Tika
Issue Type: Bug
Components: parser
-----Original Message-----
> From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> Sent: Thursday, October 20, 2016 8:34 AM
> To: dev@tika.apache.org
> Subject: Re: [VOTE] Apache Tika 1.14 Release Candidate #1
>
> Hi
>
> Am getting the following when running 'mvn clean pa
Hi
Am getting the following when running 'mvn clean package', have I forgotten
something obvious?
Julien
*Failed tests: *
* ForkParserIntegrationTest.testParserHandlingOfNonSerializable:210
expected: but
was:*
*Tests in error: *
*
ForkParserIntegrationTest.testAttachingADebuggerOnTheForkedParse
+1
On 2 January 2016 at 04:30, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:
> Hi Everyone,
>
> DISCUSS thread here: http://s.apache.org/wVE
>
> Time to officially VOTE on moving Tika to Git. I’ve made a wiki
> page for our SCM explaining how to use Git at Apache, and how to
>
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049248#comment-15049248
]
Julien Nioche commented on TIKA-1599:
-
Don't think that this is the version
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049239#comment-15049239
]
Julien Nioche commented on TIKA-1599:
-
Hi [~talli...@mitre.org]
Haven't kept
56, Julien Nioche
wrote:
> and I haven't tested it with Nutch either...
>
> On 20 April 2015 at 15:46, Julien Nioche
> wrote:
>
>> I haven't tested the RC with Behemoth, it will probably have the same
>> issue but I'll do like you and defer the update if t
and I haven't tested it with Nutch either...
On 20 April 2015 at 15:46, Julien Nioche
wrote:
> I haven't tested the RC with Behemoth, it will probably have the same
> issue but I'll do like you and defer the update if that's the case.
>
> On 20 April 201
I haven't tested the RC with Behemoth, it will probably have the same issue
but I'll do like you and defer the update if that's the case.
On 20 April 2015 at 15:23, Ken Krugler wrote:
>
> > From: Allison, Timothy B.
> > Sent: April 20, 2015 5:11:04am PDT
> > To: dev@tika.apache.org
> > Subject:
ed in favor of rc1!
>
> Details...
>
> I reran against govdocs1, and there aren't any major surprises.
>
> On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that
> Julien Nioche created for us, and I ran against that as well. That turned
> up TIKA-1605 and an
[
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487012#comment-14487012
]
Julien Nioche commented on TIKA-1599:
-
FWIW we've just added a JSoup based
[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228305#comment-14228305
]
Julien Nioche commented on TIKA-1302:
-
FYI have extracted data from the CommonC
[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226397#comment-14226397
]
Julien Nioche commented on TIKA-1302:
-
Sure, will get back to you re-details of
[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226336#comment-14226336
]
Julien Nioche commented on TIKA-1302:
-
Hi [~talli...@apache.org]
It would be eas
[
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217749#comment-14217749
]
Julien Nioche commented on TIKA-595:
Thanks Dave!
> HtmlHandler does not
[
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-595:
---
Fix Version/s: 1.7
> HtmlHandler does not support multivalue metad
[
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-595:
---
Attachment: TIKA-595.patch
Any reason why we wouldn't want to have multiple values in the metada
Hi Linh
You can specify a mapper to control what the HTML parser will and will not filter.
See
https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639
for an example
Julien
On Monday, 3 November 2014, Linh Tang wrote:
> Dear All,
>
> I am Phuong Linh,
> I am u
[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001612#comment-14001612
]
Julien Nioche commented on TIKA-1302:
-
How large do you want that batch to be? I
Hi Dave,
+1 from me. Compiled fine on Linux Mint + tested Maven artefacts with
Behemoth and ran a parse without problems.
Thanks for doing this.
Julien
On 9 February 2014 22:53, Dave Meikle wrote:
> Hi Guys,
>
> A new release candidate for the Tika 1.5 release is now available at:
> http://p
Hi Dave
Am trying to compile from src and am getting
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR] The project org.apache.tika:tika-java7:1.5-SNAPSHOT
(/data/tika-1.5/tika-java7/pom.xml) has 1 error
[ERROR] Non-resolvable parent POM: Could not find artifact
org.apac
Hi,
I had a look at Any23 some time ago and found that it overlapped with quite
a few other projects indeed but could (should?) have either relied on those
projects (e.g. parsing and mimetype stuff to Tika) or delegated the
functionality altogether (e.g. crawling to Nutch) instead of reinventing
t
Hi,
Just to let you know that we have just released version 0.3 of
crawler-commons. Crawler-commons is a set of reusable Java components that
implement functionality common to any web crawler. These components benefit
from collaboration among various existing web crawler projects, and reduce
du
from the way we deal
with the parsers?
Thanks for your comments
Julien
On 21 March 2012 16:55, Ken Krugler wrote:
>
> On Mar 21, 2012, at 8:51am, Julien Nioche wrote:
>
> > Hi guys,
> >
> > Just wondering about the best way to make the language detection
> pluggabl
That could be an interesting experiment to do with the commoncrawl dataset
and Tika on Behemoth. Assuming of course that the detection is done
correctly by Tika. Does anyone have a spare cluster on EC2 ;-) ?
Julien
On 28 January 2012 02:01, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.go
+1 from me
On 27 September 2011 06:18, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:
> Hi Folks,
>
> OK, the proposal period had died now and I'm now calling a formal VOTE on
> the Any23 proposal located here:
>
> http://wiki.apache.org/incubator/Any23Proposal
>
> Proposal text
This is not a Tika issue. Ask this on the Nutch user list instead.
On 9 September 2011 22:34, hadi wrote:
> when i want to index video file with nutch 1.3 i get the following error :
>
> *Error parsing: file:///D:/film.avi: failed(2,0): Can't retrieve Tika
> parser
> for
> mime-type video/x-ms
[
https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-612:
---
Attachment: Tika-612.patch
Patch which allows the options to be specified via the Context object. WDYT
[
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089498#comment-13089498
]
Julien Nioche commented on TIKA-696:
The text of the watermark can be found towards
[
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-696:
---
Attachment: Demo+with+watermark.docx
.docx version generated with MS Office
Can't see the wate
[
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089480#comment-13089480
]
Julien Nioche commented on TIKA-696:
Can't see the watermark when saving and
[
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-696:
---
Attachment: Demo with watermark.doc
Attached doc file containing a watermark
> Extract waterma
Reporter: Julien Nioche
Attachments: Demo with watermark.doc
It would be nice to store the text of a watermark as metadata.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
Hi
It's a few months since 0.9 and our Tika in Action book is soon ready
> for print, so I think it's good time to start planning for the 1.0
> release.
>
> There are a few odds and ends that I'd still like to sort out in the
> trunk, but overall I think we're in a pretty much ready for the switch
[
https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche reassigned TIKA-657:
--
Assignee: Julien Nioche
> Email parser gets into trouble on malformed html in enron cor
[
https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030467#comment-13030467
]
Julien Nioche commented on TIKA-657:
Good idea. We need more tutorials and example
[
https://issues.apache.org/jira/browse/TIKA-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026266#comment-13026266
]
Julien Nioche commented on TIKA-649:
Sorry, should have tested on the trunk as
NPE while parsing a .docx
---
Key: TIKA-649
URL: https://issues.apache.org/jira/browse/TIKA-649
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.9
Reporter: Julien
[
https://issues.apache.org/jira/browse/TIKA-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-649:
---
Attachment: Popcorn.docx
Wikipedia content on popcorn within a docx page
> NPE while parsing a .d
Hi guys,
We are currently getting duplicated text for the heading from .doc files
e.g.
*29. No Partnership or Agency XE "29. No
Partnership or Agency" *
XE seems to be a flag in MS Word
http://taxonomist.tripod.com/indexing/wordflags.html but I don't think it
should be displayed.
Have I missed
[
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche closed TIKA-611.
--
> PDFParser mixes the text from separate colu
[
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche resolved TIKA-611.
Resolution: Fixed
Committed revision 1079705.
Opened TIKA-612 for the params via ParseContext
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
See https://issues.apache.org/jira/browse/TIKA-611. The options used by PDFBox
are currently hard-coded in the PDFParser code; we will allow them to be
specified via the ParseContext objects
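
The idea in this issue, passing parser options through a context object instead of fixing them in the parser code, can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tika's implementation: `TypedContext` and `PdfOptions` are hypothetical stand-ins written for this example, and `sortByPosition` is an assumed option name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a typed context: at most one object per class key. */
final class TypedContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the object stored under the key, or the default if none was set. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

/** Hypothetical options holder; the field name is illustrative only. */
final class PdfOptions {
    final boolean sortByPosition;

    PdfOptions(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }
}

final class ContextOptionsDemo {
    /** Used whenever the caller passes nothing via the context. */
    static final PdfOptions DEFAULTS = new PdfOptions(false);

    /** A parser would call this instead of reading constants from its own code. */
    static PdfOptions resolve(TypedContext context) {
        return context.get(PdfOptions.class, DEFAULTS);
    }

    public static void main(String[] args) {
        TypedContext context = new TypedContext();
        System.out.println(resolve(context).sortByPosition);

        // The caller overrides the default without touching the parser.
        context.set(PdfOptions.class, new PdfOptions(true));
        System.out.println(resolve(context).sortByPosition);
    }
}
```

Keying the lookup on the class keeps the call site type-safe, and the default argument means existing callers that pass nothing see no change in behaviour.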
[
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004035#comment-13004035
]
Julien Nioche commented on TIKA-611:
The current behaviour is incorrect not only
[
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003884#comment-13003884
]
Julien Nioche commented on TIKA-611:
No objections? Shall I commit this?
> PD
: 0.9
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.0
As reported on the dev list by Michael Schmitz :
bq. I don't think the current snapshot is parsing articles (pdfs with
columns/beads) correctly. The text is not in the right order
[
https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche resolved TIKA-597.
Resolution: Fixed
Fix Version/s: 1.0
Committed revision 1076300
Thanks Benson
> Bo
[
https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001497#comment-13001497
]
Julien Nioche commented on TIKA-597:
Benson,
I can't see any TikaRuntimeExc
[
https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche reassigned TIKA-597:
--
Assignee: Julien Nioche (was: Chris A. Mattmann)
> Bogus exception handler
>
>
> Please vote on releasing these packages as Apache Tika 0.9. The vote is
> open
> for the next 72 hours. Only votes from Tika PMC are binding, but everyone
> is welcome to check the release candidate and voice their approval or
> disapproval. The vote passes if at least three binding +1 votes
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965286#action_12965286
]
Julien Nioche commented on TIKA-461:
patch -p1 failed
peb...@lucid-vostro:/data/
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965271#action_12965271
]
Julien Nioche commented on TIKA-461:
Benjamin, thanks for your patch. Could you gene
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-461:
---
Attachment: testRFC822-multipart
Test document for mail parsing with multiparts, text + html
Hi Ben,
Great! I still haven't found the time to work on Nick's suggestions but you
can definitely work on the tests if you want to and add some of the emails
you mentioned. Having some cases of multipart with HTML and txt content +
images and attachments would be good.
Thanks
Julien
On 25 Nove
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930180#action_12930180
]
Julien Nioche commented on TIKA-461:
Nope. I was planning to refactor the parser f
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915708#action_12915708
]
Julien Nioche commented on TIKA-461:
Nick,
Thanks for taking the time to revie
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915269#action_12915269
]
Julien Nioche commented on TIKA-461:
Hi guys,
Could anyone have a look at the p
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-461:
---
Issue Type: New Feature (was: Bug)
changed from bug to new feature
> RFC822 messages not par
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-461:
---
Attachment: TIKA-461.patch
This patch contains an initial version of the RFC822Parser which uses
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906468#action_12906468
]
Julien Nioche commented on TIKA-461:
I'll have a look at mime4j and try to
[
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche reassigned TIKA-461:
--
Assignee: Julien Nioche
> RFC822 messages not par
[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899465#action_12899465
]
Julien Nioche commented on TIKA-463:
Looks good. I must be missing something obvious
[
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche resolved TIKA-460.
Resolution: Fixed
Committed revision 985444
The A elements are now processed correctly when using
[
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898335#action_12898335
]
Julien Nioche commented on TIKA-460:
Hi Ken, correct. The A's get bypassed
+1 from me
On 2 August 2010 18:33, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:
> Hi Tika community,
>
> Jukka Zitting and I are working on the Tika in Action book [1]. How would
> everyone feel about us posting a link to it on the Tika website [2]?
>
> If so, I'll prepare a p
[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892958#action_12892958
]
Julien Nioche commented on TIKA-463:
Am very tempted to push things one step further
[
https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche closed TIKA-466.
--
> Feed Parser
> ---
>
> Key: TIKA-466
>
[
https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890380#action_12890380
]
Julien Nioche commented on TIKA-466:
Thanks Chris for reviewing and committin
[
https://issues.apache.org/jira/browse/TIKA-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889883#action_12889883
]
Julien Nioche commented on TIKA-147:
There is http://www.jswiff.com/licensing/ w
[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-463:
---
Attachment: TIKA-463.patch
Patch which implements some of the ideas described in this issue
[
https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-466:
---
Attachment: TIKA-466.patch
> Feed Parser
> ---
>
> K
Feed Parser
---
Key: TIKA-466
URL: https://issues.apache.org/jira/browse/TIKA-466
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Priority: Minor
[
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887718#action_12887718
]
Julien Nioche commented on TIKA-460:
this would work if we had in the list of
[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716
]
Julien Nioche commented on TIKA-463:
creating a LinksHtmlMapper : +1, that would
[
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-460:
---
Attachment: TIKA-460.patch
> HTMLHandler misses treatment of A eleme
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 0.8
The A elements should be processed before any other safe element, otherwise it
never happens
Reporter: Julien Nioche
Attachments: TIKA-458.patch
One of the recent changes on Tika is the possibility to specify a custom
HTMLMapper via the Context - which I think is an elegant mechanism. I was
wondering whether there would be a reason NOT to be able to do the same for the
HTMLHandler
HTMLParser gets an early event
--
Key: TIKA-457
URL: https://issues.apache.org/jira/browse/TIKA-457
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Julien Nioche
I am
[
https://issues.apache.org/jira/browse/TIKA-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-458:
---
Attachment: TIKA-458.patch
> Specify HTMLHandler via Cont
Hi guys,
One of the recent changes on Tika is the possibility to specify a custom
HTMLMapper via the Context - which I think is an elegant mechanism. I was
wondering whether there would be a reason NOT to be able to do the same for
the HTMLHandler and if nothing is passed via the Context, rely on
[
https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche closed TIKA-454.
--
Resolution: Fixed
Committed revision 960487
> Illegal Charset Name crashes HTMLPar
[
https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche reassigned TIKA-454:
--
Assignee: Julien Nioche
> Illegal Charset Name crashes HTMLPar
[
https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated TIKA-454:
---
Attachment: TIKA-454.patch
Trivial fix - simply catch the exception and let the guesswork begin.
The
Reporter: Julien Nioche
Fix For: 0.8
As reported by Andrzej [1], the HTMLParser crashes when the charset found in
meta is illegal e.g.
[1]
http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3c4c2a102d.7090...@getopt.org%3e
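
The failure mode described here can be reproduced with the JDK alone. The sketch below is a minimal illustration of catching the exception and falling back to detection; `SafeCharsetLookup` and `lookup` are hypothetical names chosen for this example and are not the code committed for TIKA-454.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

final class SafeCharsetLookup {

    /**
     * Resolves a charset name taken from an HTML meta tag.
     * Returns null when the name is missing, illegal, or unsupported,
     * so the caller can fall back to guessing the encoding.
     */
    static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Malformed or unknown name in the page: ignore it rather than crash.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup("UTF-8"));
        // A whole Content-Type value is not a legal charset name.
        System.out.println(lookup("text/html; charset=utf-8"));
    }
}
```

Returning null rather than a hard-coded default leaves the choice of fallback to the caller, which matches the "let the guesswork begin" approach of the fix.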
[
https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883660#action_12883660
]
Julien Nioche commented on TIKA-448:
I have seen similar cases with FLV when the con
Same here
+1
On 21 June 2010 18:00, Ken Krugler wrote:
> Hi Jukka,
>
> I can't think of any cons, so +1
>
> -- Ken
>
>
> On Jun 21, 2010, at 3:02am, Jukka Zitting wrote:
>
> Hi,
>>
>> The PDFBox web site [1] is now managed using the new svnpubsub
>> mechanism set up by the infra team. Basica
), text analysis and I recently started an open
source project named Behemoth, which makes it possible to scale text analysis
applications using Hadoop.
Best,
Julien Nioche
--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
Julien
On 5 June 2010 23:42, Mattmann
[
https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871623#action_12871623
]
Julien Nioche commented on TIKA-433:
Could do. I can't see a place in Tika
[
https://issues.apache.org/jira/browse/TIKA-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871585#action_12871585
]
Julien Nioche commented on TIKA-430:
The method mapSafeAttribute(String element
[
https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871544#action_12871544
]
Julien Nioche commented on TIKA-433:
You can do that with [Behemoth|