Re: [VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-28 Thread Julien Nioche
Thanks Tim! Compiled StormCrawler with Tika 2.9.0 and ran a crawl without noticing any issues. +1 (non binding) to release Julien On Wed, 23 Aug 2023 at 15:50, Tim Allison wrote: > A candidate for the Tika 2.9.0 release is available at: > https://dist.apache.org/repos/dist/dev/tika/2.9.0 > >

Re: [VOTE] Release Apache Tika 2.8.0 Candidate #2

2023-05-12 Thread Julien Nioche
Thanks Tim, I have tried with the RC2 and it is now working fine. +1 from me J On Thu, 11 May 2023 at 21:08, Tim Allison wrote: > A candidate for the Tika 2.8.0 release is available at: > https://dist.apache.org/repos/dist/dev/tika/2.8.0 > > The release candidate is a zip archive of the sourc

Re: [VOTE] Apache Tika 2.8.0 Release Candidate 1

2023-05-11 Thread Julien Nioche
Thanks Tim, I am testing 2.8.0 with StormCrawler. Apart from a lot of warnings about missing classes like *Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream * I am also getting a failed test when trying to extract text from an embedd

Re: [VOTE] Release Apache Tika 2.7.0 Candidate #1

2023-02-03 Thread Julien Nioche
Hi Tim, Thanks for the release. I ran Tika 2.7.0 with StormCrawler and did not notice any problems. Cheers Julien On Tue, 31 Jan 2023 at 19:13, Tim Allison wrote: > A candidate for the Tika 2.7.0 release is available at: > https://dist.apache.org/repos/dist/dev/tika/2.7.0 > > The release cand

[jira] [Closed] (TIKA-2269) NPE with FeedParser

2017-02-21 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed TIKA-2269. --- thanks for committing [~talli...@mitre.org] > NPE with FeedPar

[jira] [Created] (TIKA-2269) NPE with FeedParser

2017-02-20 Thread Julien Nioche (JIRA)
Julien Nioche created TIKA-2269: --- Summary: NPE with FeedParser Key: TIKA-2269 URL: https://issues.apache.org/jira/browse/TIKA-2269 Project: Tika Issue Type: Bug Components: parser

Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-20 Thread Julien Nioche
riginal Message- > From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] > Sent: Thursday, October 20, 2016 8:34 AM > To: dev@tika.apache.org > Subject: Re: [VOTE] Apache Tika 1.14 Release Candidate #1 > > Hi > > Am getting the following when running 'mvn clean pa

Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-20 Thread Julien Nioche
Hi Am getting the following when running 'mvn clean package', have I forgotten something obvious? Julien *Failed tests: * * ForkParserIntegrationTest.testParserHandlingOfNonSerializable:210 expected: but was:* *Tests in error: * * ForkParserIntegrationTest.testAttachingADebuggerOnTheForkedParse

Re: [VOTE] Moving SCM to Git

2016-01-13 Thread Julien Nioche
+1 On 2 January 2016 at 04:30, Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hi Everyone, > > DISCUSS thread here: http://s.apache.org/wVE > > Time to officially VOTE on moving Tika to Git. I’ve made a wiki > page for our SCM explaining how to use Git at Apache, and how to >

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049248#comment-15049248 ] Julien Nioche commented on TIKA-1599: - Don't think that this is the version

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049239#comment-15049239 ] Julien Nioche commented on TIKA-1599: - Hi [~talli...@mitre.org] Haven't kept

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Julien Nioche
56, Julien Nioche wrote: > and I haven't tested it with Nutch either... > > On 20 April 2015 at 15:46, Julien Nioche > wrote: > >> I haven't tested the RC with Behemoth, it will probably have the same >> issue but I'll do like you and defer the update if t

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Julien Nioche
and I haven't tested it with Nutch either... On 20 April 2015 at 15:46, Julien Nioche wrote: > I haven't tested the RC with Behemoth, it will probably have the same > issue but I'll do like you and defer the update if that's the case. > > On 20 April 201

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Julien Nioche
I haven't tested the RC with Behemoth, it will probably have the same issue but I'll do like you and defer the update if that's the case. On 20 April 2015 at 15:23, Ken Krugler wrote: > > > From: Allison, Timothy B. > > Sent: April 20, 2015 5:11:04am PDT > > To: dev@tika.apache.org > > Subject:

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-14 Thread Julien Nioche
ed in favor of rc1! > > Details... > > I reran against govdocs1, and there aren't any major surprises. > > On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that > Julien Nioche created for us, and I ran against that as well. That turned > up TIKA-1605 and an

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-04-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487012#comment-14487012 ] Julien Nioche commented on TIKA-1599: - FWIW we've just added a JSoup based

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228305#comment-14228305 ] Julien Nioche commented on TIKA-1302: - FYI have extracted data from the CommonC

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226397#comment-14226397 ] Julien Nioche commented on TIKA-1302: - Sure, will get back to you re-details of

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226336#comment-14226336 ] Julien Nioche commented on TIKA-1302: - Hi [~talli...@apache.org] It would be eas

[jira] [Commented] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217749#comment-14217749 ] Julien Nioche commented on TIKA-595: Thanks Dave! > HtmlHandler does not

[jira] [Updated] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-595: --- Fix Version/s: 1.7 > HtmlHandler does not support multivalue metad

[jira] [Updated] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-595: --- Attachment: TIKA-595.patch Any reason why we wouldn't want to have multiple values in the metada

Re: Parse Html with Tika

2014-11-03 Thread Julien Nioche
Hi Linh You can specify a mapper to control what the html parser will filter or not. see https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639 for an example Julien On Monday, 3 November 2014, Linh Tang wrote: > Dear All, > > I am Phuong Linh, > I am u

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001612#comment-14001612 ] Julien Nioche commented on TIKA-1302: - How large do you want that batch to be? I

Re: [VOTE] Apache Tika 1.5 RC2

2014-02-10 Thread Julien Nioche
Hi Dave, +1 from me. Compiled fine on Linux Mint + tested Maven artefacts with Behemoth and ran a parse without problems. Thanks for doing this. Julien On 9 February 2014 22:53, Dave Meikle wrote: > Hi Guys, > > A new release candidate for the Tika 1.5 release is now available at: > http://p

Re: [VOTE] Apache Tika 1.5 RC1

2014-02-05 Thread Julien Nioche
Hi Dave Am trying to compile from src and am getting [ERROR] The build could not read 1 project -> [Help 1] [ERROR] [ERROR] The project org.apache.tika:tika-java7:1.5-SNAPSHOT (/data/tika-1.5/tika-java7/pom.xml) has 1 error [ERROR] Non-resolvable parent POM: Could not find artifact org.apac

Re: [DISCUSS] Integrate Apache Any23 into Apache Tika

2013-10-18 Thread Julien Nioche
Hi, I had a look at Any23 some time ago and found that it overlapped with quite a few other projects indeed but could (should?) have either relied on those projects (e.g. parsing and mimetype stuff to Tika) or delegated the functionality altogether (e.g. crawling to Nutch) instead of reinventing t

[ANNOUNCEMENT] 0.3 release of crawler-commons

2013-10-11 Thread Julien Nioche
Hi, Just to let you know that we have just released version 0.3 of crawler-commons. Crawler-commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce du

Re: Pluggable language detection

2012-03-22 Thread Julien Nioche
from the way we deal with the parsers? Thanks for your comments Julien On 21 March 2012 16:55, Ken Krugler wrote: > > On Mar 21, 2012, at 8:51am, Julien Nioche wrote: > > > Hi guys, > > > > Just wondering about the best way to make the language detection > pluggabl

Re: % of different content types out there on the web

2012-01-29 Thread Julien Nioche
That could be an interesting experiment to do with the commoncrawl dataset and Tika on Behemoth. Assuming of course that the detection is done correctly by Tika. Does anyone have a spare cluster on EC2 ;-) ? Julien On 28 January 2012 02:01, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.go

Re: [VOTE] Add Any23 to the Apache Incubator

2011-09-27 Thread Julien Nioche
+1 from me On 27 September 2011 06:18, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hi Folks, > > OK, the proposal period had died now and I'm now calling a formal VOTE on > the Any23 proposal located here: > > http://wiki.apache.org/incubator/Any23Proposal > > Proposal text

Re: index video and image format with nutch 1.3?

2011-09-10 Thread Julien Nioche
This is not a Tika issue. Ask this on the Nutch user list instead. On 9 September 2011 22:34, hadi wrote: > when i want to index video file with nutch 1.3 i get the following error : > > *Error parsing: file:///D:/film.avi: failed(2,0): Can't retrieve Tika > parser > for > mime-type video/x-ms

[jira] [Updated] (TIKA-612) Specify PDFBox options via ParseContext

2011-08-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-612: --- Attachment: Tika-612.patch Patch which allows the options to be specified via the Context object. WDYT

[jira] [Commented] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089498#comment-13089498 ] Julien Nioche commented on TIKA-696: The text of the watermark can be found towards

[jira] [Updated] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-696: --- Attachment: Demo+with+watermark.docx .docx version generated with MS Office Can't see the wate

[jira] [Commented] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089480#comment-13089480 ] Julien Nioche commented on TIKA-696: Can't see the watermark when saving and

[jira] [Updated] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-696: --- Attachment: Demo with watermark.doc Attached doc file containing a watermark > Extract waterma

[jira] [Created] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Attachments: Demo with watermark.doc It would be nice to store the text of a watermark as metadata. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: Towards 1.0

2011-05-21 Thread Julien Nioche
Hi It's a few months since 0.9 and our Tika in Action book is soon ready > for print, so I think it's good time to start planning for the 1.0 > release. > > There are a few odds and ends that I'd still like to sort out in the > trunk, but overall I think we're in a pretty much ready for the switch

[jira] [Assigned] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus

2011-05-21 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned TIKA-657: -- Assignee: Julien Nioche > Email parser gets into trouble on malformed html in enron cor

[jira] [Commented] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus

2011-05-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030467#comment-13030467 ] Julien Nioche commented on TIKA-657: Good idea. We need more tutorials and example

[jira] [Commented] (TIKA-649) NPE while parsing a .docx

2011-04-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026266#comment-13026266 ] Julien Nioche commented on TIKA-649: Sorry, should have tested on the trunk as

[jira] [Created] (TIKA-649) NPE while parsing a .docx

2011-04-27 Thread Julien Nioche (JIRA)
NPE while parsing a .docx --- Key: TIKA-649 URL: https://issues.apache.org/jira/browse/TIKA-649 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Reporter: Julien

[jira] [Updated] (TIKA-649) NPE while parsing a .docx

2011-04-27 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-649: --- Attachment: Popcorn.docx Wikipedia content on popcorn within a docx page > NPE while parsing a .d

Invisible text displayed for headings in doc files

2011-04-06 Thread Julien Nioche
Hi guys, We are currently getting duplicated text for the heading from .doc files e.g. *29. No Partnership or Agency XE "29. No Partnership or Agency" * XE seems to be a flag in MS Word http://taxonomist.tripod.com/indexing/wordflags.html but I don't think it should be displayed. Have I missed

[jira] Closed: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed TIKA-611. -- > PDFParser mixes the text from separate colu

[jira] Resolved: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved TIKA-611. Resolution: Fixed Committed revision 1079705. Opened TIKA-612 for the params via ParseContext

[jira] Created: (TIKA-612) Specify PDFBox options via ParseContext

2011-03-09 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Assignee: Julien Nioche Priority: Minor See https://issues.apache.org/jira/browse/TIKA-611. The options used by PDFBox are currently hard-coded in the PDFParser code; we will allow them to be specified via the ParseContext objects

[jira] Commented: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004035#comment-13004035 ] Julien Nioche commented on TIKA-611: The current behaviour is incorrect not only

[jira] Commented: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003884#comment-13003884 ] Julien Nioche commented on TIKA-611: No objections? Shall I commit this? > PD

[jira] Created: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-07 Thread Julien Nioche (JIRA)
: 0.9 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.0 As reported on the dev list by Michael Schmitz : bq. I don't think the current snapshot is parsing articles (pdfs with columns/beads) correctly. The text is not in the right order

[jira] Resolved: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-03-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved TIKA-597. Resolution: Fixed Fix Version/s: 1.0 Committed revision 1076300 Thanks Benson > Bo

[jira] Commented: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-03-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001497#comment-13001497 ] Julien Nioche commented on TIKA-597: Benson, I can't see any TikaRuntimeExc

[jira] Assigned: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-03-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned TIKA-597: -- Assignee: Julien Nioche (was: Chris A. Mattmann) > Bogus exception handler

Re: [VOTE] Apache Tika 0.9 Release Candidate #1

2011-02-15 Thread Julien Nioche
> > > Please vote on releasing these packages as Apache Tika 0.9. The vote is > open > for the next 72 hours. Only votes from Tika PMC are binding, but everyone > is welcome to check the release candidate and voice their approval or > disapproval. The vote passes if at least three binding +1 votes

[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-11-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965286#action_12965286 ] Julien Nioche commented on TIKA-461: patch -p1 failed peb...@lucid-vostro:/data/

[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-11-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965271#action_12965271 ] Julien Nioche commented on TIKA-461: Benjamin, thanks for your patch. Could you gene

[jira] Updated: (TIKA-461) RFC822 messages not parsed

2010-11-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-461: --- Attachment: testRFC822-multipart Test document for mail parsing with multiparts, text + html

Re: Furthering Along TIKA-461

2010-11-25 Thread Julien Nioche
Hi Ben, Great! I still haven't found the time to work on Nick's suggestions but you can definitely work on the tests if you want to and add some of the emails you mentioned. Having some cases of multipart with HTML and txt content + images and attachments would be good. Thanks Julien On 25 Nove

[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-11-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930180#action_12930180 ] Julien Nioche commented on TIKA-461: Nope. I was planning to refactor the parser f

[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-09-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915708#action_12915708 ] Julien Nioche commented on TIKA-461: Nick, Thanks for taking the time to revie

[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-09-27 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915269#action_12915269 ] Julien Nioche commented on TIKA-461: Hi guys, Could anyone have a look at the p

[jira] Updated: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-461: --- Issue Type: New Feature (was: Bug) changed from bug to new feature > RFC822 messages not par

[jira] Updated: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-461: --- Attachment: TIKA-461.patch This patch contains an initial version of the RFC822Parser which uses

[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906468#action_12906468 ] Julien Nioche commented on TIKA-461: I'll have a look at mime4j and try to

[jira] Assigned: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned TIKA-461: -- Assignee: Julien Nioche > RFC822 messages not par

[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-08-17 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899465#action_12899465 ] Julien Nioche commented on TIKA-463: Looks good. I must be missing something obvious

[jira] Resolved: (TIKA-460) HTMLHandler misses treatment of A elements

2010-08-14 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved TIKA-460. Resolution: Fixed Committed revision 985444 The A elements are now processed correctly when using

[jira] Commented: (TIKA-460) HTMLHandler misses treatment of A elements

2010-08-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898335#action_12898335 ] Julien Nioche commented on TIKA-460: Hi Ken, correct. The A's get bypassed

Re: Post link to Tika in Action book on Tika website?

2010-08-02 Thread Julien Nioche
+1 from me On 2 August 2010 18:33, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hi Tika community, > > Jukka Zitting and I are working on the Tika in Action book [1]. How would > everyone feel about us posting a link to it on the Tika website [2]? > > If so, I'll prepare a p

[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-07-27 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892958#action_12892958 ] Julien Nioche commented on TIKA-463: Am very tempted to push things one step further

[jira] Closed: (TIKA-466) Feed Parser

2010-07-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed TIKA-466. -- > Feed Parser > --- > > Key: TIKA-466 >

[jira] Commented: (TIKA-466) Feed Parser

2010-07-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890380#action_12890380 ] Julien Nioche commented on TIKA-466: Thanks Chris for reviewing and committin

[jira] Commented: (TIKA-147) Add Flash parser

2010-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889883#action_12889883 ] Julien Nioche commented on TIKA-147: There is http://www.jswiff.com/licensing/ w

[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-463: --- Attachment: TIKA-463.patch Patch which implements some of the ideas described in this issue

[jira] Updated: (TIKA-466) Feed Parser

2010-07-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-466: --- Attachment: TIKA-466.patch > Feed Parser > --- > > K

[jira] Created: (TIKA-466) Feed Parser

2010-07-16 Thread Julien Nioche (JIRA)
Feed Parser --- Key: TIKA-466 URL: https://issues.apache.org/jira/browse/TIKA-466 Project: Tika Issue Type: New Feature Components: parser Reporter: Julien Nioche Priority: Minor

[jira] Commented: (TIKA-460) HTMLHandler misses treatment of A elements

2010-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887718#action_12887718 ] Julien Nioche commented on TIKA-460: this would work if we had in the list of

[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716 ] Julien Nioche commented on TIKA-463: creating a LinksHtmlMapper : +1, that would

[jira] Updated: (TIKA-460) HTMLHandler misses treatment of A elements

2010-07-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-460: --- Attachment: TIKA-460.patch > HTMLHandler misses treatment of A eleme

[jira] Created: (TIKA-460) HTMLHandler misses treatment of A elements

2010-07-08 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 0.8 The A elements should be processed before any other safe element, otherwise it never happens -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the

[jira] Created: (TIKA-458) Specify HTMLHandler via Context

2010-07-07 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Attachments: TIKA-458.patch One of the recent changes on Tika is the possibility to specify a custom HTMLMapper via the Context - which I think is an elegant mechanism. I was wondering whether there would be a reason NOT to be able to do the same for the HTMLHandler

[jira] Created: (TIKA-457) HTMLParser gets an early event

2010-07-07 Thread Julien Nioche (JIRA)
HTMLParser gets an early event -- Key: TIKA-457 URL: https://issues.apache.org/jira/browse/TIKA-457 Project: Tika Issue Type: Bug Components: parser Reporter: Julien Nioche I am

[jira] Updated: (TIKA-458) Specify HTMLHandler via Context

2010-07-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-458: --- Attachment: TIKA-458.patch > Specify HTMLHandler via Cont

Specify HTMLHandler via Context

2010-07-07 Thread Julien Nioche
Hi guys, One of the recent changes on Tika is the possibility to specify a custom HTMLMapper via the Context - which I think is an elegant mechanism. I was wondering whether there would be a reason NOT to be able to do the same for the HTMLHandler and if nothing is passed via the Context, rely on

[jira] Closed: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed TIKA-454. -- Resolution: Fixed Committed revision 960487 > Illegal Charset Name crashes HTMLPar

[jira] Assigned: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned TIKA-454: -- Assignee: Julien Nioche > Illegal Charset Name crashes HTMLPar

[jira] Updated: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated TIKA-454: --- Attachment: TIKA-454.patch Trivial fix - simply catch the exception and let the guesswork begin. The
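
The "catch the exception and let the guesswork begin" idea can be sketched roughly as follows. This is a minimal illustration using only java.nio.charset, not the code from TIKA-454.patch; the class and method names (CharsetFallback, resolveOrNull) are hypothetical, and it assumes the caller treats a null result as "fall back to charset detection".

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration only; not Tika's actual HtmlParser code.
final class CharsetFallback {

    // Returns the named charset, or null when the declared name is missing,
    // syntactically illegal, or not supported by this JVM.
    static Charset resolveOrNull(String declaredName) {
        if (declaredName == null) {
            return null;
        }
        try {
            return Charset.forName(declaredName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Bad value in the meta tag: ignore it so the caller can
            // fall back to guessing the encoding from the content.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"utf-8", "iso-8859-1, utf-8", "no-such-charset-xyz"};
        for (String s : samples) {
            System.out.println(s + " -> " + resolveOrNull(s));
        }
    }
}
```

Catching both exception types matters here: Charset.forName throws IllegalCharsetNameException for names containing illegal characters and UnsupportedCharsetException for well-formed names the JVM does not know.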

[jira] Created: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-02 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Fix For: 0.8 As reported by Andrzej [1], the HTMLParser crashes when the charset found in meta is illegal e.g. [1] http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3c4c2a102d.7090...@getopt.org%3e -- This message is automatically

[jira] Commented: (TIKA-448) Tika FLVParser hangs

2010-06-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883660#action_12883660 ] Julien Nioche commented on TIKA-448: I have seen similar cases with FLV when the con

Re: svnpubsub for the Tika web site

2010-06-21 Thread Julien Nioche
Same here +1 On 21 June 2010 18:00, Ken Krugler wrote: > Hi Jukka, > > I can't think of any cons, so +1 > > -- Ken > > > On Jun 21, 2010, at 3:02am, Jukka Zitting wrote: > > Hi, >> >> The PDFBox web site [1] is now managed using the new svnpubsub >> mechanism set up by the infra team. Basica

Re: Welcome Julien Nioche, new Tika PMC member and committer

2010-06-06 Thread Julien Nioche
), text analysis and I recently started an open source project named Behemoth which allows scaling text analysis applications using Hadoop. Best, Julien Nioche -- DigitalPebble Ltd Open Source Solutions for Text Engineering http://www.digitalpebble.com Julien On 5 June 2010 23:42, Mattmann

[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871623#action_12871623 ] Julien Nioche commented on TIKA-433: Could do. I can't see a place in Tika

[jira] Commented: (TIKA-430) Automatically let all valid XHTML 1.0 attributes through from HTML documents

2010-05-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871585#action_12871585 ] Julien Nioche commented on TIKA-430: The method mapSafeAttribute(String element

[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871544#action_12871544 ] Julien Nioche commented on TIKA-433: You can do that with [Behemoth|