[jira] [Commented] (TIKA-701) Fix problems with TemporaryFiles

2011-09-01 Thread Paul Jakubik (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095422#comment-13095422 ] Paul Jakubik commented on TIKA-701: --- This is a very important fix. Will it be rele

Supported Document Format web page out of date

2010-11-17 Thread Paul Jakubik
Hi, I was looking at http://tika.apache.org/0.8/formats.html and found several issues with it: - Says that it lists the formats supported by Tika 0.6 instead of 0.8. - Says that it has links to parser class javadocs when it doesn't. - Though the page promises that the parser class java d

Re: Hudson build is still unstable: Tika-trunk #395

2010-11-01 Thread Paul Jakubik
On Sun, Oct 31, 2010 at 8:16 PM, Jukka Zitting wrote: > I don't think the time is right yet for upgrading Tika's platform > requirement from Java 5 to 6. > > Java 6 has been out for almost 4 years (http://en.wikipedia.org/wiki/Java_version_history). When will it be the right time to require Java

Faster charset detection or turn off charset detection?

2010-08-12 Thread Paul Jakubik
Hi, I'm wondering if there is a way to turn off character set detection when parsing with the AutoDetectParser, or if there is a way to speed up character set detection. I ran a test that converted 52,717 documents to text. The documents were emails embedded in a .tar file. With character set de

Metadata Discussion Status

2010-08-02 Thread Paul Jakubik
Hi, A while ago I added the http://wiki.apache.org/tika/MetadataDiscussion page to the Tika wiki. Since then, with the help of Jukka Zitting, a solution has been described for using the current Tika library to capture nested document metadata and associate that with the text extracted for each ne

Re: Packages and attributes

2010-08-02 Thread Paul Jakubik
I have added Jukka Zitting's recursive metadata example to the Tika wiki at http://wiki.apache.org/tika/RecursiveMetadata. I also added some notes on what I did so I could get the metadata for a nested document along with the text for that document. Finally, I modified the http://wiki.apache.org/ti

Re: Packages and attributes

2010-07-16 Thread Paul Jakubik
Thank you for this example! Is there any chance this example could be added to the Tika wiki? On Fri, Jul 16, 2010 at 1:30 AM, Jukka Zitting wrote: > Hi, > > On Fri, Jul 16, 2010 at 2:43 AM, Paul Jakubik > wrote: > > On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting >

Re: Packages and attributes

2010-07-15 Thread Paul Jakubik
On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting wrote: > The way I recommend is to pass a custom Parser implementation through > the ParseContext. This gives you detailed access to each component > document. > > I looked at the code a little further, and I don't see exactly how I can do this. I am

Re: Packages and attributes

2010-07-15 Thread Paul Jakubik
On Thu, Jul 15, 2010 at 6:30 AM, Nick Burch wrote: > > Having looked through your proposed solutions, I can't see easy ways to > implement these use cases: > * enumerate all the Metadata objects at this depth > eg top level has one Metadata object (for the parent file), 1 level > down may have

Re: Packages and attributes

2010-07-15 Thread Paul Jakubik
On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting wrote: > The way I recommend is to pass a custom Parser implementation through > the ParseContext. This gives you detailed access to each component > document. > > You noted that this approach wouldn't work for recursive metadata. Why? > > I didn't th

Re: Packages and attributes

2010-07-14 Thread Paul Jakubik
On Mon, Jul 12, 2010 at 10:37 AM, Nick Burch wrote: > Assuming I've got all of the above correct, it might be worth creating a > wiki page for this (probably + referencing jira entry), and start trying to > work up a proposed solution that'll handle all the above problems and use > cases. > I cre

Re: Packages and attributes

2010-07-12 Thread Paul Jakubik
On Mon, Jul 12, 2010 at 12:59 PM, Alex Ott wrote: > > May be it worth to separate metadata of top-level objects from metadata of > embedded objects? And allow to traverse through hierarchy of embedded > objects? And provide several implementations, something like: collector of > metadata for all

Re: Packages and attributes

2010-07-12 Thread Paul Jakubik
On Mon, Jul 12, 2010 at 10:37 AM, Nick Burch wrote: > On Mon, 12 Jul 2010, Paul Jakubik wrote: > >> I'm using tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd like >> to get access to the metadata for the individual files inside of the >> packa

Packages and attributes

2010-07-12 Thread Paul Jakubik
Hi, I'm using Tika to parse packages (zip, tar.gz, tar.bz2, etc.) and I'd like to get access to the metadata for the individual files inside of the package. It looks like there has been some discussion about how to provide the metadata, and from looking at the code I don't think any of the propos