POI Colleagues,
If you'd like a table (or better yet, the h2 database) of the results of runs
against govdocs1 with stack traces, let me know where I should post it. This
came in quite handy for https://issues.apache.org/jira/browse/TIKA-1512, where
the reporter couldn't share the document. I'll be rerunning this process soon
once we have a release candidate for Tika 1.8.
The downside to govdocs1 for POI is that there are very few docx/pptx/xlsx.
I'm hoping to unzip the slice of Common Crawl that Julien Nioche grabbed for
us on TIKA-1302 soon, and I'll let you know what's in there.
Best,
Tim
-----Original Message-----
From: Andreas Beeker [mailto:[email protected]]
Sent: Friday, April 03, 2015 1:12 PM
To: POI Developers List
Cc: [email protected]
Subject: Re: FW: Any interest in running Apache Tika as part of CommonCrawl?
Hi,
similar to Dominik's approach of checking the file base for parsing errors,
I'd like to scan for certain file constellations, for the typical "left over
bytes" error or other record combinations which I can't reproduce with my
MS/Libre office versions.
I haven't thought about how it's actually done, but I think logging the
location in the integration tests and later manually checking the
corresponding files should be sufficient.
Best wishes,
Andi
On 03.04.2015 17:51, Dominik Stadler wrote:
> Hi,
>
> I am very interested, as I have been following the Common Crawl activity
> for some time already. It sounds like a neat idea to do the check already
> when the crawl is done. Are the binary documents already part of the
> crawl-data?
>
> ...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <[email protected]>
> wrote:
>> All,
>>
>> What do you think?
>>
>>
>>