RE: FW: Any interest in running Apache Tika as part of CommonCrawl?

Allison, Timothy B. Fri, 03 Apr 2015 10:58:52 -0700

POI Colleagues,

  If you'd like a table (or better yet, the h2 database) of the results of runs 
against govdocs1 with stack traces, let me know where I should post it.  This 
came in quite handy for https://issues.apache.org/jira/browse/TIKA-1512, where 
the reporter couldn't share the document.  I'll be rerunning this process soon, 
once we have a release candidate for Tika 1.8.


  The downside to govdocs1 for POI is that it contains very few docx/pptx/xlsx 
files.  I'm hoping to unzip the slice of Common Crawl that Julien Nioche grabbed 
for us on TIKA-1302 soon, and I'll let you know what's in there.

             Best,

                       Tim 
  

-----Original Message-----
From: Andreas Beeker [mailto:[email protected]] 
Sent: Friday, April 03, 2015 1:12 PM
To: POI Developers List
Cc: [email protected]
Subject: Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

Hi,

Similar to Dominik's approach of checking the file base for parsing errors,
I'd like to scan for certain file constellations, e.g. for the typical
"left over bytes" error or for other record combinations which I can't
reproduce with my MS Office/LibreOffice versions.

I haven't thought about how it's actually done, but I think logging the 
location in the
integration tests and later manually checking the corresponding files should be
sufficient.
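
A minimal sketch of that logging idea, not taken from the POI integration
tests: it walks a directory of test files, runs a parser on each one, and
records the file location plus the stack trace of every failure in a CSV
file for later manual review. The names scan_corpus, parse and log_path are
hypothetical, and parse stands in for whatever parser is actually under test.

```python
import csv
import traceback
from pathlib import Path


def scan_corpus(root, parse, log_path):
    """Run `parse` on every file under `root` and log each failure to a CSV.

    `parse` is a hypothetical stand-in for the parser under test; it is
    expected to raise an exception on a file it cannot handle.
    Keep `log_path` outside `root`, otherwise the log itself gets scanned.
    Returns the number of files that raised.
    """
    failures = 0
    with open(log_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "exception", "message", "stack_trace"])
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            try:
                parse(path)
            except Exception as exc:
                failures += 1
                # Record where the problem file lives and how it failed.
                writer.writerow(
                    [str(path), type(exc).__name__, str(exc), traceback.format_exc()]
                )
    return failures
```

Filtering the resulting CSV on the message column (for example for
"left over bytes") would then give the list of files to inspect by hand.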

Best wishes,
Andi



On 03.04.2015 17:51, Dominik Stadler wrote:
> Hi,
>
> I am very interested, as I have been following the Common Crawl activity
> for some time already. It sounds like a neat idea to do the check while
> the crawl is being done. Are the binary documents already part of the
> crawl-data?
>
> ...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <[email protected]> 
> wrote:
>> All,
>>
>> What do you think?
>>
>>
>>


