Hi,

I am very interested, as I have been following Common Crawl activity
for some time now. It sounds like a neat idea to run the check when
the crawl is done; are the binary documents already part of the
crawl-data?

I am currently playing around with the Common Crawl URL Index
(http://blog.commoncrawl.org/2013/01/common-crawl-url-index/), which
is a much smaller download (230 GB) and contains only the URLs,
without all the additional information.

The index is a bit outdated and currently covers only half of the
full crawl, but there are people working on refreshing it for the
latest crawls.

I wrote a small app which extracts interesting URLs from the index
(i.e. files that POI should be able to open), resulting in approx.
6.6 million links! Based on some tests, the full download would yield
around 3.3 million documents requiring approximately 3 TB of storage.
Note that this is still an old crawl with only half of the data
included, so a current crawl will be considerably bigger!
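The filtering step can be sketched roughly like this (a minimal,
hypothetical version of my app: the extension list is an illustrative
subset, and reading one URL per line is a simplification of the
actual index format):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class PoiUrlFilter {
    // Extensions of formats POI can open -- an illustrative subset only
    private static final List<String> POI_EXTENSIONS = Arrays.asList(
            ".xls", ".xlsx", ".doc", ".docx", ".ppt", ".pptx", ".vsd", ".msg");

    // Returns true if the URL path ends with one of the POI-related extensions
    public static boolean isInteresting(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        // strip any query string so "file.doc?foo=bar" still matches
        int q = lower.indexOf('?');
        if (q >= 0) {
            lower = lower.substring(0, q);
        }
        for (String ext : POI_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // expects one URL per line; args[0] is a placeholder for the index dump
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> hits = reader.lines()
                    .filter(PoiUrlFilter::isInteresting)
                    .collect(Collectors.toList());
            System.out.println(hits.size() + " interesting URLs");
        }
    }
}
```

The extension match is deliberately crude; a real run would also want
to deduplicate URLs and cap the download size per host.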

Running them through the integration testing that we added in POI
(which performs text and property extraction, but also some other
POI-related actions) has already turned up a few cases where slightly
off-spec documents cause bugs to appear; some initial related commits
will follow shortly...
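The general shape of such a test run is to apply an extraction step
to every file, collect the failures grouped by exception, and use the
counts to decide which bugs to fix first. A minimal sketch (the
`extract` callback stands in for the actual POI extraction call and
is purely illustrative):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

public class CorpusStressRunner {
    /** Runs the given extraction step over all files and groups failures
     *  by exception class name, so the most frequent problems surface first. */
    public static Map<String, List<String>> run(List<File> files,
                                                Consumer<File> extract) {
        Map<String, List<String>> failuresByException = new TreeMap<>();
        for (File file : files) {
            try {
                extract.accept(file);   // e.g. text + property extraction
            } catch (RuntimeException e) {
                // record which files trigger which exception type
                failuresByException
                        .computeIfAbsent(e.getClass().getName(), k -> new ArrayList<>())
                        .add(file.getName());
            }
        }
        return failuresByException;
    }
}
```

Sorting the resulting map entries by list size gives a simple
priority order for investigating off-spec documents.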

Dominik.

On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <[email protected]> wrote:
> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, 
> [email protected] wrote:
> CommonCrawl currently has the WET format that extracts plain text from web 
> pages.  My guess is that this is text stripping from text-y formats.  Let me 
> know if I'm wrong!
>
> Would there be any interest in adding another format, WETT (WET-Tika), or 
> supplementing the current WET by using Tika to extract contents from binary 
> formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on 
> TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace VM. 
>  But, I'm wondering now if it would make more sense to have CommonCrawl run 
> Tika as part of its regular process and make the output available in one of 
> your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community 
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
> help prioritize bug fixes.
>
> Cheers,
>
>           Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
