Hi, I am very interested, as I have been following the Common Crawl activity for some time already. It sounds like a neat idea to do the check right when the crawl is done; are the binary documents already part of the crawl-data?
Actually, I am currently playing around with the Common Crawl URL Index (http://blog.commoncrawl.org/2013/01/common-crawl-url-index/), which is a much smaller download (230 GB) and contains only URLs, without all the additional information. The index is a bit outdated and currently covers only half of the full Common Crawl, but there are people working on refreshing it for the latest crawls.

I wrote a small app which extracts interesting URLs from it (i.e. files that POI should be able to open), resulting in approximately 6.6 million links! Based on some tests, fetching the full set would yield around 3.3 million documents requiring approximately 3 TB of storage. Note that this is still an old crawl with only half of the data included, so a current crawl will be considerably bigger!

Running them through the integration testing that we added in POI (which performs text and property extraction, but also some other POI-related actions) has already shown a few cases where slightly off-spec documents can cause bugs to appear; some initial related commits will follow shortly...

Dominik.

On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <[email protected]> wrote:
> All,
>
> What do you think?
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected]<mailto:[email protected]> wrote:
>
> CommonCrawl currently has the WET format that extracts plain text from web
> pages. My guess is that this is text stripping from text-y formats. Let me
> know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika), or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.
> But, I'm wondering now if it would make more sense to have CommonCrawl run
> Tika as part of its regular process and make the output available in one of
> your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to
> help prioritize bug fixes.
>
> Cheers,
>
> Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> ---------------------------------------------------------------------
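P.S. For anyone curious what the URL filtering step looks like: the app itself isn't shown here, but the idea can be sketched roughly as below. This is only an illustrative sketch, not the actual app; the extension list and function names are my own assumptions about what "files that POI should be able to open" means (index entries are assumed to be plain URLs, one per line).

```python
import os
from urllib.parse import urlparse

# Assumed (illustrative) set of extensions that POI can open.
POI_EXTENSIONS = {
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".vsd", ".msg", ".pub",
}

def looks_like_poi_file(url: str) -> bool:
    """Return True if the URL path ends in an extension POI handles."""
    path = urlparse(url).path.lower()
    return os.path.splitext(path)[1] in POI_EXTENSIONS

def extract_interesting_urls(lines):
    """Yield only the URLs worth downloading for POI integration tests."""
    for line in lines:
        url = line.strip()
        if url and looks_like_poi_file(url):
            yield url
```

The filter inspects only the URL path (ignoring query strings), which is why a link like `.../report.xlsx?dl=1` is still kept while `.../index.html` is dropped.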
