Dominik, I've downloaded one of the WARC files from CC-MAIN-2015-06 (https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422115855094.38/warc/CC-MAIN-20150124161055-00000-ip-10-180-212-252.ec2.internal.warc.gz, 1.2 GB), and it contains at least PDFs and DOCs in the crawled data.
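(As an illustration only, not from the thread: one quick way to confirm what binary formats a WARC payload contains is to sniff the leading magic bytes, without committing to any particular WARC-parsing library. The label names and magic list below are my own assumptions and far from exhaustive.)

```python
from typing import Optional

# Hypothetical magic-byte table; only a few formats relevant to POI/PDFBox.
MAGIC_BYTES = {
    b"%PDF-": "pdf",                                  # PDF documents
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",      # legacy .doc/.xls/.ppt container
    b"PK\x03\x04": "zip",                             # OOXML (.docx/.xlsx/...) and other zips
}

def sniff_payload(payload: bytes) -> Optional[str]:
    """Return a coarse format label for the start of a payload, or None."""
    for magic, label in MAGIC_BYTES.items():
        if payload.startswith(magic):
            return label
    return None
```

Note that OOXML files are plain zip archives at this level, so telling a .docx from an ordinary .zip would need a deeper look inside the container.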
--
Best regards,
Konstantin Gribov

On Fri, 3 Apr 2015 at 18:52, Dominik Stadler <[email protected]> wrote:

> Hi,
>
> I am very interested, as I have been following the Common Crawl activity
> for some time already. It sounds like a neat idea to do the check as soon
> as the crawl is done; are the binary documents already part of the
> crawl data?
>
> Actually, I am currently playing around with the Common Crawl URL Index
> (http://blog.commoncrawl.org/2013/01/common-crawl-url-index/), which is
> a much smaller download (230 GB) and contains only URLs, without all the
> additional information.
>
> The index is a bit outdated and currently covers only half of the full
> Common Crawl, but there are people working on refreshing it for the
> latest crawls.
>
> I wrote a small app which extracts interesting URLs out of these (i.e.
> files that POI should be able to open), resulting in approx. 6.6 million
> links! Based on some tests, the full download would be around 3.3 million
> documents requiring approximately 3 TB of storage. Note that this is
> still an old crawl with only half of the data included, so a current
> crawl will be considerably bigger!
>
> Running them through the integration testing that we added in POI
> (which performs text and property extraction, but also some other
> POI-related actions) already showed a few cases where slightly off-spec
> documents can cause bugs to appear; some initial related commits will
> follow shortly...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <[email protected]> wrote:
> > All,
> >
> > What do you think?
> >
> > https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
> >
> > On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
> >
> > CommonCrawl currently has the WET format that extracts plain text from
> > web pages. My guess is that this is text stripped from text-y formats.
> > Let me know if I'm wrong!
> >
> > Would there be any interest in adding another format, WETT (WET-Tika),
> > or supplementing the current WET by using Tika to extract contents
> > from binary formats too: PDF, MSWord, etc.?
> >
> > Julien Nioche kindly carved out 220 GB for us to experiment with on
> > TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a
> > Rackspace VM. But I'm wondering now if it would make more sense to have
> > CommonCrawl run Tika as part of its regular process and make the output
> > available in one of your standard formats.
> >
> > CommonCrawl consumers would get Tika output, and the Tika dev community
> > (including its dependencies, PDFBox, POI, etc.) could get the stack
> > traces to help prioritize bug fixes.
> >
> > Cheers,
> >
> > Tim
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
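(Dominik's URL-extraction app isn't shown in the thread. Purely as an illustration of the kind of filter he describes, here is a tiny sketch; the extension list is my guess at "files that POI should be able to open", not the list his app actually uses.)

```python
from urllib.parse import urlparse

# Assumed extension list; the real app may match differently.
POI_EXTENSIONS = (".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".vsd", ".msg")

def interesting_urls(urls):
    """Yield URLs whose path component ends with a POI-supported extension."""
    for url in urls:
        if urlparse(url).path.lower().endswith(POI_EXTENSIONS):
            yield url
```

Matching on the parsed path (rather than the raw URL string) keeps query strings like `?download=1` from hiding the extension.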

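(A rough sketch, not anyone's actual implementation, of the feedback loop Tim proposes: run an extractor over each document and collect stack traces from failures so the parser communities can prioritize fixes. `extract` here stands in for a real Tika/POI/PDFBox call, and the bookkeeping is purely illustrative.)

```python
import traceback
from collections import Counter

def run_corpus(docs, extract):
    """Apply `extract` to each (name, payload) pair; tally failures by
    exception type and keep one representative stack trace per type."""
    failures = Counter()
    traces = {}
    for name, payload in docs:
        try:
            extract(payload)
        except Exception as exc:
            key = type(exc).__name__
            failures[key] += 1
            # Keep the first trace seen for this exception type.
            traces.setdefault(key, traceback.format_exc())
    return failures, traces
```

Grouping by exception type is one plausible way to surface the "slightly off-spec documents" Dominik mentions: the most frequent failure types would be the natural candidates for bug reports.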