Dominik,
I've downloaded one of the WARC files (from CC-MAIN-2015-06,
https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422115855094.38/warc/CC-MAIN-20150124161055-00000-ip-10-180-212-252.ec2.internal.warc.gz,
1.2 GB), and it does contain PDFs and DOCs among the crawled data.
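A quick way to check this is to stream the gzipped WARC and tally the HTTP Content-Type headers. A minimal sketch in Python, using only the standard library; the MIME-type list below is my own assumption, not an official Common Crawl mapping:

```python
import gzip
import io
import re
from collections import Counter

# MIME types we consider "binary documents" (illustrative list only).
BINARY_TYPES = {
    "application/pdf": "pdf",
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": "docx",
}

# Matches a Content-Type header line; parameters after ';' are ignored.
CT_RE = re.compile(rb"^Content-Type:\s*([\w.+/-]+)", re.I)

def count_binary_types(stream):
    """Tally interesting MIME types seen in Content-Type headers."""
    counts = Counter()
    for line in stream:
        m = CT_RE.match(line)
        if m:
            mime = m.group(1).decode("ascii", "replace").lower()
            if mime in BINARY_TYPES:
                counts[BINARY_TYPES[mime]] += 1
    return counts

# Usage (file name illustrative):
# with gzip.open("CC-MAIN-...warc.gz", "rb") as f:
#     print(count_binary_types(f))
```

Counting both the WARC record headers and the embedded HTTP headers this way double-counts nothing for our purposes, since only the HTTP response carries the document MIME type.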

-- 
Best regards,
Konstantin Gribov

On Fri, 3 Apr 2015 at 18:52, Dominik Stadler <[email protected]> wrote:

> Hi,
>
> I am very interested as I have been following the Common Crawl activity
> for some time already. It sounds like a neat idea to do the check
> when the crawl is done; are the binary documents already part of the
> crawl data?
>
> Actually I am currently playing around with the Common Crawl URL Index
> (http://blog.commoncrawl.org/2013/01/common-crawl-url-index/), which is
> a much smaller download (230 GB) and only contains URLs without
> all the additional information.
>
> The index is a bit outdated and currently only covers half of the full
> Common Crawl; however, there are people working on refreshing it for
> the latest crawls.
>
> I wrote a small app which extracts interesting URLs out of these (i.e.
> files that POI should be able to open), resulting in approx. 6.6
> million links! Based on some tests, the full download would contain
> around 3.3 million documents requiring approximately 3 TB of
> storage. Note that this is still an old crawl with only half of the
> data included, so a current crawl will be considerably bigger!
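The URL-extraction step described above can be sketched roughly as follows; the extension list is illustrative, not necessarily the one Dominik's app uses:

```python
import re

# File extensions POI can plausibly open (illustrative list only).
POI_EXTENSIONS = re.compile(r"\.(xls|xlsx|doc|docx|ppt|pptx|vsd|msg)$", re.I)

def interesting_urls(lines):
    """Yield URLs from an index dump whose path ends in an Office-style
    extension, ignoring query strings and fragments."""
    for line in lines:
        url = line.strip()
        path = url.split("?", 1)[0].split("#", 1)[0]
        if POI_EXTENSIONS.search(path):
            yield url
```

Run over the index one line at a time, this keeps memory flat even for a multi-hundred-gigabyte download.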
>
> Running them through the integration testing that we added in POI
> (which performs text and property extraction but also some other
> POI-related actions) already showed a few cases where slightly
> off-spec documents can cause bugs to appear; some initial related
> commits will follow shortly...
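The shape of such a mass-regression harness, sketched in Python; the `extract` callable here is a placeholder for a real POI or Tika text-extraction call, not an actual API:

```python
import traceback
from collections import defaultdict

def run_corpus(paths, extract):
    """Run `extract` over every document, collecting stack traces grouped
    by exception type so the most common failure modes can be prioritised."""
    failures = defaultdict(list)
    ok = 0
    for path in paths:
        try:
            extract(path)
            ok += 1
        except Exception as exc:
            failures[type(exc).__name__].append(
                (path, traceback.format_exc()))
    return ok, failures
```

Grouping by exception type is what makes the crawl-scale output actionable: one off-spec quirk hit by thousands of documents shows up as a single large bucket rather than thousands of indistinguishable log lines.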
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <[email protected]>
> wrote:
> > All,
> >
> > What do you think?
> >
> >
> > https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
> >
> >
> > On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
> > CommonCrawl currently has the WET format that extracts plain text from
> web pages.  My guess is that this is text stripping from text-y formats.
> Let me know if I'm wrong!
> >
> > Would there be any interest in adding another format, WETT (WET-Tika), or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.?
> >
> > Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
> VM.  But I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
> >
> > CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
> >
> > Cheers,
> >
> >           Tim
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
>
