Please try again with the latest version of the project; hopefully I fixed this at https://github.com/centic9/CommonCrawlDocumentDownload now.
Thanks... Dominik.

On Mon, Jun 1, 2015 at 6:32 PM, Dominik Stadler <[email protected]> wrote:
> That's likely on my side, sorry, I'll take a look....
>
> Dominik
>
> On 01.06.2015 16:51, "Allison, Timothy B." <[email protected]> wrote:
>>
>> Dominik,
>>    Thank you for making this available! I'm trying to build/run now, and
>> I'm getting this... is this user error?
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:20: error: package org.dstadler.commons.testing does not exist
>> import org.dstadler.commons.testing.MockRESTServer;
>>                                    ^
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:21: error: package org.dstadler.commons.testing does not exist
>> import org.dstadler.commons.testing.TestHelpers;
>>                                    ^
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/ExtensionsTest.java:31: error: package org.dstadler.commons.testing does not exist
>> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Extensions.class);
>> ^
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
>>      ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
>>                                  ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
>>      ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
>>                                  ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:179: error: cannot find symbol
>> TestHelpers.assertContains(e, "500", "localhost", Integer.toString(server.getPort()));
>> ^
>>   symbol:   variable TestHelpers
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:205: error: package org.dstadler.commons.testing does not exist
>> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Utils.class);
>> ^
>> 9 errors
>> :compileTestJava FAILED
>>
>> -----Original Message-----
>> From: Dominik Stadler [mailto:[email protected]]
>> Sent: Wednesday, April 22, 2015 4:07 PM
>> To: POI Developers List
>> Cc: [email protected]; [email protected]; [email protected]
>> Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?
>>
>> Hi,
>>
>> I have now published a first version of a tool to download binary data
>> of certain file types from the Common Crawl URL Index. Currently it
>> only supports the previous index format, so the data is from around
>> 2012/2013, but this still provides tons of files for mass-testing of
>> our frameworks.
>>
>> I used a small part of the files to run some integration testing
>> locally and immediately found a few issues where specially formatted
>> files broke Apache POI.
>>
>> The project is currently available at
>> https://github.com/centic9/CommonCrawlDocumentDownload; it has options
>> for downloading files as well as first retrieving a list of all
>> interesting files and then downloading them later. But it should also
>> be easily possible to change it so it processes the files on the fly
>> (if you want to spare the estimated >300 GB of disk space it will need,
>> for example, to store the files interesting for Apache POI testing).
>>
>> Naturally, running this on Amazon EC2 machines can speed up the
>> downloading a lot, as the network access to Amazon S3 is much faster there.
>>
>> Please give it a try if you are interested and let me know what you think.
>>
>> Dominik.
>>
>> On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <[email protected]> wrote:
>> > All,
>> >
>> > We just heard back from a very active member of Common Crawl. I don't
>> > want to clog up our dev lists with this discussion (more than I have!), but
>> > I do want to invite all to participate in the discussion, planning and
>> > potential patches.
>> >
>> > If you'd like to participate, please join us here:
>> > https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>> >
>> > I've tried to follow Commons' vernacular, and I've added [COMPRESS] to
>> > the Subject line. Please invite others who might have an interest in this
>> > work.
>> >
>> > Best,
>> >
>> > Tim
>> >
>> > From: Allison, Timothy B.
>> > Sent: Tuesday, April 07, 2015 8:39 AM
>> > To: 'Stephen Merity'; [email protected]
>> > Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>> >
>> > Stephen,
>> >
>> > Thank you very much for responding so quickly and for all of your work
>> > on Common Crawl. I don't want to speak for all of us, but given the
>> > feedback I've gotten so far from some of the dev communities, I think we
>> > would very much appreciate the chance to be tested on a monthly basis as
>> > part of the regular Common Crawl process.
>> >
>> > I think we'll still want to run more often in our own sandbox(es) on
>> > the slice of CommonCrawl we have, but the monthly testing against new data
>> > would, from my perspective at least, be a huge win for all of us.
>> >
>> > In addition to parsing binaries and extracting text, Tika (via
>> > PDFBox, POI and many others) can also offer metadata (e.g. EXIF from
>> > images), which users of CommonCrawl might find useful.
>> >
>> > I'll forward this to some of the relevant dev lists to invite others
>> > to participate in the discussion on the common-crawl list.
>> >
>> > Thank you, again. I very much look forward to collaborating.
>> >
>> > Best,
>> >
>> > Tim
>> >
>> > From: Stephen Merity [mailto:[email protected]]
>> > Sent: Tuesday, April 07, 2015 3:57 AM
>> > To: [email protected]
>> > Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
>> > Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>> >
>> > Hi Tika team!
>> >
>> > We'd certainly be interested in working with Apache Tika on such an
>> > undertaking. At the very least, we're glad that Julien has provided you with
>> > content to battle-test Tika with!
>> >
>> > As you've noted, the text extraction performed to produce WET files is
>> > focused primarily on HTML files, leaving many other file types not covered.
>> > The existing text extraction is quite efficient and part of the same process
>> > that generates the WAT file, meaning there's next to no overhead. Performing
>> > extraction with Tika at the scale of Common Crawl would be an interesting
>> > challenge.
>> > Running it as a one-off likely wouldn't be too much of a
>> > challenge and would also give Tika the benefit of a wider variety of
>> > documents (both well-formed and malformed) to test against. Running it on a
>> > frequent basis or as part of the crawl pipeline would be more challenging,
>> > but something we can certainly discuss, especially if there's strong
>> > community desire for it!
>> >
>> > On Fri, Apr 3, 2015 at 5:23 AM, <[email protected]> wrote:
>> > CommonCrawl currently has the WET format that extracts plain text from
>> > web pages. My guess is that this is text stripping from text-y formats.
>> > Let me know if I'm wrong!
>> >
>> > Would there be any interest in adding another format, WETT (WET-Tika), or
>> > supplementing the current WET by using Tika to extract contents from binary
>> > formats too: PDF, MSWord, etc.?
>> >
>> > Julien Nioche kindly carved out 220 GB for us to experiment with on
>> > TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
>> > VM. But I'm wondering now if it would make more sense to have CommonCrawl
>> > run Tika as part of its regular process and make the output available in
>> > one of your standard formats.
>> >
>> > CommonCrawl consumers would get Tika output, and the Tika dev community
>> > (including its dependencies, PDFBox, POI, etc.) could get the stack traces
>> > to help prioritize bug fixes.
>> >
>> > Cheers,
>> >
>> > Tim
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "Common Crawl" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> > an email to [email protected].
>> > To post to this group, send email to [email protected].
>> > Visit this group at http://groups.google.com/group/common-crawl.
>> > For more options, visit https://groups.google.com/d/optout.
>> >
>> > --
>> > Regards,
>> > Stephen Merity
>> > Data Scientist @ Common Crawl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
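The download approach discussed in the thread relies on the fact that the Common Crawl URL index points at individual records inside large gzipped archive files on Amazon S3, so a single document can be fetched with one HTTP Range request instead of downloading a whole archive. A minimal Java sketch of that idea follows; the `data.commoncrawl.org` prefix and the `(file, offset, length)` field names are assumptions about the index layout, not details taken from the thread or from the CommonCrawlDocumentDownload tool itself:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class RangeFetch {
    // Public HTTP mirror of the s3://commoncrawl bucket (an assumption; a tool may talk to S3 directly).
    static final String PREFIX = "https://data.commoncrawl.org/";

    /** Build the HTTP Range header value for one record; HTTP byte ranges are inclusive on both ends. */
    static String rangeHeader(long offset, long length) {
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** Fetch and un-gzip a single record located by an index entry (file, offset, length).
     *  Each record is an independently gzipped member, so the range alone decompresses cleanly. */
    static byte[] fetchRecord(String archivePath, long offset, long length) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(PREFIX + archivePath).openConnection();
        conn.setRequestProperty("Range", rangeHeader(offset, length));
        try (InputStream in = new GZIPInputStream(conn.getInputStream());
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Fetching only the needed byte ranges is also why running such a downloader from EC2 helps so much, as the thread notes: the per-record requests go to S3 over Amazon's internal network.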
