Dominik,
Thank you for making this available! I'm trying to build/run now, and I'm
getting this...is this user error?
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:20: error: package org.dstadler.commons.testing does not exist
import org.dstadler.commons.testing.MockRESTServer;
^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:21: error: package org.dstadler.commons.testing does not exist
import org.dstadler.commons.testing.TestHelpers;
^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/ExtensionsTest.java:31: error: package org.dstadler.commons.testing does not exist
org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Extensions.class);
^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:179: error: cannot find symbol
TestHelpers.assertContains(e, "500", "localhost", Integer.toString(server.getPort()));
^
  symbol: variable TestHelpers
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:205: error: package org.dstadler.commons.testing does not exist
org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Utils.class);
^
9 errors
:compileTestJava FAILED
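My first guess is that I'm just missing a test-only dependency: all nine
errors come from the org.dstadler.commons.testing helpers (MockRESTServer,
TestHelpers, PrivateConstructorCoverage). If those live in a separately
published artifact, I'd expect the fix to be something along these lines in
build.gradle (the coordinates and version below are only my guess, not taken
from your build):

    dependencies {
        // guessed coordinates for the artifact providing org.dstadler.commons.testing
        testCompile 'org.dstadler.commons:commons-test:1.+'
    }

Or, if that artifact isn't published anywhere yet, I'd need to build and
install it locally first.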
-----Original Message-----
From: Dominik Stadler [mailto:[email protected]]
Sent: Wednesday, April 22, 2015 4:07 PM
To: POI Developers List
Cc: [email protected]; [email protected]; [email protected]
Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika as
part of CommonCrawl?
Hi,
I have now published a first version of a tool to download binary data
of certain file types from the Common Crawl URL Index. Currently it
only supports the previous index format, so the data is from around
2012/2013, but this also provides tons of files for mass-testing of
our frameworks.
I used a small part of the files to run some integration testing
locally and immediately found a few issues where specially formatted
files broke Apache POI.
The project is currently available at
https://github.com/centic9/CommonCrawlDocumentDownload, it has options
for downloading files as well as first retrieving a list of all
interesting files and then downloading them later. But it should also
be easy to change it to process the files on-the-fly (if you want to
save the estimated >300 GB of disk space it would need, for example,
to store the files of interest for Apache POI testing).
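As a very rough sketch (none of these names exist in the project, it is
just an illustration of the idea), on-the-fly processing would mean
handing the open stream to a pluggable handler instead of writing it to
a file:

    import java.io.InputStream;
    import java.net.URL;

    public class OnTheFlySketch {

        // Hypothetical callback: implementations could feed the stream
        // straight into Tika or POI for mass-testing instead of storing it.
        interface DocumentHandler {
            void handle(String url, InputStream stream) throws Exception;
        }

        static void download(String url, DocumentHandler handler) throws Exception {
            try (InputStream stream = new URL(url).openStream()) {
                handler.handle(url, stream);
            }
        }

        public static void main(String[] args) throws Exception {
            // Placeholder URL standing in for one entry from the URL index.
            download("https://example.org/some-document.doc", (url, stream) -> {
                byte[] buffer = new byte[8192];
                long total = 0;
                for (int n; (n = stream.read(buffer)) != -1; ) {
                    total += n;
                }
                System.out.println(url + ": " + total + " bytes processed without touching disk");
            });
        }
    }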
Naturally, running this on Amazon EC2 machines can speed up the
downloading a lot, since network access to Amazon S3 is much faster
from there.
Please give it a try if you are interested and let me know what you think.
Dominik.
On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <[email protected]> wrote:
> All,
>
> We just heard back from a very active member of Common Crawl. I don’t want
> to clog up our dev lists with this discussion (more than I have!), but I do
> want to invite all to participate in the discussion, planning and potential
> patches.
>
> If you’d like to participate, please join us here:
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
> I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the
> Subject line. Please invite others who might have an interest in this work.
>
> Best,
>
> Tim
>
> From: Allison, Timothy B.
> Sent: Tuesday, April 07, 2015 8:39 AM
> To: 'Stephen Merity'; [email protected]
> Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>
> Stephen,
>
> Thank you very much for responding so quickly and for all of your work on
> Common Crawl. I don’t want to speak for all of us, but given the feedback
> I’ve gotten so far from some of the dev communities, I think we would very
> much appreciate the chance to be tested on a monthly basis as part of the
> regular Common Crawl process.
>
> I think we’ll still want to run more often in our own sandbox(es) on the
> slice of CommonCrawl we have, but the monthly testing against new data, from
> my perspective at least, would be a huge win for all of us.
>
> In addition to parsing binaries and extracting text, Tika (via PDFBox, POI
> and many others) can also offer metadata (e.g. exif from images), which users
> of CommonCrawl might find of use.
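>
> As a rough illustration (the file name below is only a placeholder),
> dumping the metadata Tika finds in a single file looks roughly like this:
>
>     import java.io.InputStream;
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>
>     import org.apache.tika.metadata.Metadata;
>     import org.apache.tika.parser.AutoDetectParser;
>     import org.apache.tika.parser.ParseContext;
>     import org.apache.tika.sax.BodyContentHandler;
>
>     public class MetadataDump {
>         public static void main(String[] args) throws Exception {
>             Metadata metadata = new Metadata();
>             try (InputStream in = Files.newInputStream(Paths.get("sample.jpg"))) {
>                 // AutoDetectParser picks the right parser (image, PDF, Office, ...)
>                 new AutoDetectParser().parse(in, new BodyContentHandler(-1),
>                         metadata, new ParseContext());
>             }
>             // For images this includes the EXIF fields extracted by the parser
>             for (String name : metadata.names()) {
>                 System.out.println(name + " = " + metadata.get(name));
>             }
>         }
>     }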
>
> I’ll forward this to some of the relevant dev lists to invite others to
> participate in the discussion on the common-crawl list.
>
>
> Thank you, again. I very much look forward to collaborating.
>
> Best,
>
> Tim
>
> From: Stephen Merity [mailto:[email protected]]
> Sent: Tuesday, April 07, 2015 3:57 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>
> Hi Tika team!
>
> We'd certainly be interested in working with Apache Tika on such an
> undertaking. At the very least, we're glad that Julien has provided you with
> content to battle test Tika with!
>
> As you've noted, the text extraction performed to produce WET files is
> focused primarily on HTML files, leaving many other file types not covered.
> The existing text extraction is quite efficient and part of the same process
> that generates the WAT file, meaning there's next to no overhead. Performing
> extraction with Tika at the scale of Common Crawl would be an interesting
> challenge. Running it as a one-off likely wouldn't be too much of a
> challenge and would also give Tika the benefit of a wider variety of
> documents (both well formed and malformed) to test against. Running it on a
> frequent basis or as part of the crawl pipeline would be more challenging but
> something we can certainly discuss, especially if there's strong community
> desire for it!
>
> On Fri, Apr 3, 2015 at 5:23 AM, <[email protected]> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages. My guess is that this is text stripping from text-y formats. Let me
> know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.
> But, I'm wondering now if it would make more sense to have CommonCrawl run
> Tika as part of its regular process and make the output available in one of
> your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to
> help prioritize bug fixes.
>
> Cheers,
>
> Tim
>
>
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---------------------------------------------------------------------