Dominik,
Thank you for making this available! I'm trying to build/run now, and I'm
getting this...is this user error?
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:20: error: package org.dstadler.commons.testing does not exist
import org.dstadler.commons.testing.MockRESTServer;
^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:21: error: package org.dstadler.commons.testing does not exist
import org.dstadler.commons.testing.TestHelpers;
^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/ExtensionsTest.java:31: error: package org.dstadler.commons.testing does not exist
org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Extensions.class);
^
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
^
  symbol: class MockRESTServer
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:179: error: cannot find symbol
TestHelpers.assertContains(e, "500", "localhost", Integer.toString(server.getPort()));
^
  symbol: variable TestHelpers
  location: class UtilsTest
/dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:205: error: package org.dstadler.commons.testing does not exist
org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Utils.class);
^
9 errors
:compileTestJava FAILED
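My first guess is that I'm just missing a test-only dependency: all nine
errors come from the org.dstadler.commons.testing helpers (MockRESTServer,
TestHelpers, PrivateConstructorCoverage). If those live in a separately
published artifact, I'd expect the fix to be something along these lines in
build.gradle (the coordinates and version below are only my guess, not taken
from your build):

    dependencies {
        // guessed coordinates for the artifact providing org.dstadler.commons.testing
        testCompile 'org.dstadler.commons:commons-test:1.+'
    }

Or, if that artifact isn't published anywhere yet, I'd need to build and
install it locally first.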
-----Original Message-----
From: Dominik Stadler [mailto:[email protected]]
Sent: Wednesday, April 22, 2015 4:07 PM
To: POI Developers List
Cc: [email protected]; [email protected]; [email protected]
Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika as
part of CommonCrawl?
Hi,
I have now published a first version of a tool to download binary data
of certain file types from the Common Crawl URL Index. Currently it
only supports the previous index format, so the data is from around
2012/2013, but this also provides tons of files for mass-testing of
our frameworks.
I used a small part of the files to run some integration testing
locally and immediately found a few issues where specially formatted
files broke Apache POI.
The project is currently available at
https://github.com/centic9/CommonCrawlDocumentDownload, it has options
for downloading files as well as first retrieving a list of all
interesting files and then downloading them later. But it should also
be easy to change it to process the files on-the-fly (if you want to
save the estimated >300 GB of disk space it would need, for example,
to store the files of interest for Apache POI testing).
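As a very rough sketch (none of these names exist in the project, it is
just an illustration of the idea), on-the-fly processing would mean
handing the open stream to a pluggable handler instead of writing it to
a file:

    import java.io.InputStream;
    import java.net.URL;

    public class OnTheFlySketch {

        // Hypothetical callback: implementations could feed the stream
        // straight into Tika or POI for mass-testing instead of storing it.
        interface DocumentHandler {
            void handle(String url, InputStream stream) throws Exception;
        }

        static void download(String url, DocumentHandler handler) throws Exception {
            try (InputStream stream = new URL(url).openStream()) {
                handler.handle(url, stream);
            }
        }

        public static void main(String[] args) throws Exception {
            // Placeholder URL standing in for one entry from the URL index.
            download("https://example.org/some-document.doc", (url, stream) -> {
                byte[] buffer = new byte[8192];
                long total = 0;
                for (int n; (n = stream.read(buffer)) != -1; ) {
                    total += n;
                }
                System.out.println(url + ": " + total + " bytes processed without touching disk");
            });
        }
    }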
Naturally, running this on Amazon EC2 machines can speed up the
downloading a lot, since network access to Amazon S3 is much faster
from there.
Please give it a try if you are interested and let me know what you think.
Dominik.
On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <[email protected]> wrote:
> All,
>
> We just heard back from a very active member of Common Crawl. I don’t want
> to clog up our dev lists with this discussion (more than I have!), but I do
> want to invite all to participate in the discussion, planning and potential
> patches.
>
> If you’d like to participate, please join us here:
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
> I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the
> Subject line. Please invite others who might have an interest in this work.
>
> Best,
>
> Tim
>
> From: Allison, Timothy B.
> Sent: Tuesday, April 07, 2015 8:39 AM
> To: 'Stephen Merity'; [email protected]
> Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>
> Stephen,
>
> Thank you very much for responding so quickly and for all of your work on
> Common Crawl. I don’t want to speak for all of us, but given the feedback
> I’ve gotten so far from some of the dev communities, I think we would very
> much appreciate the chance to be tested on a monthly basis as part of the
> regular Common Crawl process.
>
> I think we’ll still want to run more often in our own sandbox(es) on the
> slice of CommonCrawl we have, but the monthly testing against new data, from
> my perspective at least, would be a huge win for all of us.
>
> In addition to parsing binaries and extracting text, Tika (via PDFBox, POI
> and many others) can also offer metadata (e.g. exif from images), which users
> of CommonCrawl might find of use.
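>
> As a rough illustration (the file name below is only a placeholder),
> dumping the metadata Tika finds in a single file looks roughly like this:
>
>     import java.io.InputStream;
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>
>     import org.apache.tika.metadata.Metadata;
>     import org.apache.tika.parser.AutoDetectParser;
>     import org.apache.tika.parser.ParseContext;
>     import org.apache.tika.sax.BodyContentHandler;
>
>     public class MetadataDump {
>         public static void main(String[] args) throws Exception {
>             Metadata metadata = new Metadata();
>             try (InputStream in = Files.newInputStream(Paths.get("sample.jpg"))) {
>                 // AutoDetectParser picks the right parser (image, PDF, Office, ...)
>                 new AutoDetectParser().parse(in, new BodyContentHandler(-1),
>                         metadata, new ParseContext());
>             }
>             // For images this includes the EXIF fields extracted by the parser
>             for (String name : metadata.names()) {
>                 System.out.println(name + " = " + metadata.get(name));
>             }
>         }
>     }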
>
> I’ll forward this to some of the relevant dev lists to invite others to
> participate in the discussion on the common-crawl list.
>
>
> Thank you, again. I very much look forward to collaborating.
>
> Best,
>
> Tim
>
> From: Stephen Merity [mailto:[email protected]]
> Sent: Tuesday, April 07, 2015 3:57 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>
> Hi Tika team!
>
> We'd certainly be interested in working with Apache Tika on such an
> undertaking. At the very least, we're glad that Julien has provided you with
> content to battle test Tika with!
>
> As you've noted, the text extraction performed to produce WET files is
> focused primarily on HTML files, leaving many other file types not covered.
> The existing text extraction is quite efficient and part of the same process
> that generates the WAT file, meaning there's next to no overhead. Performing
> extraction with Tika at the scale of Common Crawl would be an interesting
> challenge. Running it as a one-off likely wouldn't be too much of a
> challenge and would also give Tika the benefit of a wider variety of
> documents (both well formed and malformed) to test against. Running it on a
> frequent basis or as part of the crawl pipeline would be more challenging but
> something we can certainly discuss, especially if there's strong community
> desire for it!
>
> On Fri, Apr 3, 2015 at 5:23 AM, <[email protected]> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages. My guess is that this is text stripping from text-y formats. Let me
> know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.?
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace vm.
> But, I'm wondering now if it would make more sense to have CommonCrawl run
> Tika as part of its regular process and make the output available in one of
> your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to
> help prioritize bug fixes.
>
> Cheers,
>
> Tim
>
>
>
> --
> Regards,
> Stephen Merity
> Data Scientist @ Common Crawl
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---------------------------------------------------------------------