Please try again with the latest version of the project; hopefully I fixed this at https://github.com/centic9/CommonCrawlDocumentDownload now.
Thanks... Dominik.

On Mon, Jun 1, 2015 at 6:32 PM, Dominik Stadler <[email protected]> wrote:
> That's likely on my side, sorry, I'll take a look....
>
> Dominik
>
> On 01.06.2015 16:51, "Allison, Timothy B." <[email protected]> wrote:
>>
>> Dominik,
>>    Thank you for making this available! I'm trying to build/run now, and
>> I'm getting this... is this user error?
>>
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:20: error: package org.dstadler.commons.testing does not exist
>> import org.dstadler.commons.testing.MockRESTServer;
>>                                    ^
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:21: error: package org.dstadler.commons.testing does not exist
>> import org.dstadler.commons.testing.TestHelpers;
>>                                    ^
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/ExtensionsTest.java:31: error: package org.dstadler.commons.testing does not exist
>> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Extensions.class);
>> ^
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
>>      ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:158: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_OK, "text/plain", "Ok")) {
>>                                  ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
>>      ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:171: error: cannot find symbol
>> try (MockRESTServer server = new MockRESTServer(NanoHTTPD.HTTP_INTERNALERROR, "text/plain", "Ok")) {
>>                                  ^
>>   symbol:   class MockRESTServer
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:179: error: cannot find symbol
>> TestHelpers.assertContains(e, "500", "localhost", Integer.toString(server.getPort()));
>> ^
>>   symbol:   variable TestHelpers
>>   location: class UtilsTest
>> /dev/commoncrawl-download/CommonCrawlDocumentDownload/src/test/java/org/dstadler/commoncrawl/UtilsTest.java:205: error: package org.dstadler.commons.testing does not exist
>> org.dstadler.commons.testing.PrivateConstructorCoverage.executePrivateConstructor(Utils.class);
>> ^
>> 9 errors
>> :compileTestJava FAILED
>>
>> -----Original Message-----
>> From: Dominik Stadler [mailto:[email protected]]
>> Sent: Wednesday, April 22, 2015 4:07 PM
>> To: POI Developers List
>> Cc: [email protected]; [email protected]; [email protected]
>> Subject: Re: [COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?
>>
>> Hi,
>>
>> I have now published a first version of a tool to download binary data
>> of certain file types from the Common Crawl URL Index. Currently it
>> only supports the previous index format, so the data is from around
>> 2012/2013, but this still provides tons of files for mass-testing of
>> our frameworks.
>>
>> I used a small part of the files to run some integration testing
>> locally and immediately found a few issues where specially formatted
>> files broke Apache POI.
>>
>> The project is currently available at
>> https://github.com/centic9/CommonCrawlDocumentDownload; it has options
>> for downloading files as well as first retrieving a list of all
>> interesting files and then downloading them later. But it should also
>> be easily possible to change it so it processes the files on the fly
>> (if you want to spare the estimated >300 GB of disk space it will need,
>> for example, to store the files interesting for Apache POI testing).
>>
>> Naturally, running this on Amazon EC2 machines can speed up the
>> downloading a lot, as the network access to Amazon S3 is much faster there.
>>
>> Please give it a try if you are interested and let me know what you think.
>>
>> Dominik.
>>
>> On Tue, Apr 7, 2015 at 2:48 PM, Allison, Timothy B. <[email protected]> wrote:
>> > All,
>> >
>> > We just heard back from a very active member of Common Crawl. I don't
>> > want to clog up our dev lists with this discussion (more than I have!), but
>> > I do want to invite all to participate in the discussion, planning and
>> > potential patches.
>> >
>> > If you'd like to participate, please join us here:
>> > https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>> >
>> > I've tried to follow Commons' vernacular, and I've added [COMPRESS] to
>> > the Subject line. Please invite others who might have an interest in this
>> > work.
>> >
>> > Best,
>> >
>> > Tim
>> >
>> > From: Allison, Timothy B.
>> > Sent: Tuesday, April 07, 2015 8:39 AM
>> > To: 'Stephen Merity'; [email protected]
>> > Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?
>> >
>> > Stephen,
>> >
>> > Thank you very much for responding so quickly and for all of your work
>> > on Common Crawl. I don't want to speak for all of us, but given the
>> > feedback I've gotten so far from some of the dev communities, I think we
>> > would very much appreciate the chance to be tested on a monthly basis as
>> > part of the regular Common Crawl process.
>> >
>> > I think we'll still want to run more often in our own sandbox(es) on
>> > the slice of CommonCrawl we have, but the monthly testing against new data
>> > would, from my perspective at least, be a huge win for all of us.
>> >
>> > In addition to parsing binaries and extracting text, Tika (via
>> > PDFBox, POI and many others) can also offer metadata (e.g. EXIF from
>> > images), which users of CommonCrawl might find useful.
>> >
>> > I'll forward this to some of the relevant dev lists to invite others
>> > to participate in the discussion on the common-crawl list.
>> >
>> > Thank you, again. I very much look forward to collaborating.
>> >
>> > Best,
>> >
>> > Tim
>> >
>> > From: Stephen Merity [mailto:[email protected]]
>> > Sent: Tuesday, April 07, 2015 3:57 AM
>> > To: [email protected]
>> > Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
>> > Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?
>> >
>> > Hi Tika team!
>> >
>> > We'd certainly be interested in working with Apache Tika on such an
>> > undertaking. At the very least, we're glad that Julien has provided you with
>> > content to battle-test Tika with!
>> >
>> > As you've noted, the text extraction performed to produce WET files is
>> > focused primarily on HTML files, leaving many other file types not covered.
>> > The existing text extraction is quite efficient and part of the same process
>> > that generates the WAT file, meaning there's next to no overhead. Performing
>> > extraction with Tika at the scale of Common Crawl would be an interesting
>> > challenge.
>> > Running it as a one-off likely wouldn't be too much of a
>> > challenge and would also give Tika the benefit of a wider variety of
>> > documents (both well-formed and malformed) to test against. Running it on a
>> > frequent basis or as part of the crawl pipeline would be more challenging,
>> > but something we can certainly discuss, especially if there's strong
>> > community desire for it!
>> >
>> > On Fri, Apr 3, 2015 at 5:23 AM, <[email protected]> wrote:
>> > CommonCrawl currently has the WET format that extracts plain text from
>> > web pages. My guess is that this is text stripping from text-y formats.
>> > Let me know if I'm wrong!
>> >
>> > Would there be any interest in adding another format, WETT (WET-Tika), or
>> > supplementing the current WET by using Tika to extract contents from binary
>> > formats too: PDF, MSWord, etc.?
>> >
>> > Julien Nioche kindly carved out 220 GB for us to experiment with on
>> > TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a Rackspace
>> > VM. But I'm wondering now if it would make more sense to have CommonCrawl
>> > run Tika as part of its regular process and make the output available in
>> > one of your standard formats.
>> >
>> > CommonCrawl consumers would get Tika output, and the Tika dev community
>> > (including its dependencies, PDFBox, POI, etc.) could get the stack traces
>> > to help prioritize bug fixes.
>> >
>> > Cheers,
>> >
>> > Tim
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "Common Crawl" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> > an email to [email protected].
>> > To post to this group, send email to [email protected].
>> > Visit this group at http://groups.google.com/group/common-crawl.
>> > For more options, visit https://groups.google.com/d/optout.
>> >
>> > --
>> > Regards,
>> > Stephen Merity
>> > Data Scientist @ Common Crawl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
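The download approach discussed in the thread relies on the fact that the Common Crawl URL index points at individual records inside large gzipped archive files on Amazon S3, so a single document can be fetched with one HTTP Range request instead of downloading a whole archive. A minimal Java sketch of that idea follows; the `data.commoncrawl.org` prefix and the `(file, offset, length)` field names are assumptions about the index layout, not details taken from the thread or from the CommonCrawlDocumentDownload tool itself:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class RangeFetch {
    // Public HTTP mirror of the s3://commoncrawl bucket (an assumption; a tool may talk to S3 directly).
    static final String PREFIX = "https://data.commoncrawl.org/";

    /** Build the HTTP Range header value for one record; HTTP byte ranges are inclusive on both ends. */
    static String rangeHeader(long offset, long length) {
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** Fetch and un-gzip a single record located by an index entry (file, offset, length).
     *  Each record is an independently gzipped member, so the range alone decompresses cleanly. */
    static byte[] fetchRecord(String archivePath, long offset, long length) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(PREFIX + archivePath).openConnection();
        conn.setRequestProperty("Range", rangeHeader(offset, length));
        try (InputStream in = new GZIPInputStream(conn.getInputStream());
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Fetching only the needed byte ranges is also why running such a downloader from EC2 helps so much, as the thread notes: the per-record requests go to S3 over Amazon's internal network.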
