+1 this makes immense sense to me. Thanks Juls and Tim.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 3, 2015 at 5:35 AM
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

>All,
> What do we think?
>
>On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
>
>CommonCrawl currently has the WET format that extracts plain text from
>web pages. My guess is that this is text stripping from text-y formats.
>Let me know if I'm wrong!
>
>Would there be any interest in adding another format: WETT (WET-Tika) or
>supplementing the current WET by using Tika to extract contents from
>binary formats too: PDF, MSWord, etc.
>
>Julien Nioche kindly carved out 220 GB for us to experiment with on
>TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a
>Rackspace vm. But, I'm wondering now if it would make more sense to have
>CommonCrawl run Tika as part of its regular process and make the output
>available in one of your standard formats.
>
>CommonCrawl consumers would get Tika output, and the Tika dev community
>(including its dependencies, PDFBox, POI, etc.) could get the stacktraces
>to help prioritize bug fixes.
>
>Cheers,
>
> Tim
