Hi Markus,

Thanks for the FYI. Any idea of the specific percentages for those unwanted
suffixes compared to the size of the entire corpus?

Cheers,
Chris

On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:

> We only crawl HTML and PDF files for a lot of ccTLDs, so we only have data on
> those two. However, we also explicitly filter out all/most unwanted suffixes.
> We do have a lot of suffixes that we have encountered so far.
> 
> On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
>> (sorry for the cross post)
>> 
>> Hey Guys,
>> 
>> I'm trying to find a good citation or estimate (if anyone has done one)
>> of the breakdown (by % or some other metric) of non-HTML content types
>> out there on the web (from a whole-web crawl or a meaningful
>> representative dataset).
>> 
>> Anyone have any ideas about this?
>> 
>> Thanks!
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattm...@nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> -- 
> Markus Jelsma - CTO - Openindex
