Hi Markus,

Thanks for the FYI. Any idea of the specific percentages for those unwanted suffixes relative to the size of the entire corpus?

Cheers,
Chris

On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:

> We only crawl HTML and PDF files for a lot of cc-TLDs so we only have data on
> those two. However, we also explicitly filter out all/most unwanted suffixes.
> We do have a lot of suffixes that we encountered so far.
>
> On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
>> (sorry for the cross post)
>>
>> Hey Guys,
>>
>> I'm trying to find a good citation or estimate (if anyone has done one)
>> that estimates the breakout (by % or some other metric) of content types
>> out there on the web (with a whole web crawl or a meaningful
>> representative dataset) that are non HTML.
>>
>> Anyone have any ideas about this?
>>
>> Thanks!
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattm...@nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> --
> Markus Jelsma - CTO - Openindex