Fellow file-philes on [compress],
Sebastian Nagel has added file type identification via Apache Tika to Common
Crawl. While Tika is not 100% accurate, this gives us a far clearer picture of
MIME types than relying on the HTTP header plus file suffix. So, for testing
purposes, you (or we over on Tika) can much more easily gather a small test
corpus of files by MIME type.
Many, many thanks to Sebastian and Common Crawl!
Cheers,
Tim
-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Tuesday, July 4, 2017 6:18 AM
To: [email protected]
Subject: Tika content detection and crawled "remote" content
Hi,
recently I've plugged Tika's content detection into Common Crawl's crawler
(a modified Nutch) with the goal of getting clean and correct MIME types: the
HTTP Content-Type may contain garbage and isn't always correct [1].
For the June 2017 crawl I've prepared a comparison of the content types sent by
the server in the HTTP header and those detected by Tika 1.15 [2]. It shows that
the content types detected by Tika are definitely clean (1,400 different content
types vs. more than 6,000 content type "strings" from HTTP headers).
A look at the "confusions", where Content-Type and Tika differ, shows a mixed
picture: some pairs are plausible, e.g., when Tika changes the type to a more
precise subtype or detects a MIME type at all:
     Count  Tika-1.15              HTTP-Content-Type
1001968023  application/xhtml+xml  text/html
   2298146  application/rss+xml    text/xml
    617435  application/rss+xml    application/xml
    613525  text/html              unk
    361525  application/xhtml+xml  unk
    297707  application/rdf+xml    application/xml
However, there are a few dubious decisions, esp. the group of web server-side
scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
  Count  Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html
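To get a feeling for the size of this group, the rows can be summed per HTTP Content-Type. The following is only a minimal sketch: the rows are copied from the table above, the function names are made up for illustration, and it assumes that the full comparison file [2] uses the same three whitespace-separated columns (I have not verified that).

```python
from collections import Counter

# Rows copied from the table above: count, Tika type, HTTP Content-Type.
ROWS = """\
Tika-1.15 HTTP-Content-Type
2047739 text/x-php text/html
681629 text/asp text/html
193095 text/x-coldfusion text/html
172318 text/aspdotnet text/html
139033 text/x-jsp text/html
38415 text/x-cgi text/html
32092 text/x-php text/xml
18021 text/x-perl text/html
"""


def parse_rows(text):
    """Yield (count, tika_type, http_type) from whitespace-separated lines.

    Assumes three columns per data row; header and blank lines are skipped.
    """
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        yield int(parts[0]), parts[1], parts[2]


def totals_by_http_type(rows):
    """Sum the page counts for each HTTP Content-Type."""
    totals = Counter()
    for count, _tika_type, http_type in rows:
        totals[http_type] += count
    return totals


totals = totals_by_http_type(parse_rows(ROWS))
for http_type, n in totals.most_common():
    print(f"{n:>10}  {http_type}")
```

For the eight rows shown, this adds up to roughly 3.3 million pages that were sent as text/html but detected as a script type.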
Of course, due to misconfigurations some servers may deliver the script files
unmodified, but in general I wouldn't expect this to happen for millions of
pages. I've checked some of the affected URLs:
- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
http://www.privi.com/product-details.asp?cno=C10910011
http://mental-ray.de/Root_alt/Default.asp
http://ekyrs.org/support/index.php?action=profile
http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
- (overlong) comment block at start of HTML which "masks" the HTML declaration
http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
https://de.e-stories.org/categories.php?&lan=nl&art=p
- HTML with some scripting fragments ("<?php?>") present:
http://www.eco-ani-yao.org/shien/
- others are clearly HTML (this looks more like a bug; at least, there is no
simple explanation)
http://www.proedinc.com/customer/content.aspx?redid=9
http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
Obviously, certain file suffixes (.php, .aspx) should get less weight than the
Content-Type sent by the responding server.
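One way to picture such a rule is as a post-processing step on the detection result. The sketch below is not how Tika or Nutch work internally; the function names and the suffix list are assumptions for illustration, and only the script types are taken from the table above. It falls back to the server's Content-Type when the detected type is a server-side script type and the URL carries a script suffix.

```python
import posixpath
from urllib.parse import urlsplit

# Suffixes of server-side scripts; this list is an assumption for illustration.
SCRIPT_SUFFIXES = {".php", ".asp", ".aspx", ".jsp", ".cfm", ".cgi", ".pl"}

# Script types as they appear in the confusion table above.
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}


def base_type(content_type):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def resolve_mime(url, http_content_type, detected):
    """Prefer the server's type over a suffix-driven script detection.

    Hypothetical rule: if the detected type is a script type and the URL
    path ends in a script suffix, trust the HTTP Content-Type as long as
    it looks like a type/subtype pair. Otherwise keep the detected type.
    """
    suffix = posixpath.splitext(urlsplit(url).path)[1].lower()
    served = base_type(http_content_type or "")
    if detected in SCRIPT_TYPES and suffix in SCRIPT_SUFFIXES and "/" in served:
        return served
    return detected


print(resolve_mime(
    "http://www.privi.com/product-details.asp?cno=C10910011",
    "text/html; charset=utf-8",  # example header value, not taken from the crawl
    "text/asp",
))
```

A rule like this would leave the plausible refinements (e.g. text/xml to application/rss+xml) untouched, because they involve neither a script type nor a script suffix.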
Now my question: where's the best place to fix this: in the crawler [3] or in
Tika?
If anyone is interested in using the detected MIME types or anything else from
Common Crawl - I'm happy to help! The URL index [4] now contains a new field
"mime-detected", which makes it easy to search or grep for confusion pairs.
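As a rough sketch of such a search: assuming the index records are available as one JSON object per line, and that the HTTP Content-Type is stored in a field called "mime" (that field name is an assumption here, only "mime-detected" is mentioned above), confusion pairs could be counted like this. The sample records are invented for illustration.

```python
import json
from collections import Counter


def confusion_pairs(lines, detected_field="mime-detected", http_field="mime"):
    """Count (detected, served) pairs where the two MIME types differ.

    `lines` is any iterable of JSON strings, one index record per line.
    Records missing either field are ignored.
    """
    pairs = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        detected = record.get(detected_field)
        served = record.get(http_field)
        if detected and served and detected != served:
            pairs[(detected, served)] += 1
    return pairs


# Invented sample records, only to show the expected input shape.
SAMPLE = [
    '{"url": "http://example.org/a.php", "mime": "text/html", "mime-detected": "text/x-php"}',
    '{"url": "http://example.org/b.html", "mime": "text/html", "mime-detected": "text/html"}',
    '{"url": "http://example.org/c.php", "mime": "text/html", "mime-detected": "text/x-php"}',
]

pairs = confusion_pairs(SAMPLE)
for (detected, served), n in pairs.most_common():
    print(n, detected, served)
```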
Thanks and best,
Sebastian
[1] https://github.com/commoncrawl/nutch/issues/3
[2]
s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3]
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/