[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176934#comment-14176934 ]
Andrew Jackson commented on TIKA-1302:
--------------------------------------

I have 2,358,167 errors from one collection (2 billion resources), but the majority are SAXParseExceptions. It's made up of UK web archive content from 1996-2010, so there's lots of broken HTML/XML in there.

If I strip out the SAXParseExceptions, there are just 317,548 miscellaneous errors, which are perhaps more interesting.

Here's an example including the SAX exceptions:

{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100713041445,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=2737187,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20091017141202,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=34830/crti=4/hotel-pictures,"org.xml.sax.SAXParseException: Open quote is expected for attribute ""ID"" associated with an element type ""COMMENT""."
20091017143741,http://www.madfun.co.uk:80/-10?ref=31,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20061020021825,http://reservations.talkingcities.co.uk:80/nexres/hotels/map_hotels.cgi?hid=10055548&map_only=yes&type=overview,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20061020022224,http://www.ravensportal.co.uk:80/forum/index.php?PHPSESSID=1688184d9bb881cfc73600b1670ecaf5&type=rss;action=.xml,org.xml.sax.SAXParseException: The character reference must end with the ';' delimiter.
20101227142905,http://www.etc-online.co.uk:80/style4.asp?pn=courses&sn=26,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20060926015856,http://www.qca.org.uk/4412.html,"org.xml.sax.SAXParseException: The entity ""nbsp"" was referenced\, but not declared."
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,java.lang.ArrayIndexOutOfBoundsException: -1
20030124193820,http://www.mgcars.org.uk:80/cgi-bin/gen5?runprog=porter&cov=&mode=buy&o=4854130936&code=9123&cu=&,"org.xml.sax.SAXParseException: The element type ""META"" must be terminated by the matching end-tag ""</META>""."
20100121205831,http://www.epupz.co.uk:80/clas/viewdetails.asp?view=307389,org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
{code}

...and for the others...

{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,application/rss+xml,java.lang.NullPointerException: null
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
20060303154606,http://www.dfes.gov.uk:80/rsgateway/DB/SFR/s000286/sfr37-2001.doc,562004,application/msword,java.lang.IllegalArgumentException: Position 698368 past the end of the file
20041225033311,http://members.lycos.co.uk:80/worldofradio/distance.pdf,57891,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,"java.io.IOException: Error: Expected a long type\, actual='25#0/'"
20041121095849,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/SER2549.pdf,157148,application/pdf,java.util.zip.DataFormatException: oversubscribed literal/length tree
20041121100005,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/MSV_Foreword.pdf,12773,application/pdf,java.util.zip.DataFormatException: oversubscribed dynamic bit lengths tree
20060925090249,http://www2.rgu.ac.uk/library_edocs/resource/exam/0405engineering/EN3581%20OFFSHORE%20ENGINEERING.pdf,1684742,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20060925091406,http://www2.rgu.ac.uk/library_edocs/resource/exam/0304engineering/EE31060304s1.pdf,149238,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20040612212128,http://www.swhst.org.uk:80/Linked%20Files/spr%20contact%20addresses.xls,23040,application/vnd.ms-excel,org.apache.poi.EncryptedDocumentException: Default password is invalid for docId/saltData/saltHash
20051111183952,http://freeweb.co.uk:80/show_nw.php?ref=258&target=B&show=aff&PHPSESSID=a150a130c58fcea048866fb965ef7dfb,232436,text/html; charset=iso-8859-1,org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
20071025140555,http://www.honleyhigh.kirklees.sch.uk/MFL/MFL_Links/PowerPoint%20Presentations/German/Geryear-9-future-tense.ppt,2664960,application/vnd.ms-powerpoint,"org.apache.poi.hslf.exceptions.OldPowerPointFormatException: Based on the Current User stream\, you seem to have supplied a PowerPoint95 file\, which isn't supported"
20071207004337,http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt,155136,application/vnd.ms-powerpoint,java.lang.ArrayIndexOutOfBoundsException: 20
{code}

The first two columns identify the item. The next two are the size of the item in bytes and the result of using Tika to identify the format (.detect only, no parse). The last column contains the first line of the parse exception(s).

Note that to download an original item, you can get it from the Internet Archive using this template:

{code:none}
http://web.archive.org/web/{wayback_date}/{url}
{code}

i.e. for the last exception listed above, you can download the item at:

http://web.archive.org/web/20071207004337/http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt

It might take me a while to generate the full output for the 2.3 million, so I'll try to pull out the 300 thousand other errors first. Our Solr index is having some performance issues, so it might be a bit slow.
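For anyone wanting to work through a listing like this, here is a rough Python sketch (not part of the pipeline that produced the data) that reads the five-column form, fills in the Wayback template for each row, and tallies errors by exception class. The function names `read_rows`, `wayback_url` and `exception_class` are made up for illustration, and it assumes every row has all five columns, which is not true of the shorter three-column SAX listing.

```python
import csv
import io
from collections import Counter

# Download template quoted in the comment above.
WAYBACK_TEMPLATE = "http://web.archive.org/web/{wayback_date}/{url}"


def wayback_url(wayback_date, url):
    """Fill in the Wayback template for one item."""
    return WAYBACK_TEMPLATE.format(wayback_date=wayback_date, url=url)


def exception_class(parse_error):
    """Return the text before the first ': ', i.e. the exception class name."""
    return parse_error.split(": ", 1)[0]


def read_rows(text):
    """Yield one dict per data row (assumes the five-column header)."""
    for row in csv.DictReader(io.StringIO(text)):
        # The listing writes commas inside quoted messages as "\,"; undo that.
        row["parse_error"] = row["parse_error"].replace("\\,", ",")
        yield row


# Three rows copied from the listing above (raw string keeps the "\," as-is).
SAMPLE = r'''wayback_date,url,content_length,content_type_tika,parse_error
20071207004337,http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt,155136,application/vnd.ms-powerpoint,java.lang.ArrayIndexOutOfBoundsException: 20
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,"java.io.IOException: Error: Expected a long type\, actual='25#0/'"
'''

rows = list(read_rows(SAMPLE))

# Count how often each exception class appears.
counts = Counter(exception_class(r["parse_error"]) for r in rows)

for r in rows:
    print(wayback_url(r["wayback_date"], r["url"]))
for name, n in counts.most_common():
    print(n, name)
```

Grouping on the class name is only a first cut; it would lump together, say, every IOException regardless of message, so a finer breakdown would need the message text as well.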
> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Components: cli, general, server
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)