Hi,

similar to Dominik's approach of checking the file base for parsing errors, I'd like to scan for certain file constellations, for example the typical "left over bytes" error or other record combinations which I can't reproduce with my MS Office/LibreOffice versions.
I haven't thought about how to actually do it yet, but I think logging the file location in the integration tests and manually checking the corresponding files afterwards should be sufficient.

Best wishes,
Andi

On 03.04.2015 17:51, Dominik Stadler wrote:
> Hi,
>
> I am very interested as I am following the Common Crawl activity for
> some time already. It sounds like a neat idea to do the check already
> when the crawl is done, are the binary documents already part of the
> crawl-data?
>
> ...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. <[email protected]>
> wrote:
>> All,
>>
>> What do you think?
