Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Nicholas Chammas
That's a neat idea. I'll try that out.

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Patrick Wendell
I think there are a few ways to do this... the simplest one might be to manually build a set of comma-separated paths that excludes the bad file, and pass that to textFile(). When you call textFile(), under the hood it is going to pass your filename string to hadoopFile(), which calls setInputPaths().
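A minimal sketch of Patrick's suggestion: join the good paths into one comma-separated string and hand that to sc.textFile(). The bucket paths and the excluded file below are made-up placeholders, not files from this thread, and `sc` is assumed to be an existing SparkContext.

```python
def build_input_paths(paths, bad):
    """Return a comma-separated path string with the known-bad files excluded."""
    return ",".join(p for p in paths if p not in bad)

# Hypothetical file listing; in practice you would list the keys from S3.
paths = [
    "s3n://bucket/2014-05-01/part-0001.gz",
    "s3n://bucket/2014-05-01/part-0002.gz",
    "s3n://bucket/2014-05-02/part-0001.gz",
]
bad = {"s3n://bucket/2014-05-01/part-0002.gz"}

path_string = build_input_paths(paths, bad)

# In a Spark job (sc is an existing SparkContext):
# sc.textFile(path_string).count()
```

textFile() accepts a comma-separated list of paths because it is ultimately fed to Hadoop's input-path configuration, which splits on commas.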

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-30 Thread Nicholas Chammas
YES, your hunches were correct. I’ve identified at least one file among the hundreds I’m processing that is indeed not a valid gzip file. Does anyone know of an easy way to exclude a specific file or files when calling sc.textFile() on a pattern? E.g., something like: sc.textFile('s3n://bucket/stuf

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Nicholas Chammas
Thanks for the suggestions, people. I will try to home in on which specific gzipped files, if any, are actually corrupt. Michael, I’m using Hadoop 1.0.4, which I believe is the default version that gets deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Andrew Ash
One thing you can try is to pull each file out of S3 and decompress with "gzip -d" to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob. Andrew
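Andrew's "gzip -d" check can be done without shelling out, by streaming each file through Python's stdlib gzip module and seeing whether it decompresses cleanly. This is a sketch for locally downloaded copies of the S3 files; the function name is my own, not from the thread.

```python
import gzip

def is_valid_gzip(path):
    """Return True if the file decompresses cleanly end to end, else False."""
    try:
        with gzip.open(path, "rb") as f:
            # Stream through the whole file so truncation and CRC errors
            # at the end of the stream are caught too.
            while f.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError):
        # Bad magic bytes, corrupt stream, or file cut off mid-stream.
        return False
```

Running this over every downloaded file and printing the ones that return False should surface the same culprits that make Spark's count() blow up.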

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Michael Cutler
Hi Nick, Which version of Hadoop are you using with Spark? I spotted an issue with the built-in GzipDecompressor while doing something similar with Hadoop 1.0.4: all my Gzip files were valid and tested, yet certain files blew up from Hadoop/Spark. The following JIRA ticket goes into more detail

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Madhu
Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a "bad header" error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Yes, it does work with fewer GZipped files. I am reading the files in using sc.textFile() and a pattern string. For example: a = sc.textFile('s3n://bucket/2014-??-??/*.gz') a.count() Nick

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Madhu
I have read gzip files from S3 successfully. It sounds like a file is corrupt or not a valid gzip file. Does it work with fewer gzip files? How are you reading the files? - Madhu https://www.linkedin.com/in/msiddalingaiah
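Madhu's "does it work with fewer files?" question suggests a systematic way to isolate a bad input: bisect the file list, rerunning the job on each half. A sketch, assuming exactly one bad file and a `check` callable that returns True when Spark can process the given subset (here it would wrap a try/except around sc.textFile(...).count(); the stub is my own):

```python
def find_bad_file(paths, check):
    """Binary-search a path list for a file that makes `check` fail.

    Assumes the full list fails and at least one bad file is present.
    """
    if len(paths) == 1:
        return paths[0]
    mid = len(paths) // 2
    left, right = paths[:mid], paths[mid:]
    if not check(left):
        return find_bad_file(left, check)
    return find_bad_file(right, check)
```

With N files this needs about log2(N) job runs instead of N, which matters when each count() over S3 is slow.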

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Any tips on how to troubleshoot this? On Thu, May 15, 2014 at 4:15 PM, Nick Chammas wrote: > I’m trying to do a simple count() on a large number of GZipped files in > S3. My job is failing with the following message: > > 14/05/15 19:12:37 WARN scheduler.TaskSetManager: Loss was due to > java.io.IOException: incorrect header check