Hi there,
I've started using Spark recently and am evaluating possible use cases in our
company.
I'm trying to save an RDD as a compressed SequenceFile. I'm able to save an
uncompressed file by calling:
counts.saveAsSequenceFile(output)
where counts is my RDD of (IntWritable, Text) pairs. However, I haven't managed
to compress the output. I tried several variants and always got a type mismatch
error:
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
<console>:21: error: type mismatch;
 found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
              counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
<console>:21: error: type mismatch;
 found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
              counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
and it doesn't work even for Gzip:
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
<console>:21: error: type mismatch;
 found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
              counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
Could you please suggest a solution? Also, I couldn't find how to specify
compression parameters (e.g. the compression type for Snappy). Could you share
code snippets for writing and reading an RDD with compression?
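
A note on the errors quoted above: the "required:" line says the second argument must be an Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]], so wrapping the codec class in Some(...) should satisfy the compiler. The sketch below assumes a spark-shell session in local mode (so that sc already exists and a plain path resolves to the local filesystem). The sample data and the temporary output path are made up for illustration. DefaultCodec is used because it does not depend on native Hadoop libraries; GzipCodec or SnappyCodec are passed the same way but may need those libraries installed. Note that org.apache.spark.io.SnappyCompressionCodec is Spark's internal codec and does not implement Hadoop's CompressionCodec, so it would not fit this parameter even when wrapped.

```scala
import org.apache.hadoop.io.compress.DefaultCodec

// Hypothetical sample data; Spark converts Int/String to IntWritable/Text on write.
val counts = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))

// Hypothetical output location: a fresh path that does not exist yet.
val output = java.nio.file.Files.createTempDirectory("seqfile-demo").resolve("counts").toString

// The codec parameter is an Option, so the class goes inside Some(...).
counts.saveAsSequenceFile(output, Some(classOf[DefaultCodec]))
```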
Thank you in advance,
Konstantin Kudryavtsev
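
On writing and then reading back with compression: reading should not need a codec argument, because a SequenceFile records its codec in the file header. It also appears that Spark sets the SequenceFile compression type to BLOCK itself when a codec is passed to saveAsSequenceFile; treat that as an assumption to check against the Spark version in use. The sketch below is a minimal round trip, again assuming spark-shell in local mode, with made-up sample data and a temporary path.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.DefaultCodec

// Hypothetical sample data and output path (the path must not exist yet).
val sample = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))
val path = java.nio.file.Files.createTempDirectory("seqfile-rw").resolve("counts").toString

// Write a compressed SequenceFile; the codec class is wrapped in Some(...).
sample.saveAsSequenceFile(path, Some(classOf[DefaultCodec]))

// Read it back, naming the Writable key and value classes explicitly.
val raw = sc.sequenceFile(path, classOf[IntWritable], classOf[Text])

// Hadoop reuses Writable objects, so copy them into plain values before collecting.
val restored = raw.map { case (k, v) => (k.get, v.toString) }.collect().sortBy(_._1)
```

The map step matters: collecting the raw Writable pairs directly can yield repeated or non-serializable objects, so converting to Int and String first is the safer habit.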