Thanks all, it works fine now and I managed to compress the output. However, I am still stuck: how can I set the compression type for Snappy? I mean choosing record-level or block-level compression for the output.

On Apr 3, 2014 1:15 AM, "Nicholas Chammas" <[email protected]> wrote:
> Thanks for pointing that out.
>
> On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra <[email protected]> wrote:
>
>> First, you shouldn't be using spark.incubator.apache.org anymore, just
>> spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in
>> the Python API at this point.
>>
>> On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas <[email protected]> wrote:
>>
>>> Is this a Scala-only
>>> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile>
>>> feature?
>>>
>>> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell <[email protected]> wrote:
>>>
>>>> For textFile I believe we overload it and let you set a codec directly:
>>>>
>>>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>>>>
>>>> For saveAsSequenceFile yep, I think Mark is right, you need an option.
>>>>
>>>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra <[email protected]> wrote:
>>>>
>>>>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>>>>
>>>>> The signature is 'def saveAsSequenceFile(path: String, codec:
>>>>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a
>>>>> Class, not an Option[Class].
>>>>>
>>>>> Try counts.saveAsSequenceFile(output,
>>>>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>>>>
>>>>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev <[email protected]> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I've started using Spark recently and am evaluating possible use cases
>>>>>> in our company.
>>>>>>
>>>>>> I'm trying to save an RDD as a compressed Sequence file. I'm able to save
>>>>>> a non-compressed file by calling:
>>>>>>
>>>>>> counts.saveAsSequenceFile(output)
>>>>>>
>>>>>> where counts is my RDD (IntWritable, Text). However, I didn't manage
>>>>>> to compress the output. I tried several configurations and always got
>>>>>> an exception:
>>>>>>
>>>>>> counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>> <console>:21: error: type mismatch;
>>>>>>  found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>        counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>
>>>>>> counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>> <console>:21: error: type mismatch;
>>>>>>  found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>        counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>
>>>>>> and it doesn't work even for Gzip:
>>>>>>
>>>>>> counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>> <console>:21: error: type mismatch;
>>>>>>  found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>        counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>
>>>>>> Could you please suggest a solution? Also, I couldn't find how to
>>>>>> specify compression parameters (i.e. the compression type for
>>>>>> Snappy). Could you share code snippets for writing/reading an
>>>>>> RDD with compression?
>>>>>>
>>>>>> Thank you in advance,
>>>>>> Konstantin Kudryavtsev
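
On the open question of record versus block compression, here is a minimal sketch for spark-shell rather than a confirmed answer from the thread. It assumes saveAsSequenceFile builds its Hadoop job configuration from sc.hadoopConfiguration, and that Hadoop's SequenceFileOutputFormat reads the compression settings from the old-style mapred.output.* keys (newer Hadoop releases map these to renamed keys). The names counts and output are the ones from the original question; restored is a made-up name for illustration. I believe some Spark versions force the type to BLOCK whenever the codec argument is passed, so the sketch leaves that argument out and sets everything through the configuration; check saveAsHadoopFile in your Spark version to be sure.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.SnappyCodec

// Assumption: these keys are picked up by the job that saveAsSequenceFile runs.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("mapred.output.compress", "true")
hadoopConf.set("mapred.output.compression.codec", classOf[SnappyCodec].getName)
// RECORD compresses each value on its own; BLOCK compresses batches of records.
hadoopConf.set("mapred.output.compression.type", CompressionType.RECORD.toString)

// No codec argument here, so the settings above are not overridden.
counts.saveAsSequenceFile(output)

// Reading back: the codec and compression type are stored in the file header,
// so nothing extra should be needed. Hadoop reuses Writable instances, hence
// the copy into plain Scala values.
val restored = sc.sequenceFile(output, classOf[IntWritable], classOf[Text])
  .map { case (k, v) => (k.get, v.toString) }
```

To try block compression instead, swap CompressionType.RECORD for CompressionType.BLOCK. Since sc.hadoopConfiguration is shared, these settings would also apply to later Hadoop-format saves from the same SparkContext.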
