First, you shouldn't be using spark.incubator.apache.org anymore, just spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in the Python API at this point.
On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas <[email protected]> wrote:

> Is this a Scala-only
> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile>
> feature?
>
>
> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell <[email protected]> wrote:
>
>> For textFile I believe we overload it and let you set a codec directly:
>>
>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>>
>> For saveAsSequenceFile yep, I think Mark is right, you need an option.
>>
>>
>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra <[email protected]> wrote:
>>
>>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>>
>>> The signature is 'def saveAsSequenceFile(path: String, codec:
>>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a
>>> Class, not an Option[Class].
>>>
>>> Try counts.saveAsSequenceFile(output,
>>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>>
>>>
>>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev <[email protected]> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I've started using Spark recently and am evaluating possible use cases
>>>> in our company.
>>>>
>>>> I'm trying to save an RDD as a compressed Sequence file. I'm able to
>>>> save a non-compressed file by calling:
>>>>
>>>> counts.saveAsSequenceFile(output)
>>>>
>>>> where counts is my RDD (IntWritable, Text). However, I didn't manage to
>>>> compress the output. I tried several configurations and always got an
>>>> exception:
>>>>
>>>> counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>> <console>:21: error: type mismatch;
>>>>  found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>        counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>
>>>> counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>> <console>:21: error: type mismatch;
>>>>  found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>        counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>
>>>> and it doesn't work even for Gzip:
>>>>
>>>> counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>> <console>:21: error: type mismatch;
>>>>  found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>        counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>
>>>> Could you please suggest a solution? Also, I couldn't find how to
>>>> specify compression parameters (e.g. the compression type for Snappy).
>>>> Could you share code snippets for writing/reading an RDD with
>>>> compression?
>>>>
>>>> Thank you in advance,
>>>> Konstantin Kudryavtsev
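For anyone landing on this thread later, here is a rough Scala sketch of the fix discussed above, meant to be pasted into spark-shell (so `sc` is assumed to exist already). The sample data and the /tmp output paths are made up for illustration, and the paths must not exist before the job runs. I used GzipCodec rather than SnappyCodec only because Snappy usually needs the native Hadoop libraries installed; the call shape is the same for either codec.

```scala
// Sketch for spark-shell, where `sc` (the SparkContext) is already defined.
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.SparkContext._ // implicits that add saveAsSequenceFile to pair RDDs

// Hypothetical sample data and output paths, for illustration only.
val counts = sc.parallelize(Seq((1, "one"), (2, "two")))
val seqOut = "/tmp/counts-seq-gzip"
val textOut = "/tmp/counts-text-gzip"

// saveAsSequenceFile takes Option[Class[_ <: CompressionCodec]],
// so the codec class has to be wrapped in Some(...).
counts.saveAsSequenceFile(seqOut, Some(classOf[GzipCodec]))

// saveAsTextFile has an overload that takes the codec class directly (no Option).
counts.saveAsTextFile(textOut, classOf[GzipCodec])

// Reading back needs no codec argument: a SequenceFile records its codec in the header.
val restored = sc.sequenceFile[Int, String](seqOut).collect().toSet
```

Passing a bare classOf[...] to saveAsSequenceFile reproduces the "type mismatch" errors from the original question, since a Class is not an Option[Class].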
