Hello,
I am trying to write multiple output files with Spark, but I cannot find a way
to do it.
Here is the idea.
val rddKeyValue: RDD[(String, String)] = rddlines.map(line => createKeyValue(line))
Now I would like to save this so that each key produces a file named
<keyname>.txt containing all of that key's values.
I tried writing the files after the map, but each write would overwrite the
file, so I would get only one value per file.
With groupByKey I get an OutOfMemoryError, so I wonder if there is a way to
append each new line to the text file for its key?
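To make the layout I am after concrete, here is a minimal local sketch in plain Scala (no Spark; the tab-separated sample input and the createKeyValue stand-in are just assumptions for illustration) — every value that shares a key is collected together, as if appended to one <key>.txt file:

```scala
object KeyGroupSketch {
  // Stand-in for createKeyValue (assumption): split a tab-separated
  // line into a (key, value) pair.
  def createKeyValue(line: String): (String, String) = {
    val Array(k, v) = line.split("\t", 2)
    (k, v)
  }

  // Group values by key: the file <key>.txt would hold all values for
  // that key, one per line -- the "append" behaviour I want.
  def groupPerFile(lines: Seq[String]): Map[String, String] =
    lines.map(createKeyValue).groupBy(_._1).map { case (k, kvs) =>
      (s"$k.txt", kvs.map(_._2).mkString("\n"))
    }
}
```

For example, groupPerFile(Seq("a\t1", "a\t2", "b\t3")) gives Map("a.txt" -> "1\n2", "b.txt" -> "3"). This is effectively what groupByKey materialises per key, which is why it runs out of memory on large key groups.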
On Hadoop we can use an IdentityReducer together with a KeyBasedOutput format [1].
I tried this:
rddKeyValue.saveAsHadoopFile("hdfs://test-platform-analytics-master/tmp/dump/product",
  classOf[String], classOf[String], classOf[KeyBasedOutput[String, String]])
[1]
class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {

  /**
   * Use the key as part of the path for the final output file.
   */
  override protected def generateFileNameForKeyValue(key: T, value: V, leaf: String) =
    key.toString

  /**
   * When actually writing the data, discard the key since it is already in
   * the file path.
   */
  override protected def generateActualKey(key: T, value: V) =
    null
}
Thanks a lot