Hello,
I am trying to write multiple output files with Spark, but I cannot find a way
to do it.
Here is the idea.
val rddKeyValue: RDD[(String, String)] = rddlines.map(line => createKeyValue(line))
Now I would like to save this so that each key produces a file named
<keyname>.txt containing all of that key's values.
I tried writing the files after the map, but each write would overwrite the
file, so I would get only one value per file.
With groupByKey I get an OutOfMemoryError, so I wonder if there is a way to
append each new line to the text file for its key?
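To make the layout I am after concrete, here is a minimal local sketch in plain Scala (no Spark; the tab-separated sample input and the createKeyValue stand-in are just assumptions for illustration) — every value that shares a key is collected together, as if appended to one <key>.txt file:

```scala
object KeyGroupSketch {
  // Stand-in for createKeyValue (assumption): split a tab-separated
  // line into a (key, value) pair.
  def createKeyValue(line: String): (String, String) = {
    val Array(k, v) = line.split("\t", 2)
    (k, v)
  }

  // Group values by key: the file <key>.txt would hold all values for
  // that key, one per line -- the "append" behaviour I want.
  def groupPerFile(lines: Seq[String]): Map[String, String] =
    lines.map(createKeyValue).groupBy(_._1).map { case (k, kvs) =>
      (s"$k.txt", kvs.map(_._2).mkString("\n"))
    }
}
```

For example, groupPerFile(Seq("a\t1", "a\t2", "b\t3")) gives Map("a.txt" -> "1\n2", "b.txt" -> "3"). This is effectively what groupByKey materialises per key, which is why it runs out of memory on large key groups.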
On Hadoop we can use an IdentityReducer together with a KeyBasedOutput format [1].
I tried this:
rddKeyValue.saveAsHadoopFile("hdfs://test-platform-analytics-master/tmp/dump/product",
  classOf[String], classOf[String], classOf[KeyBasedOutput[String, String]])
[1]
class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {

  /**
   * Use the key as part of the path for the final output file.
   */
  override protected def generateFileNameForKeyValue(key: T, value: V, leaf: String) =
    key.toString

  /**
   * When actually writing the data, discard the key since it is already in
   * the file path.
   */
  override protected def generateActualKey(key: T, value: V) =
    null
}
Thanks a lot