Yeah, this is a good way to do it.
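
For reference, here's a minimal, self-contained sketch of the same mapPartitions + grouped pattern. Batch is just a hypothetical stand-in for your protobuf MyProto/Row container so the snippet compiles without the generated classes:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    // Stand-in for the protobuf container (MyProto) from your code.
    case class Batch(rows: Seq[(String, Float)])

    object BatchingExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("batching").setMaster("local[*]"))
        val groupSize = 2

        val rdd: RDD[(String, Float)] =
          sc.parallelize(Seq("a" -> 1f, "b" -> 2f, "c" -> 3f, "d" -> 4f, "e" -> 5f, "f" -> 6f))

        // grouped(groupSize) batches consecutive elements within each partition,
        // so the last batch in a partition may be smaller than groupSize.
        val batched: RDD[Batch] =
          rdd.mapPartitions(_.grouped(groupSize).map(seq => Batch(seq)))

        batched.collect().foreach(println)
        sc.stop()
      }
    }

Note that grouped works per partition, so batches never cross partition boundaries and the last batch in each partition can be short. From there you'd serialize each batch into your protobuf and write it out as you were planning.
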
On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey <vipan...@gmail.com> wrote:
> Hi,
>
> I need to batch the values in my final RDD before writing them out to HDFS. The
> idea is to batch multiple "rows" into a protobuf and write those batches out,
> mostly to save some space since a lot of the metadata is the same.
> e.g. given 1,2,3,4,5,6, batch them as (1,2), (3,4), (5,6) and save three
> records instead of six.
>
> What I'm doing is calling mapPartitions and using the iterator's grouped
> function with a groupSize:
>
> val protoRDD: RDD[MyProto] =
>   rdd.mapPartitions[MyProto](_.grouped(groupSize).map(seq => {
>     val profiles = MyProto(...)
>     seq.foreach(x => {
>       val row = new Row(x._1.toString)
>       row.setFloatValue(x._2)
>       profiles.addRow(row)
>     })
>     profiles
>   }))
>
> I haven't been able to test it out because of a separate issue (protobuf
> version mismatch - in a different thread), but I'm hoping it will work.
>
> Is there a better/more straightforward way of doing this?
>
> Thanks
> Vipul