Hi, I have the following code, which runs on a thread that becomes a child job of
my main Spark job. Because of coalesce(1) it takes hours to run on large data
(around 1-2 GB); if the data is only in the MB/KB range it finishes quickly, and
on even larger data sets it sometimes does not complete at all. Please guide me
on what I am doing wrong. Thanks in advance.
JavaRDD<Row> maskedRDD = sourceRdd.coalesce(1, true)
        .mapPartitionsWithIndex(new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
            @Override
            public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator) throws Exception {
                List<Row> rowList = new ArrayList<>();
                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    // Convert the Row to a java.util.List, transform it, and rebuild the Row.
                    List rowAsList = updateRowsMethod(JavaConversions.seqAsJavaList(row.toSeq()));
                    rowList.add(RowFactory.create(rowAsList.toArray()));
                }
                return rowList.iterator();
            }
        }, true);
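
For context on where the time goes: coalesce(1, true) shuffles every row into a
single partition, so one task does all the work regardless of cluster size. The
transform above is purely per-row and never uses the partition index, so the
same logic can run on all partitions in parallel with a plain map. Below is a
minimal, self-contained sketch of that idea; updateRowsMethod here is a
hypothetical stand-in (it just upper-cases String fields so the effect is
visible), not the poster's actual implementation, and the output path in the
trailing comment is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import scala.collection.JavaConversions;

public class MaskWithoutCoalesce {

    // Hypothetical stand-in for the original updateRowsMethod: upper-cases
    // String fields and passes everything else through unchanged.
    static List<Object> updateRowsMethod(List<Object> fields) {
        List<Object> out = new ArrayList<>();
        for (Object f : fields) {
            out.add(f instanceof String ? ((String) f).toUpperCase() : f);
        }
        return out;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("mask").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Row> sourceRdd = sc.parallelize(Arrays.asList(
                RowFactory.create("alice", 1),
                RowFactory.create("bob", 2)), 2);

        // No coalesce(1): each partition is masked in parallel by its own task.
        JavaRDD<Row> maskedRDD = sourceRdd.map(row ->
                RowFactory.create(
                        updateRowsMethod(
                                JavaConversions.seqAsJavaList(row.toSeq())).toArray()));

        for (Row r : maskedRDD.collect()) {
            System.out.println(r);
        }

        // If a single output file is really required, collapse partitions only
        // at write time, after the parallel transform has finished:
        // maskedRDD.coalesce(1).saveAsTextFile("/some/output/path");
        sc.stop();
    }
}
```

The key point is that the expensive work (the per-row masking) happens before
any reduction in partition count, so it scales with the number of executors; a
final coalesce(1) on already-transformed data is far cheaper than forcing the
transform itself through one task.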
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimiz-and-make-this-code-faster-using-coalesce-1-and-mapPartitionIndex-tp25947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.