I am probably not stating my problem correctly and have not yet fully understood the Java Spark API.
This is what I would like to do: read a file (which is not sorted), sort it by a key extracted from each line, and then partition the initial file into k files. The only restriction is that all lines associated with a specific key are written to the same file. This is my code so far:

    JavaSparkContext ctx = new JavaSparkContext(args[0], "BasicSplit",
        System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(BasicSplit.class));
    JavaRDD<String> input = ctx.textFile(args[1], 1);
    // Map each line to a pair keyed by the value extracted from that line
    JavaPairRDD<String, String> ones = input.map(new Split());
    // Group by key and partition into k partitions
    JavaPairRDD<String, List<String>> twos = ones.groupByKey(k);
    // Keep only the values
    JavaRDD<List<String>> threes = twos.values();
    // Write each element in the list as a single line
    // Code to translate....fours...
    fours.saveAsTextFile(args[2]);

First of all, I am not sure if groupByKey is the correct approach; sortByKey may be what I am looking for. Any insight would be helpful... there are not many examples out there for newbies such as myself.
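For reference, here is a minimal sketch of one way to satisfy the "all lines with the same key go to the same file" restriction, assuming the Spark 1.x Java API. It swaps groupByKey for partitionBy with a HashPartitioner; the Split implementation, the value of k, and the use of mapToPair are assumptions made only for the sketch (the key is taken to be the first whitespace-separated token of each line):

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    public class BasicSplit {
        // Hypothetical key extractor: the key is assumed to be the first
        // whitespace-separated token of each line.
        static class Split implements PairFunction<String, String, String> {
            public Tuple2<String, String> call(String line) {
                return new Tuple2<String, String>(line.split("\\s+", 2)[0], line);
            }
        }

        public static void main(String[] args) {
            int k = 8; // number of output files; value assumed for this sketch
            JavaSparkContext ctx = new JavaSparkContext(args[0], "BasicSplit",
                System.getenv("SPARK_HOME"),
                JavaSparkContext.jarOfClass(BasicSplit.class));

            JavaRDD<String> input = ctx.textFile(args[1], 1);

            // Key each line, then hash-partition into k partitions so that all
            // lines sharing a key land in the same partition.
            JavaPairRDD<String, String> keyed = input.mapToPair(new Split());
            JavaPairRDD<String, String> partitioned =
                keyed.partitionBy(new HashPartitioner(k));

            // Drop the keys again; saveAsTextFile writes one part-file per
            // partition, so each of the k files holds every line whose key
            // hashed to that partition.
            partitioned.values().saveAsTextFile(args[2]);

            ctx.stop();
        }
    }

Hash-partitioning by key keeps lines with the same key together without materializing the per-key lists that groupByKey builds; if the grouped pipeline above is kept instead, the grouped values would still need to be flattened back into individual lines before calling saveAsTextFile.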