I am probably not stating my problem correctly and have not yet fully understood
the Java Spark API.

This is what I would like to do: read a file (which is not sorted), sort it by a
key extracted from each line, and then partition it into k files. The only
restriction is that all lines associated with a specific key must be written to
the same file.
This is my code so far:

JavaSparkContext ctx = new JavaSparkContext(args[0], "BasicSplit",
    System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(BasicSplit.class));

JavaRDD<String> input = ctx.textFile(args[1], 1);
// Map each line to a (key, line) pair, with the key extracted from the line.
JavaPairRDD<String, String> ones = input.map(new Split());
// Group by key, partitioning into k partitions.
JavaPairRDD<String, List<String>> twos = ones.groupByKey(k);
// Only the values need to be saved.
JavaRDD<List<String>> threes = twos.values();
// Write each element of each list as a single line.
// Code still to be written ... fours ...
fours.saveAsTextFile(args[2]);

First of all, I am not sure whether groupByKey is the correct approach; maybe
sortByKey is what I am looking for. Any insight would be helpful... there are not
many examples out there for newbies such as myself.
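
One idea I have been toying with (just a rough sketch, assuming a Spark version
with the mapToPair / Java 8 lambda API; the tab-separated key extraction and
passing k in args[3] are placeholders of mine) is to skip groupByKey entirely and
hash-partition the keyed pairs into k partitions, since that alone should
guarantee that all lines with the same key land in the same output file:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class BasicSplit {
    public static void main(String[] args) {
        // Placeholder: number of output files, passed as an extra argument.
        int k = Integer.parseInt(args[3]);

        JavaSparkContext ctx = new JavaSparkContext(args[0], "BasicSplit",
            System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(BasicSplit.class));

        JavaRDD<String> input = ctx.textFile(args[1]);

        // Placeholder key extraction: the key is the first tab-separated field.
        JavaPairRDD<String, String> keyed = input.mapToPair(
            line -> new Tuple2<>(line.split("\t", 2)[0], line));

        // HashPartitioner(k) routes every pair with the same key to the same
        // one of k partitions, so no grouping of the values is needed.
        JavaPairRDD<String, String> partitioned =
            keyed.partitionBy(new HashPartitioner(k));

        // Drop the keys; the partition layout is preserved, so saveAsTextFile
        // writes k part files with all lines of a given key in the same file.
        partitioned.values().saveAsTextFile(args[2]);

        ctx.stop();
    }
}

If the lines within each output file also need to be sorted by key, I could
presumably replace partitionBy with sortByKey(true, k), since range partitioning
by key would likewise keep every line of a key in a single partition while also
sorting it. Does that sound right?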



