sortByKey would be better, I think, as I am not sure groupByKey will sort the keyspace globally.
I would say you take:

  input (K, V)
  groupByKey                        => (K, Seq(V...))
  partitionBy(default partitioner)  (hash)
  sortByKey                         => sorted (K, Seq(V...))
  output this

The only thing is, if you need (K, V) pairs you will have to construct them again.

Optionally you can do this:

  input (K, V)
  partitionBy(RangePartitioner)
  sortByKey                         (each partition is individually sorted)
  output this

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi


On Thu, Feb 27, 2014 at 2:37 AM, dmpour23 <dmpou...@gmail.com> wrote:

> I am probably not stating my problem correctly and have not yet fully
> understood the Java Spark API.
>
> This is what I would like to do: read a file (which is not sorted), sort
> the file by a key extracted from each line, then partition the initial
> file into k files. The only restriction is that all lines associated with
> a specific key are written to the same file.
>
> This is my code so far:
>
> JavaSparkContext ctx = new JavaSparkContext(args[0], "BasicSplit",
>     System.getenv("SPARK_HOME"),
>     JavaSparkContext.jarOfClass(BasicSplit.class));
>
> JavaRDD<String> input = ctx.textFile(args[1], 1);
> // Map based on a key extracted from each line.
> JavaPairRDD<String, String> ones = input.map(new Split());
> // Group based on key, partitioned into k partitions.
> JavaPairRDD<String, List<String>> twos = ones.groupByKey(k);
> // Require only the values to be saved.
> JavaRDD<List<String>> threes = twos.values();
> // Write each element in the list as a single line.
> // Code to translate... fours...
> fours.saveAsTextFile(args[2]);
>
> First of all, I am not sure if groupByKey is the correct approach;
> sortByKey may be what I am looking for. Any insight would be helpful...
> there are not many examples out there for newbies such as myself.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-or-groupByKey-to-Count-tp1765p2110.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
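To make the partitioning guarantee in the advice above concrete, here is a minimal sketch in plain Java (no Spark; the class name HashPartitionSketch and the sample data are invented for illustration). It mimics what a default hash partitioner does: every record with the same key maps to the same partition index, which is exactly the restriction in the question, since one output file per partition then keeps all lines for a key together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of hash-partitioning behaviour in plain Java (no Spark).
// Records with the same key always land in the same partition.
public class HashPartitionSketch {

    // Place each (key, value) pair into partition hash(key) mod k.
    static List<List<Map.Entry<String, String>>> partition(
            List<Map.Entry<String, String>> pairs, int k) {
        List<List<Map.Entry<String, String>>> parts = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            parts.add(new ArrayList<>());
        }
        for (Map.Entry<String, String> pair : pairs) {
            int idx = Math.floorMod(pair.getKey().hashCode(), k);
            parts.get(idx).add(pair);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = List.of(
                Map.entry("a", "line 1"), Map.entry("b", "line 2"),
                Map.entry("a", "line 3"), Map.entry("c", "line 4"),
                Map.entry("b", "line 5"));
        List<List<Map.Entry<String, String>>> parts = partition(pairs, 3);
        // Each key appears in exactly one partition, so writing one
        // output file per partition keeps all lines for a key together.
        for (String key : new String[] {"a", "b", "c"}) {
            int homes = 0;
            for (List<Map.Entry<String, String>> part : parts) {
                if (part.stream().anyMatch(e -> e.getKey().equals(key))) {
                    homes++;
                }
            }
            if (homes != 1) {
                throw new AssertionError(key + " spread across partitions");
            }
        }
        System.out.println("every key lives in exactly one partition");
    }
}
```

Note this sketch says nothing about ordering: a hash partitioner co-locates keys but does not sort them, which is why the second suggestion above uses a range partitioner plus sortByKey when a globally sorted output is needed.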