sortByKey would be better, I think, as I am not sure groupByKey will sort the keyspace globally.
I would say you take:

  input (K, V)
  groupByKey                        => (K, Seq(V...))
  partitionBy(default partitioner)  (hash)
  sortByKey                         => sorted (K, Seq(V...))
  output this

The only thing is, if you need (K, V) pairs you will have to construct them again.

Optionally you can do this:

  input (K, V)
  partitionBy(RangePartitioner)
  sortByKey                         (each partition is individually sorted)
  output this

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi


On Thu, Feb 27, 2014 at 2:37 AM, dmpour23 <dmpou...@gmail.com> wrote:

> I am probably not stating my problem correctly and have not yet fully
> understood the Java Spark API.
>
> This is what I would like to do: read a file (which is not sorted), sort
> the file by a key extracted from each line, then partition the initial
> file into k files. The only restriction is that all lines associated with
> a specific key are written to the same file.
>
> This is my code so far:
>
> JavaSparkContext ctx = new JavaSparkContext(args[0], "BasicSplit",
>     System.getenv("SPARK_HOME"),
>     JavaSparkContext.jarOfClass(BasicSplit.class));
>
> JavaRDD<String> input = ctx.textFile(args[1], 1);
> // Map based on a key extracted from each line.
> JavaPairRDD<String, String> ones = input.map(new Split());
> // Group based on key, partitioned into k partitions.
> JavaPairRDD<String, List<String>> twos = ones.groupByKey(k);
> // Require only the values to be saved.
> JavaRDD<List<String>> threes = twos.values();
> // Write each element in the list as a single line.
> // Code to translate... fours...
> fours.saveAsTextFile(args[2]);
>
> First of all, I am not sure if groupByKey is the correct approach;
> sortByKey may be what I am looking for. Any insight would be helpful...
> there are not many examples out there for newbies such as myself.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-or-groupByKey-to-Count-tp1765p2110.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
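To make the partitioning guarantee in the advice above concrete, here is a minimal sketch in plain Java (no Spark; the class name HashPartitionSketch and the sample data are invented for illustration). It mimics what a default hash partitioner does: every record with the same key maps to the same partition index, which is exactly the restriction in the question, since one output file per partition then keeps all lines for a key together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of hash-partitioning behaviour in plain Java (no Spark).
// Records with the same key always land in the same partition.
public class HashPartitionSketch {

    // Place each (key, value) pair into partition hash(key) mod k.
    static List<List<Map.Entry<String, String>>> partition(
            List<Map.Entry<String, String>> pairs, int k) {
        List<List<Map.Entry<String, String>>> parts = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            parts.add(new ArrayList<>());
        }
        for (Map.Entry<String, String> pair : pairs) {
            int idx = Math.floorMod(pair.getKey().hashCode(), k);
            parts.get(idx).add(pair);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = List.of(
                Map.entry("a", "line 1"), Map.entry("b", "line 2"),
                Map.entry("a", "line 3"), Map.entry("c", "line 4"),
                Map.entry("b", "line 5"));
        List<List<Map.Entry<String, String>>> parts = partition(pairs, 3);
        // Each key appears in exactly one partition, so writing one
        // output file per partition keeps all lines for a key together.
        for (String key : new String[] {"a", "b", "c"}) {
            int homes = 0;
            for (List<Map.Entry<String, String>> part : parts) {
                if (part.stream().anyMatch(e -> e.getKey().equals(key))) {
                    homes++;
                }
            }
            if (homes != 1) {
                throw new AssertionError(key + " spread across partitions");
            }
        }
        System.out.println("every key lives in exactly one partition");
    }
}
```

Note this sketch says nothing about ordering: a hash partitioner co-locates keys but does not sort them, which is why the second suggestion above uses a range partitioner plus sortByKey when a globally sorted output is needed.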