Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
Thanks, this direction seems to be in line with what I want. What I really want is groupBy(), and then, for the rows in each group, to get an Iterator and run each element from the iterator through a local function (specifically SGD). Right now the Dataset API provides this, but it's literally an Iter…

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
Thanks, this is exactly what I ended up doing in the end. Though it seemed to work, there seems to be no guarantee that the randomness after the sortWithinPartitions() would be preserved after I do a further groupBy. On Fri, Oct 21, 2016 at 3:55 PM, Cheng Lian wrote: > I think it would be much easier…

Re: RDD groupBy() then random sort each group ?

2016-10-22 Thread Koert Kuipers
groupBy always materializes the entire group (on disk or in memory), which is why you should avoid it for large groups. The key is to never materialize the grouped and shuffled data. To see one approach to doing this, take a look at https://github.com/tresata/spark-sorted. It's basically a combination…
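The idea above can be sketched with Spark's repartitionAndSortWithinPartitions: key each sample by (groupId, randomTag), partition on the group id alone, and let the shuffle sort by the random tag, so each group arrives as a contiguous, randomly ordered run that can be streamed without ever being collected. This is a hedged sketch, not the spark-sorted library itself; `Sample`, `ModelPartitioner`, and `shuffleWithinGroups` are illustrative names, not APIs from the thread.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.util.Random

// Hypothetical record type for illustration.
case class Sample(modelId: Long, features: Array[Double])

// Partition only on the modelId half of the key, so one group
// never straddles two partitions.
class ModelPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (modelId: Long, _) => ((modelId % partitions) + partitions).toInt % partitions
  }
}

def shuffleWithinGroups(samples: RDD[Sample]): RDD[((Long, Double), Sample)] = {
  // Composite key: (group id, random tag). The shuffle sorts each
  // partition by this key, so within a group the order is random.
  val keyed = samples.map(s => ((s.modelId, Random.nextDouble()), s))
  keyed.repartitionAndSortWithinPartitions(new ModelPartitioner(100))
}
```

Downstream, mapPartitions can walk the sorted iterator and detect group boundaries by watching the modelId change, so no group is ever held in memory at once.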

Re: RDD groupBy() then random sort each group ?

2016-10-21 Thread Cheng Lian
I think it would be much easier to use the DataFrame API to do this, by doing a local sort using randn() as the key. For example, in Spark 2.0: val df = spark.range(100) val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42)) Replace df with a DataFrame wrapping your RDD, and $"id" % 10 with…
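Expanded into a self-contained form, the suggestion above looks like this (a sketch assuming Spark 2.x; the app name and the `% 10` group key are placeholders for the poster's real grouping column):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.randn

val spark = SparkSession.builder().appName("group-shuffle").getOrCreate()
import spark.implicits._

val df = spark.range(100).toDF("id")

// Co-locate each group in one partition, then sort each partition by a
// random column. randn(42) has a fixed seed, so the shuffle order is
// reproducible across runs.
val shuffled = df
  .repartition($"id" % 10)
  .sortWithinPartitions(randn(42))
```

One caveat, echoing the reply earlier in the thread: this ordering is a property of the current partition layout, and a subsequent wide operation such as groupBy may re-shuffle the data and discard it.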

RDD groupBy() then random sort each group ?

2016-10-20 Thread Yang
In my application, I group training samples by their model_ids (the input table contains training samples for 100k different models). Each group then ends up having about 1 million training samples, and I feed that group of samples to a little logistic regression solver (SGD), but SGD r…
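The straightforward version of what the poster describes can be sketched as follows; note that groupByKey materializes each ~1M-sample group on one executor, which is exactly the cost later replies warn about. `trainAll` and the in-comment SGD step are hypothetical names for illustration, not APIs from the thread.

```scala
import org.apache.spark.rdd.RDD
import scala.util.Random

// Naive sketch: group by model id, randomly shuffle each group locally,
// then stream it through a per-group solver. The whole group is held in
// memory on one executor while it is shuffled.
def trainAll(samples: RDD[(Long, Array[Double])]): RDD[(Long, Iterator[Array[Double]])] =
  samples.groupByKey().mapValues { group =>
    val randomized = Random.shuffle(group.toSeq) // materializes the group
    randomized.iterator                          // feed this to a local SGD step
  }
```

For groups of this size, the repartition-and-sort approaches discussed later in the thread avoid this materialization entirely.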

Re: DataFrame groupBy vs RDD groupBy

2015-05-23 Thread ayan guha
>> u => val rs = RowSchema(u, schemaOriginel) val col1 = rs.getValueByName("col1") val col2 = rs.getValueByName("col2") (col1, col2) } .zipW…

Re: DataFrame groupBy vs RDD groupBy

2015-05-22 Thread Michael Armbrust
…("col2") > (col1, col2) > } .zipWithIndex > } > I think the SQL equivalent of what I try to do: > SELECT a, ROW_NUMBER() OVER (PARTITION BY a) AS num > FROM table. > I don'…

DataFrame groupBy vs RDD groupBy

2015-05-22 Thread gtanguy
…PARTITION BY a) AS num FROM table. I don't think I can do this with a GroupedData (result of df.groupBy). Any ideas on how I can speed this up? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-groupBy-vs-RDD-groupBy-tp22995.html
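The SQL in the question maps directly onto Spark's DataFrame window-function API rather than GroupedData. A hedged sketch, where `df` and column `a` stand in for the poster's table (note that Spark requires an ORDER BY inside the window for row_number, so some ordering column must be chosen):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// ROW_NUMBER() OVER (PARTITION BY a ORDER BY a) as a window spec.
// Ordering by the partition column itself makes the numbering within a
// partition arbitrary; order by a real column if the order matters.
val w = Window.partitionBy($"a").orderBy($"a")
val numbered = df.withColumn("num", row_number().over(w))
```

This avoids the collect-and-zip workaround in the quoted code and keeps the numbering distributed.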

Question on RDD groupBy and executors

2015-03-17 Thread Vijayasarathy Kannan
Hi, I am doing a groupBy on an EdgeRDD like this: val groupedEdges = graph.edges.groupBy[VertexId](func0) while(true) { val info = groupedEdges.flatMap(func1).collect.foreach(func2) } The groupBy distributes the data to different executors on different nodes in the cluster. Given a key K (a v…
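One way to see where each key's group physically lands is to tag every record with its partition index; groupBy shuffles all values for a key into a single partition, so each key's iterator is consumed by exactly one task on one executor. A small diagnostic sketch (assuming `groupedEdges` from the question above):

```scala
// Emit (partition index, key, group size) for every group, then pull the
// small summary back to the driver. Each key appears in exactly one
// partition, which answers where its group is processed.
groupedEdges.mapPartitionsWithIndex { (idx, iter) =>
  iter.map { case (k, edges) => (idx, k, edges.size) }
}.collect().foreach(println)
```

Different partitions run as separate tasks, so distinct keys in distinct partitions are processed in parallel.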

Re: RDD groupBy

2015-02-23 Thread Vijayasarathy Kannan
> "println(vertexId)" doesn't print anything. I have made sure that "f2" doesn't return an empty iterable. I am not sure what I am missing here. 2. I am assuming that "f2" is called for each group in para…

Re: RDD groupBy

2015-02-23 Thread Sean Owen
> …empty iterable. I am not sure what I am missing here. 2. I am assuming that "f2" is called for each group in parallel. Is this correct? If not, what is the correct way to operate on each group in parallel? Thanks!

RDD groupBy

2015-02-23 Thread kvvt
…not sure what I am missing here. 2. I am assuming that "f2" is called for each group in parallel. Is this correct? If not, what is the correct way to operate on each group in parallel? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD…
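On question 2, the answer hinges on where the function runs: a function applied through mapValues (or flatMapValues) on a grouped RDD executes inside the tasks that own each partition, so groups in different partitions are processed in parallel, while code run on the driver after collect is not. A sketch using placeholder names matching the thread (`rdd`, `f1`, `f2` are the poster's, not real APIs):

```scala
val grouped = rdd.groupBy(f1)        // RDD[(K, Iterable[V])]
val results = grouped.mapValues(f2)  // f2 runs once per group, inside executor tasks
results.count()                      // an action triggers the parallel execution
```

By contrast, grouped.collect().foreach { case (k, vs) => f2(vs) } would run f2 serially on the driver, which is a common source of the "nothing printed on the driver" confusion: println inside an executor-side closure writes to the executors' stdout, not the driver's console.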