Thanks, this direction seems to be in line with what I want.

What I really want is groupBy() and then, for the rows in each group, get an
Iterator and run each element from the iterator through a local function
(specifically SGD). Right now the DataSet API provides this, but it's
literally an Iterator.

Thanks.
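For reference, a minimal sketch of that pattern with the Dataset API (Spark
2.x). The Sample case class and the sgdStep function here are hypothetical
stand-ins, not anything from this thread:

    import org.apache.spark.sql.SparkSession

    case class Sample(modelId: Long, feature: Double, label: Double)

    // Hypothetical single-sample SGD update for a one-weight linear model.
    def sgdStep(w: Double, s: Sample, lr: Double = 0.01): Double =
      w - lr * ((w * s.feature) - s.label) * s.feature

    val spark = SparkSession.builder.appName("grouped-sgd").getOrCreate()
    import spark.implicits._

    val samples = spark.range(1000).map(i => Sample(i % 10, i.toDouble, 2 * i.toDouble))

    // groupByKey + mapGroups hands each group to the function as a plain
    // Iterator, so samples stream through sgdStep one at a time instead of
    // being collected into a per-group collection first.
    val weights = samples
      .groupByKey(_.modelId)
      .mapGroups { (modelId, it) => (modelId, it.foldLeft(0.0)(sgdStep(_, _))) }

    weights.show()

Note the single-pass limitation mentioned above still applies: the Iterator
can only be consumed once, so multi-epoch SGD would need its own buffering.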
Exactly, this is what I ended up doing finally. Though it seemed to work,
there seems to be no guarantee that the randomness after the
sortWithinPartitions() would be preserved after I do a further groupBy.
On Fri, Oct 21, 2016 at 3:55 PM, Cheng Lian wrote:
> I think it would be much easier to use the DataFrame API to do this, by
> doing a local sort using randn() as the key.
groupBy always materializes the entire group (on disk or in memory), which
is why you should avoid it for large groups.

The key is to never materialize the grouped and shuffled data. To see one
approach to do this, take a look at
https://github.com/tresata/spark-sorted
It's basically a combination of smart partitioning and a secondary sort, so
that each group can be processed as a stream without being materialized.
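A rough sketch of that underlying technique with the plain RDD API (this is
not spark-sorted itself, just the idea, assuming Long keys and a hypothetical
foldGroups helper):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Hash-partition by key and sort by key within each partition, so all
    // records of one group are adjacent. mapPartitions then folds each run
    // of equal keys without ever materializing a whole group.
    def foldGroups[V, B](rdd: RDD[(Long, V)], numPartitions: Int)
                        (zero: B)(step: (B, V) => B): RDD[(Long, B)] =
      rdd
        .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
        .mapPartitions { it =>
          val in = it.buffered
          new Iterator[(Long, B)] {
            def hasNext: Boolean = in.hasNext
            def next(): (Long, B) = {
              val key = in.head._1
              var acc = zero
              while (in.hasNext && in.head._1 == key) acc = step(acc, in.next()._2)
              (key, acc)
            }
          }
        }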
I think it would be much easier to use the DataFrame API to do this, by
doing a local sort using randn() as the key. For example, in Spark 2.0:

val df = spark.range(100)
val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42))

Replace df with a DataFrame wrapping your RDD, and $"id" % 10 with your
actual grouping expression.
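Filled out with the imports it needs, that snippet would look roughly like
this (assuming a SparkSession named spark):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.randn

    val spark = SparkSession.builder.appName("local-shuffle").getOrCreate()
    import spark.implicits._

    val df = spark.range(100).toDF("id")
    // repartition co-locates rows of the same group; sortWithinPartitions
    // then randomizes their local order. The seed (42) makes the "random"
    // order reproducible across runs.
    val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42))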
In my application, I group training samples by their model_id's (the input
table contains training samples for 100k different models). Each group ends
up having about 1 million training samples, and I then feed that group of
samples to a little Logistic Regression solver (SGD), but SGD requires the
samples within each group to be in random order.
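For that use case, the randn() trick suggested above can be combined with the
grouping key, roughly like this (a sketch; the samples DataFrame and its
column names are hypothetical):

    import org.apache.spark.sql.functions.{col, randn}

    // Assume samples has columns model_id, feature, label. Put each model's
    // rows in one partition, keep each model's rows adjacent, and randomize
    // the order of samples *within* each model before running SGD over them.
    val prepared = samples
      .repartition(col("model_id"))
      .sortWithinPartitions(col("model_id"), randn(42))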
u =>
  val rs = RowSchema(u, schemaOriginel)
  val col1 = rs.getValueByName("col1")
  val col2 = rs.getValueByName("col2")
  (col1, col2)
}.zipWithIndex
}
I think the SQL equivalent of what I try to do is:

SELECT a,
       ROW_NUMBER() OVER (PARTITION BY a) AS num
FROM table

I don't think I can do this with a GroupedData (the result of df.groupby).
Any ideas on how I can speed this up?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-groupBy-vs-RDD-groupBy-tp22995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
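That SQL maps onto the DataFrame API via a window function, roughly like this
(a sketch, assuming a DataFrame df with a column a; note that row_number() in
Spark requires an ordered window, so an explicit orderBy is added here):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // DataFrame equivalent of ROW_NUMBER() OVER (PARTITION BY a): Spark
    // requires the window to be ordered, so we order by "a" itself.
    val w = Window.partitionBy("a").orderBy("a")
    val numbered = df.withColumn("num", row_number().over(w))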
Hi,

I am doing a groupBy on an EdgeRDD like this:

val groupedEdges = graph.edges.groupBy[VertexId](func0)
while (true) {
  val info = groupedEdges.flatMap(func1).collect
  info.foreach(func2)
}

The groupBy distributes the data to different executors on different nodes
in the cluster. Given a key K (a VertexId), ...
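One note on the loop above: unless groupedEdges is cached, its lineage
(including the groupBy shuffle) may be re-evaluated on every iteration. A
small tweak (func0 is the poster's placeholder):

    // Cache the grouped edges so the groupBy is not recomputed on each
    // pass of the while loop.
    val groupedEdges = graph.edges.groupBy[VertexId](func0).cache()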
"println(vertexId)" doesn't printing anything. I
> have
> > made sure that "f2" doesn't return an empty iterable. I am not sure what
> I
> > am missing here.
> >
> > 2. I am assuming that "f2" is called for each group in para
ty iterable. I am not sure what I
> am missing here.
>
> 2. I am assuming that "f2" is called for each group in parallel. Is this
> correct? If not, what is the correct way to operate on each group in
> parallel?
>
>
> Thanks!
>
>
>
> --
> View
not sure what I
am missing here.
2. I am assuming that "f2" is called for each group in parallel. Is this
correct? If not, what is the correct way to operate on each group in
parallel?
Thanks!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD