Well, DataFrames make it easier to work on only some columns of the data
and to store results in new columns, removing the need to zip everything back
together and thus to preserve order.
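The point above can be sketched without Spark at all: if each row carries all of its fields, a derived value is stored alongside the originals instead of being zipped back by position. A minimal pure-Scala analogy (all names here are illustrative, not from the thread):

```scala
// Each row keeps all its fields, so a derived "cluster" value travels with
// its row -- analogous to df.withColumn("cluster", ...) in Spark DataFrames.
case class Row(ts: Long, value: Double)
case class RowWithCluster(ts: Long, value: Double, cluster: Int)

// A stand-in for a fitted model's predict function (hypothetical).
def predict(v: Double): Int = if (v < 0.5) 0 else 1

val rows = Seq(Row(1L, 0.2), Row(2L, 0.9))

// No order-sensitive zip needed: the timestamp stays attached to its row.
val result = rows.map(r => RowWithCluster(r.ts, r.value, predict(r.value)))
```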
On 2017-09-05 14:04 CEST, mehmet.su...@gmail.com wrote:
Hi Johan,
DataFrames are built on top of RDDs; I'm not sure whether the ordering
issues are different there. Maybe you could create a minimally large
simulated dataset and an example series of transformations to
experiment on.
Best,
-m
Mehmet Süzen, MSc, PhD
Thanks all for your answers. After reading the provided links I am still
uncertain of the details of what I'd need to do to get my calculations right
with RDDs. However, I discovered DataFrames and Pipelines on the "ML" side of
the libraries and I think they'll be better suited to my needs.
Best,
Joh
On 14 September 2017 at 10:42, wrote:
> val noTs = myData.map(dropTimestamp)
> val scaled = scaler.transform(noTs)
> val projected = new RowMatrix(scaled).multiply(principalComponents).rows
> val clusters = myModel.predict(projected)
> val result = myData.zip(clusters)
>
> Do you th
Usually Spark ML models specify the columns they use for training, i.e. you
would select only your feature columns (X) for model training, but metadata
such as target labels or your date column (y) would still be present for each row.
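The "select X, keep the metadata" idea can be sketched in plain Scala (a minimal sketch, no Spark; field names and the predict function are made up for illustration):

```scala
// Each sample carries a timestamp plus its feature fields.
case class Sample(ts: Long, x1: Double, x2: Double)

val data = Seq(Sample(10L, 1.0, 2.0), Sample(20L, 3.0, 4.0))

// "Select X": only the feature columns are handed to the model...
val features = data.map(s => Array(s.x1, s.x2))

// A stand-in for a trained model's predict function (hypothetical).
def predict(f: Array[Double]): Int = if (f.sum < 4.0) 0 else 1

// ...but each prediction is attached to its full row, so the timestamp
// is never lost and nothing needs to be zipped back by position.
val labelled = data.map(s => (s.ts, predict(Array(s.x1, s.x2))))
```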
wrote on Thu, 14 Sep 2017 at 10:42:
In several situations I would like to zip RDDs knowing that their order
matches. In particular I'm using an MLlib KMeansModel on an RDD of Vectors, so I
would like to do:
myData.zip(myModel.predict(myData))
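The semantics behind that zip, and an order-independent alternative, can be illustrated in plain Scala (a hedged sketch with made-up data; in Spark the keyed pattern would use `rdd.zipWithIndex` followed by a join):

```scala
// zip pairs elements strictly by position: correct only if both sides
// have the same length and the same order.
val data        = Seq("a", "b", "c")
val predictions = Seq(0, 1, 0)
val byPosition  = data.zip(predictions)

// A safer pattern: key both sides by an explicit index, then pair by key
// instead of by position, so ordering no longer matters.
val keyedData  = data.zipWithIndex.map(_.swap).toMap
val keyedPreds = predictions.zipWithIndex.map(_.swap).toMap
val byKey = keyedData.keys.toSeq.sorted.map(i => (keyedData(i), keyedPreds(i)))
```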
Also the first column in my RDD is a timestamp which I don’t want to be a part
of the mo
(Sorry Mehmet, I'm only now seeing your first reply with the link to SO; it had
first gone to my spam folder :-/ )
On 2017-09-14 10:02 CEST, GRANDE Johan Ext DTSI/DSI wrote:
Well, if the order cannot be guaranteed in case of a failure (or at all, since
failures can happen transparently), what does it mean to sort an RDD (method
sortBy)?
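One way to read sortBy is that it establishes a deterministic total order from the key function, whatever order the input happened to be in. A pure-Scala illustration of that semantics (made-up data):

```scala
// sortBy defines the output order by the key, regardless of input order.
// In Spark, sortBy similarly re-establishes a well-defined global order
// across partitions at the point where it runs.
val events = Seq((3L, "c"), (1L, "a"), (2L, "b"))
val sorted = events.sortBy(_._1) // order now determined by the timestamp key
```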
On 2017-09-14 03:36 CEST mehmet.su...@gmail.com wrote:
I think it is one of the conceptual differences in Spark compared to
other languages: there is no indexing in plain RDDs. This was the
thread with Ankit:
Yes. So order preservation cannot be guaranteed in the case of
failure. Also I'm not sure if partitions are ordered. Can you get the same
sequence of
I'm wondering why you need order preserved. We've had situations where
keeping the source as an artificial field in the dataset was important, and
I had to run contortions to inject that (in this case the data source had no
unique key).
Is this similar?
On 13 September 2017 at 10:46, Suzen, Mehmet wrote:
But what happens if one of the partitions fails? How does fault tolerance
recover elements in the other partitions?
On 13 Sep 2017 18:39, "Ankit Maloo" wrote:
AFAIK, the order of an RDD is maintained within a partition for map
operations. There is no way a map operation can change the sequence within a
partition, as the partition is local and computation happens one record at a
time.
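That claim can be sketched in plain Scala: map transforms one record at a time and never reorders, so the output order mirrors the input order (in Spark this holds within each partition; made-up data below):

```scala
// A stand-in for the contents of a single partition.
val partition = Seq(1, 2, 3, 4)

// map applies the function record by record; element i of the output
// always comes from element i of the input, so order is preserved.
val mapped = partition.map(_ * 10)
```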
On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" wrote:
I think the order has no meaning in RDDs; see this post, especially the zip methods:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
Hi,
I'm a beginner using Spark with Scala and I'm having trouble understanding
ordering in RDDs. I understand that RDDs are ordered (as they can be sorted)
but that some transformations don't preserve order.
How can I know which transformations preserve order and which don't? Regarding
map, fo