Dear all,
is the following usage of the DataFrame constructor correct, or does it
trigger any side effects that I should be aware of?
My goal is to keep track of my DataFrame's state and to allow custom
transformations accordingly.
val df: DataFrame = ...some dataframe...
val newDf = new DF(df.s
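The snippet above is cut off, so the exact constructor call is unclear. The alternative I am considering, in case calling the DataFrame constructor directly is problematic, is a plain wrapper; this is only a sketch of the idea, and TrackedDF, stage, and transformWith are placeholder names, not anything from Spark:

import org.apache.spark.sql.DataFrame

// Hypothetical wrapper: carries extra state next to the DataFrame instead of
// subclassing it or invoking its constructor directly.
case class TrackedDF(df: DataFrame, stage: String) {

  // Apply a custom transformation and record the new stage name.
  def transformWith(newStage: String)(f: DataFrame => DataFrame): TrackedDF =
    TrackedDF(f(df), newStage)
}

// Usage sketch:
// val tracked = TrackedDF(df, "raw").transformWith("filtered")(_.filter("x > 0"))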
The Spark documentation shows the following example code:
// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 }))
}
I'm sort of missing why "x / 16" would discretize the data into 16 equal bins.
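What I would have expected for 16 integer buckets is something like the following; this is my own guess (assuming the raw feature values lie in [0, 255], and with data being the same RDD[LabeledPoint] as above), not the official snippet, since a plain x / 16 only rescales the values and leaves them continuous:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch only: (x / 16).floor maps values in [0, 255] onto the buckets 0..15,
// whereas x / 16 alone merely rescales them.
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
}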
That was helpful!
The conclusion:
(1) The mappers are not executed in parallel when they independently process
the same RDD.
(2) The best way seems to be (if enough memory is available and an action is
applied to d1 and d2 later on):

val d1 = data.map { case (x, y, z) => (x, y) }.cache()
val d2 = d1
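To make conclusion (2) concrete, here is a small self-contained sketch (names and values are illustrative, not from the original thread) of caching the shared parent once so that both later actions reuse it:

import org.apache.spark.{SparkConf, SparkContext}

// Illustration only: cache the shared parent RDD so the two actions below
// reuse the mapped data instead of recomputing it from `data`.
val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))
val data = sc.parallelize(Seq((1, "a", 0.5), (2, "b", 1.5)))

val d1 = data.map { case (x, y, z) => (x, y) }.cache()
val d2 = d1  // the same cached RDD; later transformations start from the cached blocks

println(d1.count())  // first action computes the map and fills the cache
println(d2.count())  // second action reads from the cache instead of recomputing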
I understand that RDDs are not actually computed until an action is called. Is
it a correct conclusion that it doesn't matter whether ".cache" is used
anywhere in the program if I have only one action that is called only once?
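To make the scenario I have in mind concrete (illustrative names only, not my real program):

// One linear lineage and a single action that runs exactly once.
val mapped = data.map { case (x, y, z) => (x, y) }.cache()  // marks the RDD for caching
mapped.count()  // the only action in the program; nothing ever reads the cached blocks again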
Related to this question, consider this situation:
val d1 = data.map((x,y,z) => (x,y)
Dear Spark team,
I'm using the EC2 script to start up a Spark cluster. If I log in and use the
--executor-memory parameter in the submit script, the UI tells me that no
cores are assigned to the job and nothing happens. Without --executor-memory
everything works fine... Until I get "dag-scheduler-event-