DataFrame constructor

2015-11-23 Thread spark_user_2015
Dear all, is the following usage of the DataFrame constructor correct, or does it trigger any side effects I should be aware of? My goal is to keep track of my dataframe's state and allow custom transformations accordingly. val df: DataFrame = ...some dataframe... val newDf = new DF(df.s
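Spark's DataFrame does not expose a public constructor meant for user code; the usual way to attach custom state is a thin wrapper that delegates to the underlying DataFrame. The sketch below is illustrative only: `StatefulDF` and `transform` are hypothetical names, and a plain `List[Int]` stands in for the DataFrame so the example runs without a Spark cluster.

```scala
// Minimal sketch of the wrapper pattern: carry the data together with a
// history of applied transformations, instead of subclassing DataFrame.
// `StatefulDF` is a hypothetical name; List[A] stands in for a DataFrame.
final case class StatefulDF[A](data: List[A], history: List[String]) {
  // Record each transformation's name alongside its result.
  def transform[B](name: String)(f: List[A] => List[B]): StatefulDF[B] =
    StatefulDF(f(data), history :+ name)
}

val df = StatefulDF(List(1, 2, 3), Nil)
val newDf = df.transform("double")(_.map(_ * 2))
```

With a real DataFrame the wrapper would hold the DataFrame as a field and delegate to its transformations the same way, avoiding any call to Spark-internal constructors.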

Discretization

2015-05-07 Thread spark_user_2015
The Spark documentation shows the following example code: // Discretize data in 16 equal bins since ChiSqSelector requires categorical features val discretizedData = data.map { lp => LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 })) } I don't quite see why "x / 16
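A plausible reading (an assumption here, not confirmed in the thread) is that the doc example expects integer-valued features in [0, 255], so dividing by 16 maps them toward 16 coarse categories; on arbitrary doubles, `x / 16` alone merely rescales. An explicit floor makes the binning unambiguous, as this plain-Scala sketch shows:

```scala
// Hedged sketch: assuming grayscale-style feature values in [0, 255],
// dividing by the bin width 16 and flooring yields 16 integer-coded bins
// (0 through 15). Plain x / 16 without flooring only rescales doubles.
def discretize(x: Double, binWidth: Double = 16.0): Double =
  math.floor(x / binWidth)

val features = Array(0.0, 15.0, 16.0, 255.0)
val bins = features.map(x => discretize(x))  // -> 0, 0, 1, 15
```

The same per-element function could be dropped into the `lp.features.toArray.map { ... }` position of the documentation snippet.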

Re: Caching and Actions

2015-04-09 Thread spark_user_2015
That was helpful! The conclusions: (1) the map transformations are not executed in parallel when they independently process the same RDD; (2) the best approach seems to be (if enough memory is available and an action is later applied to both d1 and d2) val d1 = data.map((x,y,z) => (x,y)).cache val d2 = d1
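The thread's conclusion can be illustrated without Spark. In the sketch below a counter stands in for "recomputation": without a shared cached intermediate, each downstream derivation re-runs the upstream map; computing the shared intermediate once (the role `.cache` plays for a lazy RDD) runs the upstream work a single time. This is an analogy only, since plain Scala Lists are strict where RDDs are lazy.

```scala
// Non-Spark analogy: count how often the upstream map runs.
var upstreamRuns = 0
def expensiveMap(xs: List[Int]): List[Int] = { upstreamRuns += 1; xs.map(_ * 2) }

val data = List(1, 2, 3)

// Uncached: each of the two derivations triggers the upstream work.
val a1 = expensiveMap(data).filter(_ > 2)
val a2 = expensiveMap(data).map(_ + 1)
val runsWithoutCache = upstreamRuns

upstreamRuns = 0
// "Cached": compute the shared intermediate once, like d1.cache in Spark.
val d1 = expensiveMap(data)
val b1 = d1.filter(_ > 2)
val b2 = d1.map(_ + 1)
val runsWithCache = upstreamRuns
```

In Spark terms: with `d1` cached, actions on `d2` and any other RDD derived from `d1` reuse the materialized partitions instead of re-executing `data.map(...)`.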

Caching and Actions

2015-04-07 Thread spark_user_2015
I understand that RDDs are not computed until an action is called. Is it correct to conclude that it doesn't matter whether ".cache" is used anywhere in the program if there is only one action, and it is called only once? Related to this question, consider the following situation: val d1 = data.map((x,y,z) => (x,y)
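The laziness the question describes can be mimicked in plain Scala with an `Iterator`: like RDD transformations, operations on an iterator are deferred until a terminal step (the "action") forces them. With exactly one action executed exactly once, there is no second traversal for a cache to save. This is a non-Spark sketch of that idea:

```scala
// Deferred evaluation, analogous to rdd.map: the counter shows that no
// element is processed until the single terminal "action" runs.
var mapCalls = 0
def lazyPipeline(xs: List[Int]): Iterator[Int] =
  xs.iterator.map { x => mapCalls += 1; x * 2 }  // deferred, like a transformation

val pipeline = lazyPipeline(List(1, 2, 3))
val beforeAction = mapCalls   // nothing computed yet
val result = pipeline.toList  // the single "action" forces evaluation
val afterAction = mapCalls    // every element processed exactly once
```

Caching only pays off when a second action (or a second RDD derived from the cached one) would otherwise force the same upstream work again.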

EC2 spark-submit --executor-memory

2015-04-07 Thread spark_user_2015
Dear Spark team, I'm using the EC2 script to start up a Spark cluster. If I log in and use the --executor-memory parameter in the submit script, the UI tells me that no cores are assigned to the job and nothing happens. Without --executor-memory everything works fine... until I get "dag-scheduler-event-
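One common cause of this symptom (an assumption here, not confirmed in the thread) is that in standalone mode an application requesting more memory per executor than any worker offers gets no executors at all, so the UI shows zero assigned cores. A sketch of a submit command that keeps the request within the per-worker "Memory" value shown on the master UI; the 6g value and host names are illustrative only:

```shell
# Illustrative only: keep --executor-memory at or below the per-worker
# memory reported on the standalone master UI (http://<master>:8080).
./bin/spark-submit \
  --master spark://<master>:7077 \
  --executor-memory 6g \
  my_app.jar
```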