Re: Recommended Scala version

2015-05-31 Thread Tathagata Das
Can you file a JIRA with the detailed steps to reproduce the problem?

On Fri, May 29, 2015 at 2:59 AM, Alex Nakos wrote:
> Hi-
>
> I’ve just built the latest spark RC from source (1.4.0 RC3) and can
> confirm that the spark shell is still NOT working properly on 2.11. No
> classes in the jar I'v…

Re: Recommended Scala version

2015-05-31 Thread Alex Nakos
Hi-

Yup, I’ve already done so here: https://issues.apache.org/jira/browse/SPARK-7944

Please let me know if this requires any more information - more than happy to provide whatever I can.

Thanks
Alex

On Sun, May 31, 2015 at 8:45 AM, Tathagata Das wrote:
> Can you file a JIRA with the detailed…

RDD staleness

2015-05-31 Thread Ashish Mukherjee
Hello,

Since RDDs are created from data in Hive tables or HDFS, how do we ensure they are invalidated when the source data is updated?

Regards,
Ashish

Re: RDD staleness

2015-05-31 Thread DW @ Gmail
There is no mechanism for keeping an RDD up to date with a changing source. However, you could set up a stream that watches for changes to the directory and processes the new files, or use the Hive integration in Spark SQL to run Hive queries directly. (However, old query results will still grow sta…
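A minimal sketch of the first suggestion, assuming Spark Streaming 1.x in Scala; the directory path and batch interval are made up for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DirectoryWatcher")
    // Poll the directory every 30 seconds (hypothetical interval).
    val ssc = new StreamingContext(conf, Seconds(30))

    // textFileStream only picks up files newly created in the directory;
    // files already present when the stream starts are not reprocessed.
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    lines.foreachRDD { rdd =>
      // Process only the newly arrived data here.
      println(s"new records: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()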

data localisation in spark

2015-05-31 Thread Shushant Arora
I want to understand how Spark takes care of data localisation in cluster mode when run on YARN.

1. The driver program asks the ResourceManager for executors. Does it tell YARN's RM to check the HDFS blocks of the input data and then allocate executors next to them? And do executors remain fixed throughout the application, or d…

union and reduceByKey wrong shuffle?

2015-05-31 Thread igor.berman
I've encountered a very strange problem: after doing a union of 2 RDDs, reduceByKey works wrong (unless I'm missing something very basic) and passes two objects with different keys to the reduce function! I've rewritten the Java class in Scala to test it in spark-shell and I see the same problem. I have Sin…
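For reference, a minimal spark-shell sketch of the pattern being described; the case class, keys, and values are invented for illustration:

    case class Event(id: String, count: Long)

    val a = sc.parallelize(Seq(("k1", Event("k1", 1L)), ("k2", Event("k2", 1L))))
    val b = sc.parallelize(Seq(("k1", Event("k1", 2L)), ("k3", Event("k3", 5L))))

    // reduceByKey should only ever combine values that share the same key;
    // seeing mixed keys inside the reduce function would indicate a
    // serialization problem rather than a logic problem.
    val merged = a.union(b).reduceByKey((x, y) => Event(x.id, x.count + y.count))
    merged.collect().foreach(println)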

Re: import CSV file using read.csv

2015-05-31 Thread Akhil Das
If it is Spark related, then something like this?

csv = sc.textFile("hdfs:///stats/test.csv").map(myFunc)

And create a myFunc in which you convert the String to a CSV record and do whatever you want with it.

Thanks
Best Regards

On Sun, May 31, 2015 at 2:50 AM, sherine ahmed wrote:…
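A sketch of what such a myFunc could look like in Scala; the record type and the naive split-on-comma parser are assumptions (real CSV with quoted fields needs a proper parser):

    // Hypothetical record type for illustration only.
    case class Stat(name: String, value: Double)

    def myFunc(line: String): Stat = {
      // Naive parsing: breaks on fields that contain embedded commas.
      val cols = line.split(",").map(_.trim)
      Stat(cols(0), cols(1).toDouble)
    }

    val csv = sc.textFile("hdfs:///stats/test.csv").map(myFunc)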

Re: Adding an indexed column

2015-05-31 Thread Ricardo Almeida
That's great. And how would you create an ordered index by partition (by product in this example)?

Assuming now a dataframe like:

flag | product | price
-----|---------|-----------------
 1   | a       | 47.808764653746
 1   | b       | 47.808764653746
 1   | a       | 31.9869279512204
 1   | b       | 47.790789…

Re: union and reduceByKey wrong shuffle?

2015-05-31 Thread igor.berman
After investigation, the problem is somehow connected to Avro serialization with Kryo + chill-avro (mapping the Avro object to a simple Scala case class and running the reduce on these case class objects solves the problem).

Re: union and reduceByKey wrong shuffle?

2015-05-31 Thread Josh Rosen
Which Spark version are you using? I'd like to understand whether this could be caused by the recent Kryo serializer re-use changes in master / Spark 1.4.

On Sun, May 31, 2015 at 11:31 AM, igor.berman wrote:
> after investigation the problem is somehow connected to avro serialization
> with…

Re: union and reduceByKey wrong shuffle?

2015-05-31 Thread Igor Berman
Hi,

We are using Spark 1.3.1 and avro-chill (tomorrow I will check if it's important); we register the Avro classes from Java. Avro 1.7.6.

On May 31, 2015 22:37, "Josh Rosen" wrote:
> Which Spark version are you using? I'd like to understand whether this
> change could be caused by recent Kryo serializer re-us…
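For context, a sketch of how classes are registered with Kryo in Spark 1.2+; MyRecord is a hypothetical stand-in for the generated Avro class, and the chill-avro wiring discussed in this thread is not shown:

    import org.apache.spark.SparkConf

    // Hypothetical stand-in for the Avro-generated class.
    case class MyRecord(id: String, value: Long)

    val conf = new SparkConf()
      .setAppName("kryo-registration")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))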

Re: data localisation in spark

2015-05-31 Thread Sandy Ryza
Hi Shushant,

Spark currently makes no effort to request executors based on data locality (although it does try to schedule tasks within executors based on data locality). We're working on adding this capability at SPARK-4352.

-Sandy

On Sun, May…

Re: RDD staleness

2015-05-31 Thread Michael Armbrust
Each time you run a Spark SQL query, we create new RDDs that load the data, so you should see the newest results. There is one caveat: formats that use the native Data Source API (Parquet, ORC (in Spark 1.4), JSON (in Spark 1.5)) cache file metadata to speed up interactive querying. To cl…
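A sketch of how that metadata cache is typically cleared, assuming a HiveContext in Spark 1.3/1.4; the table name is hypothetical, and the truncated message above presumably continues with the exact recommendation:

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    // Drops the cached file metadata so the next query re-reads the source.
    sqlContext.refreshTable("my_table")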

Re: Adding an indexed column

2015-05-31 Thread ayan guha
If you are on Spark 1.3, use repartitionAndSortWithinPartitions followed by mapPartitions. In 1.4, window functions will be supported, it seems.

On 1 Jun 2015 04:10, "Ricardo Almeida" wrote:
> That's great and how would you create an ordered index by partition (by
> product in this example)?
>
> Assuming now a dat…
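A sketch of that suggestion under stated assumptions: the input is a pair RDD of (product, price) mirroring the example above, and every name here is illustrative:

    import org.apache.spark.Partitioner

    val rdd = sc.parallelize(Seq(
      ("a", 47.808764653746), ("b", 47.808764653746),
      ("a", 31.9869279512204), ("b", 47.790789)))

    // Partition on the product alone, so that sorting on the full
    // (product, price) key leaves each product's rows contiguous and
    // ordered by price within a single partition.
    class ProductPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = key match {
        case (product: String, _) => math.abs(product.hashCode) % numPartitions
      }
    }

    val indexed = rdd
      .map { case (product, price) => ((product, price), ()) }
      .repartitionAndSortWithinPartitions(new ProductPartitioner(4))
      .mapPartitions { iter =>
        // Number each product's rows as they stream by in sorted order.
        var current: String = null
        var i = 0L
        iter.map { case ((product, price), _) =>
          if (product != current) { current = product; i = 0L }
          i += 1
          (product, price, i)
        }
      }
    indexed.collect().foreach(println)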

Re: Windowed Operations

2015-05-31 Thread DMiner
I also met the same issue. Any updates on this?

Create dataframe from saved objectfile RDD

2015-05-31 Thread bipin
Hi,

What is the method to create a DataFrame from an RDD which is saved as an objectfile? I don't have a Java object, but a StructType I want to use as the schema for the DataFrame. How to load the objectfile without the object? I tried retrieving as Row:

val myrdd = sc.objectFile[org.apache.spark.sql.Row]("/home/bi…
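A sketch of the usual route in Spark 1.3+, assuming the objectfile really does hold serialized Rows; the field names and path are hypothetical:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    // Hypothetical schema standing in for the StructType mentioned above.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("value", DoubleType, nullable = true)))

    val rows = sc.objectFile[Row]("/path/to/objectfile")  // hypothetical path
    val df = sqlContext.createDataFrame(rows, schema)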