How to get the most recent value in a Spark DataFrame

2016-12-18 Thread milinkorath
I have a Spark DataFrame with the following structure:

    id  flag  price  date
    a   0     100    2015
    a   0     50     2015
    a   1     200    2014
    a   1     300    2013
    a   0     400    2012

I need to create a DataFrame where the most recent flag-1 value is filled into the flag-0 rows.
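
For reference, the sample data can be rebuilt in a few lines of Scala (a minimal sketch assuming a Spark 2.x session named spark):

    import spark.implicits._

    // Sample data from the question: id, flag, price, date
    val df = Seq(
      ("a", 0, 100, 2015),
      ("a", 0, 50,  2015),
      ("a", 1, 200, 2014),
      ("a", 1, 300, 2013),
      ("a", 0, 400, 2012)
    ).toDF("id", "flag", "price", "date")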

Re: How to get the most recent value in a Spark DataFrame

2016-12-18 Thread Milin korath
Thanks, I tried a left outer join. My dataset has around 400M records and a lot of shuffling is happening. Is there any other workaround apart from a join? I tried using a window function but could not get a proper solution. Thanks. On Sat, Dec 17, 2016 at 4:55 AM, Michael Armbrust wrote: > Oh a

Question about Spark and filesystems

2016-12-18 Thread joakim
Hello, we are trying out Spark for some file processing tasks. Since each Spark worker node needs to access the same files, we have tried using HDFS. This worked, but there were some oddities that made me a bit uneasy. For dependency-hell reasons I compiled a modified Spark, and this version exhibit

[Spark SQL] Task failed while writing rows

2016-12-18 Thread Joseph Naegele
Hi all, I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). Its current compressed size is around 13 GB, but my problem started when it was much smaller, maybe 5 GB. This dataset is generated by

Re: How to get the most recent value in a Spark DataFrame

2016-12-18 Thread Richard Xin
I am not sure I understood your logic, but it seems to me that you could take a look at Hive's lead/lag functions. On Monday, December 19, 2016 1:41 AM, Milin korath wrote: Thanks, I tried a left outer join. My dataset has around 400M records and a lot of shuffling is happening. Is
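
One way to apply that idea with Spark's own window functions (a sketch, not from the thread; assumes Spark 2.0+, the df built above, and that "recent" means the flag-1 row with the greatest date):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, last, when}

    // Full-partition frame per id, ordered by date ascending.
    val w = Window.partitionBy("id")
      .orderBy(col("date"))
      .rowsBetween(Long.MinValue, Long.MaxValue)

    // Carry the most recent flag-1 price to every row of the id,
    // then overwrite the price only on flag-0 rows.
    val updated = df
      .withColumn("recentPrice",
        last(when(col("flag") === 1, col("price")), ignoreNulls = true).over(w))
      .withColumn("price",
        when(col("flag") === 0, col("recentPrice")).otherwise(col("price")))

Unlike the join, this shuffles each id's rows only once for the window.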

Re: Can a Spark Hive UDF read broadcast variables?

2016-12-18 Thread Takeshi Yamamuro
Hi, no, you can't. If you use a Scala UDF, you can, like this:

    val bv = sc.broadcast(100)
    val testUdf = udf { (i: Long) => i + bv.value }
    spark.range(10).select(testUdf('id)).show

// maropu On Sun, Dec 18, 2016 at 12:24 AM, 李斌松 wrote: > Can a Spark Hive UDF read broadcast variables? >
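
Takeshi's snippet, made self-contained (a sketch assuming Spark 2.0+; the session setup is added for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder()
      .appName("broadcast-udf-demo")   // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // The broadcast value is shipped once per executor and captured
    // by the UDF's closure, so every task reads the same copy.
    val bv = spark.sparkContext.broadcast(100)
    val testUdf = udf { (i: Long) => i + bv.value }

    spark.range(10).select(testUdf('id)).show()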

GraphFrame does not initialize vertices when loading edges

2016-12-18 Thread zjp_j...@163.com
Hi, I found that GraphFrame does not initialize vertices by default when creating edges. Is there any plan for this, or a better way? Thanks.

    val e = sqlContext.createDataFrame(List(
      ("a", "b", "friend"),
      ("b", "c", "follow"),
      ("c", "b", "follow"),
      ("f", "c", "follow"),
      ("e", "f", "follow"),
      ("e", "d", "friend

Re: GraphFrame does not initialize vertices when loading edges

2016-12-18 Thread Felix Cheung
Can you clarify? Vertices should be another DataFrame as you can see in the example here: https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md From: zjp_j...@163.com Sent: Sunday, December 18, 2016 6:25:50 PM To: user Subject: GraphFrame n
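
The quick-start pattern, in brief (a sketch; the vertex names are illustrative, and graphframes must be on the classpath):

    import org.graphframes.GraphFrame

    // Vertices are their own DataFrame with an "id" column ...
    val v = sqlContext.createDataFrame(List(
      ("a", "Alice"), ("b", "Bob"), ("c", "Charlie"),
      ("d", "David"), ("e", "Esther"), ("f", "Fanny")
    )).toDF("id", "name")

    // ... and edges reference it through "src" and "dst".
    val e = sqlContext.createDataFrame(List(
      ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
      ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend")
    )).toDF("src", "dst", "relationship")

    val g = GraphFrame(v, e)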

Re: GraphFrame does not initialize vertices when loading edges

2016-12-18 Thread Felix Cheung
Or this is a better link: http://graphframes.github.io/quick-start.html From: Felix Cheung <felixcheun...@hotmail.com> Sent: Sunday, December 18, 2016 8:46 PM Subject: Re: GraphFrame does not initialize vertices when loading edges To: zjp_j...@163.com, user

Re: GraphFrame does not initialize vertices when loading edges

2016-12-18 Thread Felix Cheung
There is no GraphLoader for GraphFrames, but you could load with GraphX and convert: http://graphframes.github.io/user-guide.html#graphx-to-graphframe From: zjp_j...@163.com Sent: Sunday, December 18, 2016 9:39:49 PM To: Felix Cheung; user Subject: Re: Re: Gr
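
That route might look like this (a sketch; the edge-list path is hypothetical, and GraphFrame.fromGraphX is the conversion described in the user guide):

    import org.apache.spark.graphx.GraphLoader
    import org.graphframes.GraphFrame

    // GraphLoader derives the vertices from the edge list automatically.
    val gx = GraphLoader.edgeListFile(sc, "data/edges.txt")
    val gf = GraphFrame.fromGraphX(gx)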

Re: Re: GraphFrame does not initialize vertices when loading edges

2016-12-18 Thread zjp_j...@163.com
I'm sorry, I didn't express it clearly. I am referring to the following bold, underlined text, cited from http://spark.apache.org/docs/latest/graphx-programming-guide.html: "GraphLoader.edgeListFile provides a way to load a graph from a list of edges on disk. It parses an adjacency list of (source v
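
To get GraphLoader-style behavior directly on DataFrames, the vertices can be derived from the edge endpoints (a sketch, not an official API; assumes Spark 2.0+ and an edge DataFrame e with "src" and "dst" columns):

    import org.apache.spark.sql.functions.col
    import org.graphframes.GraphFrame

    // Build the vertex DataFrame from the distinct edge endpoints,
    // mimicking what GraphLoader.edgeListFile does in GraphX.
    val v = e.select(col("src").as("id"))
      .union(e.select(col("dst").as("id")))
      .distinct()

    val g = GraphFrame(v, e)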

RE: PowerIterationClustering Benchmark

2016-12-18 Thread Mostafa Alaa Mohamed
Hi All, I have the same issue with one compressed .tgz file of around 3 GB. Increasing the number of nodes has no effect on performance. Best Regards, Mostafa Alaa Mohamed, Technical Expert Big Data, M: +971506450787 Email: mohamedamost...@etisalat.ae From: Lydi
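
The thread does not confirm the cause, but a single gzip archive (.tgz included) is not splittable, so Spark reads it as one partition and extra nodes stay idle. A sketch of the usual workaround, after unpacking the tar archive (path and partition count are illustrative):

    // One gzipped file arrives as one partition; spread it explicitly.
    val lines = spark.read.textFile("data/input.txt").repartition(200)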

Re: Question about Spark and filesystems

2016-12-18 Thread vincent gromakowski
I am using Gluster and I have decent performance with basic maintenance effort. An advantage of Gluster is that you can plug Alluxio on top to improve performance, but I still need to validate this... On 18 Dec 2016 8:50 PM, wrote: > Hello, > > We are trying out Spark for some file processing tasks. > > Since e
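
For a shared filesystem like Gluster mounted at the same path on every node, Spark can read it directly (a sketch; the mount point is hypothetical):

    // Every executor must see the same mount at this path.
    val logs = spark.read.textFile("file:///mnt/gluster/data/input.txt")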