You have 2 parts to it 1. Do a sub query where for each primary key derive latest value of flag=1 records. Ensure you get exactly 1 record per primary key value. Here you can use rank() over (partition by primary key order by year desc)
2. Join your original dataset with the above on primary key. If year is higher than latest flag=1 record then take it else mark it null If you can have primary keys which may have no flag=1 records then they wouldn't show up in set 1 above. So if you still want them in result then adjust 1 accordingly. Best Ayan On Mon., 19 Dec. 2016 at 1:01 pm, Richard Xin <richardxin...@yahoo.com.invalid> wrote: > I am not sure I understood your logic, but it seems to me that you could > take a look of Hive's Lead/Lag functions. > > > On Monday, December 19, 2016 1:41 AM, Milin korath < > milin.kor...@impelsys.com> wrote: > > > thanks, I tried with left outer join. My dataset having around 400M > records and lot of shuffling is happening.Is there any other workaround > apart from Join,I tried use window function but I am not getting a proper > solution, > > > Thanks > > On Sat, Dec 17, 2016 at 4:55 AM, Michael Armbrust <mich...@databricks.com> > wrote: > > Oh and to get the null for missing years, you'd need to do an outer join > with a table containing all of the years you are interested in. > > On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust <mich...@databricks.com> > wrote: > > Are you looking for argmax? Here is an example > <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3170497669323442/2840265927289860/latest.html> > . > > On Wed, Dec 14, 2016 at 8:49 PM, Milin korath <milin.kor...@impelsys.com> > wrote: > > Hi > > I have a spark data frame with following structure > > id flag price date > > a 0 100 2015 > > a 0 50 2015 > > a 1 200 2014 > > a 1 300 2013 > > a 0 400 2012 > > I need to create a data frame with recent value of flag 1 and updated in > the flag 0 rows. > > id flag price date new_column > > a 0 100 2015 200 > > a 0 50 2015 200 > > a 1 200 2014 null > > a 1 300 2013 null > > a 0 400 2012 null > > We have 2 rows having flag=0. Consider the first row(flag=0),I will have 2 > values(200 and 300) and I am taking the recent one 200(2014). And the last > row I don't have any recent value for flag 1 so it is updated with null. > Looking for a solution using scala. Any help would be appreciated.Thanks > > Thanks > Milin > > > > > > > > > > > >