Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
Abdeali, Jason: the Spark job was submitted with num-executors 8, num-cores 8, driver-memory 14g, and executor-memory 14g. The total data processed was about 5 GB, with 100+ aggregations and 50+ different joins at various DataFrame levels, so it is really hard to give a specific number of partitions.
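For reference, a minimal sketch of that configuration expressed in code (the app name is a placeholder; on a real cluster these values are normally passed as spark-submit flags, and driver memory in particular only takes effect when set before the driver JVM starts):

from pyspark.sql import SparkSession

# A sketch of the configuration described above. On YARN these are
# usually spark-submit flags; spark.driver.memory has no effect when
# set here, because the driver JVM is already running.
spark = (
    SparkSession.builder
    .appName("invoice-dedup")  # placeholder name
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "14g")
    .getOrCreate()
)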

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
My thinking is that if you run everything in one partition (say, 12 GB), then you don't experience the partitioning problem: one partition will have all the duplicates. If that's not the case, there are other options, but they would probably require a design change.
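A minimal sketch of that single-partition idea, given a DataFrame df (the invoice_id and update_time column names are assumptions taken from later messages; note that even in one partition, which row dropDuplicates keeps is an implementation detail, not a documented guarantee):

# Sort, then collapse to one partition so all duplicates of an
# invoice_id sit together in sorted order before deduplicating.
# Only viable if the whole dataset fits in a single task's memory.
deduped = (
    df.orderBy("invoice_id", "update_time")
      .coalesce(1)
      .dropDuplicates(["invoice_id"])
)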

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
How much memory do you have per partition?

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
I will get the information and will share it with you.

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Abdeali Kothari
How long does it take to do the window solution? (Also mention how many executors your Spark application was using on average during that time.) I am not aware of anything that is faster. When I ran it on my data (~8-9 GB), I think it took less than 5 minutes (I don't remember the exact time).

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
Thanks for the awesome clarification/explanation. I have cases where update_time can be the same. I am in need of suggestions: with very large data, around 5 GB, the window-based solution I mentioned is taking a very long time. Thanks again.
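For context, the window-based deduplication under discussion presumably looks something like the following sketch (the thread does not show the exact code, so the invoice_id/update_time schema is assumed from later messages):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each invoice_id by update_time and keep the earliest.
# The shuffle that partitionBy triggers is what makes this slow on
# large, heavily joined inputs.
w = Window.partitionBy("invoice_id").orderBy(F.col("update_time").asc())

deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)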

Re: dropDuplicate on timestamp based column unexpected output

2019-04-03 Thread Abdeali Kothari
So, the above code using min() worked fine for me in general, but there was one corner case where it failed, which was when I have something like:

invoice_id=1, update_time=*2018-01-01 15:00:00.000*
invoice_id=1, update_time=*2018-01-01 15:00:00.000*
invoice_id=1, update_time=2018-02-03 14:00:00.000
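The min()-based code being referred to is not quoted in the thread, but from the description it is presumably along these lines. The three rows above show the failure: two records tie for the minimum update_time, so the join keeps both, which is what the final dropDuplicates is for:

from pyspark.sql import functions as F

# Earliest update_time per invoice.
min_times = df.groupBy("invoice_id").agg(
    F.min("update_time").alias("update_time")
)

# Joining back keeps BOTH rows when two records tie on the minimum
# update_time (the corner case above); dropDuplicates then collapses
# the tie to a single row.
earliest = (
    df.join(min_times, on=["invoice_id", "update_time"])
      .dropDuplicates(["invoice_id", "update_time"])
)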

Re: dropDuplicate on timestamp based column unexpected output

2019-04-03 Thread Chetan Khatri
Hello Abdeali, thank you for your response. Can you please explain this line to me: "And the dropDuplicates at the end ensures records with two values for the same 'update_time' don't cause issues." Sorry, I didn't get it quickly. :)

Re: dropDuplicate on timestamp based column unexpected output

2019-04-03 Thread Abdeali Kothari
I've faced this issue too, and a colleague pointed me to the documentation: https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
The dropDuplicates docs do not say that it guarantees returning the "first" record (even if you sort your data beforehand).
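A small illustration of the trap, using hypothetical data: the orderBy below does not carry through the shuffle that dropDuplicates performs, so which row survives is arbitrary:

df = spark.createDataFrame(
    [(1, "2018-01-01 15:00:00"), (1, "2018-02-03 14:00:00")],
    ["invoice_id", "update_time"],
)

# Looks as if it keeps the earliest row per invoice_id, but Spark gives
# no such guarantee: dropDuplicates may keep an arbitrary row.
risky = df.orderBy("update_time").dropDuplicates(["invoice_id"])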