Abdeali, Jason:
while submitting the Spark job I used num-executors 8, num-cores 8, driver-memory 14g
and executor-memory 14g. The total data processed was about 5 GB,
with 100+ aggregations and 50+ different joins at various DataFrame levels.
So it is really hard to give a specific number of partitions.
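For reference, those settings map roughly onto the SparkSession config below. This is only a sketch, not the actual submit command used for this job; the app name is invented, and driver memory normally has to be passed to spark-submit itself rather than set in code.

from pyspark.sql import SparkSession

# Rough sketch of the resources quoted above as SparkSession config.
spark = (
    SparkSession.builder
    .appName("invoice-aggregation")               # hypothetical app name
    .config("spark.executor.instances", "8")      # num-executors 8
    .config("spark.executor.cores", "8")          # num-cores 8
    .config("spark.executor.memory", "14g")       # executor-memory 14g
    .config("spark.driver.memory", "14g")         # driver-memory 14g (usually set at submit time)
    .getOrCreate()
)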
My thinking is that if you run everything in one partition (say 12 GB),
then you don't experience the partitioning problem: one partition will
have all the duplicates.
If that's not the case, there are other options, but they would probably
require a design change.
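If you want to try that, here is a minimal sketch of forcing a single partition; the DataFrame and column names are stand-ins based on the example further down in this thread.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real invoice data.
df = spark.createDataFrame(
    [(1, "2018-01-01 15:00:00.000"),
     (1, "2018-02-03 14:00:00.000")],
    ["invoice_id", "update_time"],
)

# Collapse everything into a single partition so all duplicates land together.
# coalesce(1) avoids a full shuffle; repartition(1) forces one.
single = df.coalesce(1)
print(single.rdd.getNumPartitions())   # 1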
On Thu, Apr 4, 2019 at 8:46 AM Jason N
How much memory do you have per partition?
On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri
wrote:
> I will get the information and will share it with you.
>
> On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari
> wrote:
>
>> How long does it take to do the window solution? (Also mention how many
>> executors your Spark application was using on average during that time.)
I will get the information and will share it with you.
On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari
wrote:
> How long does it take to do the window solution? (Also mention how many
> executors your Spark application was using on average during that time.)
> I am not aware of anything that is faster.
How long does it take to do the window solution? (Also mention how many
executors your Spark application was using on average during that time.)
I am not aware of anything that is faster. When I ran it on my data, ~8-9 GB,
I think it took less than 5 mins (don't remember the exact time).
On Thu, Apr 4, 20
Thanks for the awesome clarification / explanation.
I have cases where update_time can be the same.
I am in need of suggestions: with very large data, around 5 GB, the
window-based solution I mentioned is taking a very long time.
Thanks again.
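For reference, here is a minimal sketch of the kind of window-based dedup being discussed, reconstructed from the column names in the example further down; it is not the exact code from this job. Ordering ascending keeps the earliest record per invoice; order by update_time descending to keep the latest.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2018-01-01 15:00:00.000"),
     (1, "2018-01-01 15:00:00.000"),
     (1, "2018-02-03 14:00:00.000")],
    ["invoice_id", "update_time"],
)

# Rank the rows within each invoice by update_time and keep only the first.
# row_number() always yields exactly one row per invoice, even when two rows
# share the same update_time (the tie is broken arbitrarily).
w = Window.partitionBy("invoice_id").orderBy("update_time")
deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
deduped.show(truncate=False)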
On Thu, Apr 4, 2019 at 12:11 PM Abdeali Kothari
wrote:
So, the above code for min() worked for me fine in general, but there was
one corner case where it failed.
That was when I had something like:
invoice_id=1, update_time=*2018-01-01 15:00:00.000*
invoice_id=1, update_time=*2018-01-01 15:00:00.000*
invoice_id=1, update_time=2018-02-03 14:00:00.000
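A minimal reproduction of that corner case, assuming the min() code under discussion was roughly a min-over-window filter (the exact code is not shown in this thread, so this is a reconstruction):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# The corner case above: two rows share the minimum update_time.
df = spark.createDataFrame(
    [(1, "2018-01-01 15:00:00.000"),
     (1, "2018-01-01 15:00:00.000"),
     (1, "2018-02-03 14:00:00.000")],
    ["invoice_id", "update_time"],
)

# Keep rows whose update_time equals the per-invoice minimum.
w = Window.partitionBy("invoice_id")
kept = (
    df.withColumn("min_time", F.min("update_time").over(w))
      .filter(F.col("update_time") == F.col("min_time"))
      .drop("min_time")
)
kept.show(truncate=False)
# Both 15:00:00.000 rows satisfy the predicate, so invoice_id=1 still has
# two rows, which is the failure described above.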
Hello Abdeali, thank you for your response.
Can you please explain this line to me: "And the dropDuplicates at the end
ensures records with two values for the same 'update_time' don't cause
issues."
Sorry, I didn't get it quickly. :)
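That line can be illustrated with a tiny standalone example: after the min() filter, invoice_id=1 is still left with two identical rows, and the trailing dropDuplicates collapses them to one. This is a sketch assuming the same columns as the corner case above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# What the min() filter leaves behind for invoice_id=1 in the corner case:
# two rows with exactly the same update_time.
leftover = spark.createDataFrame(
    [(1, "2018-01-01 15:00:00.000"),
     (1, "2018-01-01 15:00:00.000")],
    ["invoice_id", "update_time"],
)

# dropDuplicates() with no arguments compares whole rows, so the exact
# duplicate is removed; dropDuplicates(["invoice_id"]) would also work here.
leftover.dropDuplicates().show(truncate=False)   # one row remains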
On Thu, Apr 4, 2019 at 10:41 AM Abdeali Kothari
wrote:
> I've faced this issue too
I've faced this issue too - and a colleague pointed me to the documentation
-
https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
The dropDuplicates docs do not say that it guarantees returning the
"first" record (even if you sort your data).