RE: [External] Re: Sorting in Spark on multiple partitions

2018-06-06 Thread Sing, Jasbir
jornfra...@gmail.com] Sent: Monday, June 4, 2018 10:59 PM To: Jain, Neha T. <neha.t.j...@accenture.com> Cc: user@spark.apache.org; Patel, Payal <payal.pa...@accenture.com>; Sing, Jasbir <jasbir.s...@accenture.com> Subject: Re: [E

Re: [External] Re: Sorting in Spark on multiple partitions

2018-06-04 Thread Jörn Franke
I think there is also a misunderstanding of how repartition works. It keeps the existing number of partitions, but hash-partitions the rows by userid. This means that each partition is likely to contain several different user ids. That would also explain your observed behavior. However, without having the full
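
To illustrate the point about hash partitioning, here is a minimal plain-Python sketch (not Spark code). It assumes integer user ids and uses `uid % num_partitions` as a stand-in for the hash function; Spark's actual hash differs, and the function name `hash_partition` and the sample ids are hypothetical. It only shows that all rows of one user land in the same partition while one partition can still hold several users.

```python
from collections import defaultdict


def hash_partition(user_ids, num_partitions):
    """Group user ids by a toy hash: uid modulo num_partitions.

    This is a simplified stand-in for hash partitioning, not Spark's
    real hash function.
    """
    partitions = defaultdict(set)
    for uid in user_ids:
        partitions[uid % num_partitions].add(uid)
    return dict(partitions)


# Hypothetical sample data: six users spread over four partitions.
user_ids = [101, 102, 103, 104, 105, 106]
parts = hash_partition(user_ids, num_partitions=4)

for pid in sorted(parts):
    print(pid, sorted(parts[pid]))
```

With more distinct users than partitions, at least one partition has to hold more than one user id, so sorting within a partition interleaves users unless userid is part of the sort key.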

Re: [External] Re: Sorting in Spark on multiple partitions

2018-06-04 Thread Jörn Franke
How do you load the data? How do you write it? I fear that without the full source code it will be difficult to troubleshoot the issue. Which Spark version are you using? The use case is not yet 100% clear to me. Do you want to set the row with the oldest/newest date to true? I would just use top or something similar when
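
If the use case is indeed to flag the row with the newest date per user, the logic can be sketched in plain Python as below. This is only an illustration under assumptions: the field names `userid`, `date`, and `is_newest`, the helper `flag_newest`, and the sample rows are all hypothetical, and it is not the poster's code nor a Spark job.

```python
from datetime import date

# Hypothetical sample rows; the real schema is not shown in the thread.
rows = [
    {"userid": 1, "date": date(2018, 6, 1)},
    {"userid": 1, "date": date(2018, 6, 3)},
    {"userid": 2, "date": date(2018, 5, 30)},
    {"userid": 2, "date": date(2018, 5, 28)},
]


def flag_newest(rows):
    """Return copies of the rows with is_newest=True on each user's latest date."""
    # First pass: find the newest date per user.
    newest = {}
    for row in rows:
        uid = row["userid"]
        if uid not in newest or row["date"] > newest[uid]:
            newest[uid] = row["date"]
    # Second pass: flag rows whose date equals that user's newest date.
    return [dict(row, is_newest=(row["date"] == newest[row["userid"]])) for row in rows]


flagged = flag_newest(rows)
for row in flagged:
    print(row["userid"], row["date"], row["is_newest"])
```

Note that this approach needs no global sort, only a per-user maximum. If two rows of the same user share the newest date, both are flagged; a tie-breaking column would be needed to pick exactly one.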