Hello Dear Spark Users,
I am using dropDuplicates on a DataFrame generated from a large Parquet file
on HDFS. The deduplication is based on a timestamp column, and every time I
run it, it drops different rows for the same timestamp.
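To make the expected result concrete, here is a small plain-Scala sketch (no Spark involved) of the behavior I am after: for each invoice_id, keep only the row with the latest update_time. The Invoice case class, its amount field, and the sample rows are made up purely for illustration and are not my real schema.

```scala
// Plain-Scala illustration only (no Spark); Invoice and the sample rows are hypothetical.
final case class Invoice(invoiceId: String, updateTime: Long, amount: Double)

object LatestPerInvoice {
  // For each invoiceId, keep the single row with the greatest updateTime,
  // then sort by invoiceId so the result order is stable across runs.
  def latest(rows: Seq[Invoice]): Seq[Invoice] =
    rows.groupBy(_.invoiceId)
      .values
      .map(_.maxBy(_.updateTime))
      .toSeq
      .sortBy(_.invoiceId)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Invoice("A", 1L, 10.0),
      Invoice("A", 3L, 12.0),
      Invoice("B", 2L, 7.5),
      Invoice("A", 2L, 11.0)
    )
    // Print the surviving row for each invoice.
    latest(rows).foreach(println)
  }
}
```

The same input should always give the same surviving row per invoice, which is what I am not getting from dropDuplicates.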
What I tried, and what worked:
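import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// Rank rows within each invoice_id, newest update_time first, and keep rank 1.
val wSpec = Window.partitionBy($"invoice_id").orderBy($"update_time".desc)
val irqDistinctDF = irqFilteredDF.withColumn("rn", row_number().over(wSpec))
  .where($"rn" === 1).drop("rn").drop("update_time")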
But this is damn slow...
Can someone please shed some light on this?
Thanks