Hi,

I have a 10 GB CSV file (with a header row) that needs to be loaded into a 
Spark DataFrame. We have been using rdd.zipWithIndex to pick out the column 
names and then converting the rows to Avro accordingly.
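
For context, the current logic is roughly along these lines (a simplified 
sketch: the path, delimiter, and column handling are placeholders, not the 
exact production code):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

  // Read the raw file as an RDD of lines
  val lines = spark.sparkContext.textFile("/path/to/input.csv")

  // Pair each line with its index so the header (index 0) can be found
  val indexed = lines.zipWithIndex()

  // The first line carries the column names
  val header = indexed.filter { case (_, idx) => idx == 0 }
    .map(_._1).first().split(",")

  // Everything after the header is data
  val data = indexed.filter { case (_, idx) => idx > 0 }.map(_._1)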

I suspect this is why it is taking such a long time: only one executor runs, 
and the job never achieves parallelism. Is there an easy way to achieve 
parallelism after filtering out the header?

I am also interested in a solution that removes the header from the file so 
that I can supply my own schema. That way I could split the file.
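
Concretely, I am imagining something like the sketch below (the schema and 
paths are made up for illustration, and writing Avro assumes the spark-avro 
module is on the classpath):

  import org.apache.spark.sql.types._

  // Hypothetical schema standing in for my real columns
  val mySchema = StructType(Seq(
    StructField("id", StringType),
    StructField("value", DoubleType)
  ))

  val df = spark.read
    .option("header", "true")   // drop the header line automatically
    .schema(mySchema)           // supply my own schema instead of inferring one
    .csv("/path/to/input.csv")

  df.write.format("avro").save("/path/to/output")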

rdd.partitions is always 1 here, even after repartitioning the DataFrame 
produced after zipWithIndex. Any help on this topic would be appreciated.
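
For reference, this is roughly how I am checking the partition count (variable 
names as in the sketch above; the comments reflect what I observe, not what I 
expect):

  println(indexed.partitions.length)            // always prints 1 for me

  val repartitioned = df.repartition(8)         // attempted fix
  println(repartitioned.rdd.partitions.length)  // still 1 in my runs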

Thanks,
Asmath

Sent from my iPhone