Hi,

I have a 10 GB CSV file with a header that needs to be loaded into a Spark DataFrame. We were using rdd.zipWithIndex to pick out the column names from the header and then converting the rows to Avro accordingly.
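For reference, what we do today looks roughly like this (a minimal sketch, assuming a SparkSession named spark and a plain comma-delimited file; the Avro conversion step itself is omitted):

    val rdd = spark.sparkContext.textFile("/path/to/file.csv")
    val header = rdd.first()                    // first line carries the column names
    val columns = header.split(",")             // column names for the Avro records
    val rows = rdd
      .zipWithIndex()                           // pair every line with its index
      .filter { case (_, idx) => idx != 0L }    // drop the header row (index 0)
      .map { case (line, _) => line }           // back to plain data lines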
I suspect this is why the load takes so long: only one executor runs and we never achieve parallelism. Is there an easy way to achieve parallelism after filtering out the header? I am also interested in a solution that removes the header from the file and lets me supply my own schema; that way I can split the file. rdd.partitions is always 1 here, even after repartitioning the DataFrame built after zipWithIndex.

Any help on this topic would be appreciated. To make the second ask concrete, something like the sketch below is what I am hoping for.
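This is only a minimal sketch, assuming Spark 2.x with a SparkSession named spark; the column names and types are placeholders for whatever the real file actually contains:

    import org.apache.spark.sql.types._

    // Placeholder schema: the real file has its own columns and types.
    val schema = StructType(Seq(
      StructField("col1", StringType),
      StructField("col2", IntegerType)
    ))

    val df = spark.read
      .option("header", "true")   // skip the header line instead of filtering by index
      .schema(schema)             // my own schema, so no inference pass over the data
      .csv("/path/to/file.csv")

    // Needs the spark-avro module: format name "avro" since Spark 2.4,
    // "com.databricks.spark.avro" on earlier versions.
    df.write.format("avro").save("/path/to/output")

Thanks,
Asmath

Sent from my iPhone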