Hi,

I have a 10 GB CSV file (with a header row) that needs to be loaded into a 
Spark DataFrame. We have been using rdd.zipWithIndex to pick out the column 
names and then converting the rows to Avro accordingly.
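
For context, the current logic is roughly along these lines (a simplified 
sketch: the path, delimiter, and column handling are placeholders, not the 
exact production code):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

  // Read the raw file as an RDD of lines
  val lines = spark.sparkContext.textFile("/path/to/input.csv")

  // Pair each line with its index so the header (index 0) can be found
  val indexed = lines.zipWithIndex()

  // The first line carries the column names
  val header = indexed.filter { case (_, idx) => idx == 0 }
    .map(_._1).first().split(",")

  // Everything after the header is data
  val data = indexed.filter { case (_, idx) => idx > 0 }.map(_._1)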

I suspect this is why it is taking such a long time: only one executor runs, 
and the job never achieves parallelism. Is there an easy way to achieve 
parallelism after filtering out the header?

I am also interested in a solution that removes the header from the file so 
that I can supply my own schema. That way I could split the file.
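
Concretely, I am imagining something like the sketch below (the schema and 
paths are made up for illustration, and writing Avro assumes the spark-avro 
module is on the classpath):

  import org.apache.spark.sql.types._

  // Hypothetical schema standing in for my real columns
  val mySchema = StructType(Seq(
    StructField("id", StringType),
    StructField("value", DoubleType)
  ))

  val df = spark.read
    .option("header", "true")   // drop the header line automatically
    .schema(mySchema)           // supply my own schema instead of inferring one
    .csv("/path/to/input.csv")

  df.write.format("avro").save("/path/to/output")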

rdd.partitions is always 1 here, even after repartitioning the DataFrame 
produced after zipWithIndex. Any help on this topic would be appreciated.
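
For reference, this is roughly how I am checking the partition count (variable 
names as in the sketch above; the comments reflect what I observe, not what I 
expect):

  println(indexed.partitions.length)            // always prints 1 for me

  val repartitioned = df.repartition(8)         // attempted fix
  println(repartitioned.rdd.partitions.length)  // still 1 in my runs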

Thanks,
Asmath

Sent from my iPhone