I see that the most recent code doesn't have RDDApi anymore, but I would still like to understand the logic of DataFrame partitions. Does a DataFrame have its own partitions and act as a sort of RDD by itself, or does it depend on the partitions of the underlying RDD that was used to load the data?
For example, if I create a DataFrame from a HadoopRDD, does that mean the DataFrame has the same partitions as the HadoopRDD?

Thanks,
Gil

From: Gil Vernik/Haifa/IBM@IBMIL
To: Dev <dev@spark.apache.org>
Date: 12/07/2015 13:06
Subject: question related partitions of the DataFrame

Hi,

DataFrame extends RDDApi, which provides RDD-like methods. My question is: is a DataFrame a sort of standalone RDD with its own partitions, or does it depend on the underlying RDD that was used to load the data into its partitions? It is written that a DataFrame has the ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster, but I don't understand whether the partitions of a DataFrame are independent of the partitions of the data source that was used to load the data.

So assume, theoretically, that I used the external DataSource API and wrote code that loads 1GB of data into a single partition. I then map this DataSource to a DataFrame and run some SQL that returns all the records. Will this DataFrame also have one partition in memory, or will Spark somehow divide it into multiple partitions? If so, how will it be divided into partitions? By size? (Can someone point me to the code to see an example?)

Thanks,
Gil
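
A quick way to check this empirically, as a minimal sketch (assuming a Spark 1.x SQLContext in local mode; the object name, column names and partition counts are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative check of how DataFrame partitioning relates to the source RDD
object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-check").setMaster("local[4]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Source RDD with an explicit number of partitions (4 here)
    val rdd = sc.parallelize(1 to 1000, 4).map(i => (i, i % 10))

    // DataFrame built directly on top of the RDD: no shuffle happens,
    // so it simply reuses the source RDD's partitions
    val df = rdd.toDF("value", "bucket")
    println(rdd.partitions.length)      // 4
    println(df.rdd.partitions.length)   // 4 as well

    // Partitioning only changes when something forces a shuffle,
    // e.g. an explicit repartition ...
    println(df.repartition(8).rdd.partitions.length)   // 8

    // ... or a shuffling operation such as groupBy, where the number of
    // output partitions is controlled by spark.sql.shuffle.partitions
    sqlContext.setConf("spark.sql.shuffle.partitions", "6")
    println(df.groupBy("bucket").count().rdd.partitions.length)  // 6

    sc.stop()
  }
}

At least in this small check, a DataFrame built directly on an RDD keeps the source partitioning, and the partition count only changes on a shuffle (repartition, groupBy, etc.), so I would expect a data source that produces one 1GB partition to stay as a single partition unless something explicitly repartitions it.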