Re: distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Simon Elliston Ball
If you load data using ORC or Parquet, the RDD will have one partition per file, so in fact your DataFrame will not directly match the partitioning of the table. If you want to process by partition and guarantee that the partitioning is preserved, then mapPartitions etc. will be useful. Note that if you perform any
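A minimal sketch of the point above, assuming the Spark 1.x Scala API (the path is hypothetical): reading a partitioned Parquet table yields roughly one RDD partition per underlying file, and mapPartitions processes each of those partitions without a shuffle.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext()
    val sqlContext = new SQLContext(sc)

    // Reading a partitioned Parquet directory: the resulting DataFrame's RDD has
    // roughly one partition per underlying file, not one per table partition key.
    val df = sqlContext.read.parquet("/data/events")   // hypothetical path
    println(df.rdd.partitions.length)

    // mapPartitions works on one RDD partition at a time and does not shuffle,
    // so the existing file-level partitioning is preserved.
    val rowsPerPartition = df.rdd.mapPartitions(iter => Iterator(iter.size))
    println(rowsPerPartition.collect().mkString(", "))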

distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Deenar Toraskar
Hi, I have data in HDFS partitioned by a logical key and would like to preserve that partitioning when creating a DataFrame over the same data. Is it possible to create a DataFrame that preserves the partitioning from HDFS or from the underlying RDD? Regards, Deenar
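One way to get rows co-located by the logical key on the DataFrame side is an explicit repartition on the key column (the DataFrame counterpart of DISTRIBUTE BY). A minimal sketch, assuming the Spark 1.6-era Scala API; the path and column name "key" are hypothetical:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext()
    val sqlContext = new SQLContext(sc)

    // The read itself produces one RDD partition per file, regardless of the key layout.
    val df = sqlContext.read.parquet("/data/by_key")   // hypothetical path

    // Repartitioning by the key column shuffles rows so that all rows with the same
    // key land in the same partition, which downstream per-key processing can rely on.
    val byKey = df.repartition(df("key"))
    println(byKey.rdd.partitions.length)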

Rdd partitioning

2015-07-11 Thread anshu shukla
Suppose I have an RDD with 10 tuples and a cluster with 100 cores (standalone mode); by default, how will the partitioning be done? I did not get how it will divide the 10-tuple set (RDD) across 100 cores (by default). Mentioned in documentation: *spark.default.parallelism* For distributed shuffle operati
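A minimal sketch of what happens by default, assuming the Spark Scala API (the numbers are illustrative): when no partition count is given, parallelize uses spark.default.parallelism, which in standalone mode defaults to the total number of cores, so a small dataset ends up spread over many mostly empty partitions.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("default-parallelism-demo")
    val sc = new SparkContext(conf)

    // With no partition count given, parallelize uses spark.default.parallelism,
    // which in standalone mode defaults to the total number of cores in the cluster.
    val rdd = sc.parallelize(1 to 10)
    println(sc.defaultParallelism)      // e.g. 100 on a 100-core cluster
    println(rdd.partitions.length)      // same as defaultParallelism, so most partitions are empty

    // The partition count can be set explicitly to avoid many tiny or empty partitions:
    val rdd2 = sc.parallelize(1 to 10, 10)
    println(rdd2.partitions.length)     // 10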