There are a couple of tools you can use. Take a look at the various functions.
Specifically, limit might be useful for you and sample/sampleBy functions can 
make your data smaller.
Actually, when using CreateDataframe you can sample the data to begin with.

Specifically working by partitions can be done by moving through the RDD 
interface but I am not sure this is what you want. Actually working through a 
specific partition might mean seeing skewed data because of the hashing method 
used to partition (this would of course depend on how your dataframe was 
created).

Just to get smaller data sample/sampleBy seems like the best solution to me.

Assaf.

From: Yanwei Zhang [mailto:actuary_zh...@hotmail.com]
Sent: Wednesday, November 02, 2016 6:29 PM
To: user
Subject: Use a specific partition of dataframe

Is it possible to retrieve a specific partition  (e.g., the first partition) of 
a DataFrame and apply some function there? My data is too large, and I just 
want to get some approximate measures using the first few partitions in the 
data. I'll illustrate what I want to accomplish using the example below:

// create date
val tmp = sc.parallelize(Seq( ("a", 1), ("b", 2), ("a", 1),
                                  ("b", 2), ("a", 1), ("b", 2)), 2).toDF("var", 
"value")
// I want to get the first partition only, and do some calculation, for 
example, count by the value of "var"
tmp1 = tmp.getPartition(0)
tmp1.groupBy("var").count()

The idea is not to go through all the data to save computational time. So I am 
not sure whether mapPartitionsWithIndex is helpful in this case, since it still 
maps all data.

Regards,
Wayne


Reply via email to