Hi,

I have a pretty large data set (2M entities) in my RDD. The data has
already been partitioned by a specific key, and the key falls in a known
range (type Long). Now I want to create a set of key buckets. For example,
if the key range is

    1 -> 100,

I would break the whole range into the buckets below:

    1 ->  10
    11 -> 20
    ...
    91 -> 100
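
Roughly, I mean something like this (a minimal Scala sketch; the step of
10 and the 1..100 range are just taken from the example above):

    // Sketch: bucket bounds for Long keys 1..100 in steps of 10.
    val buckets: Seq[(Long, Long)] =
      (1L until 100L by 10L).map(lo => (lo, lo + 9L))
    // buckets == Seq((1,10), (11,20), ..., (91,100))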

I want to run some analytic SQL functions over the data owned by each
key range, and I have come up with two approaches:

1) Run RDD's filter() on the full data set RDD. The filter produces one
RDD per key bucket, and from each of those RDDs I can create a DataFrame
and run the SQL.
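
A minimal sketch of this approach, reusing the buckets above and assuming
a Spark 1.x SQLContext plus an illustrative pair RDD of (Long, Double)
(the real schema and analytic SQL aren't shown here, so these are
placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(
      new SparkConf().setAppName("buckets").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder data; the real RDD is already partitioned by key.
    val fullRdd = sc.parallelize(Seq((1L, 1.0), (15L, 2.0), (95L, 3.0)))

    val perBucket = buckets.map { case (lo, hi) =>
      // filter() scans the full data set once per bucket.
      val df = fullRdd
        .filter { case (k, _) => k >= lo && k <= hi }
        .toDF("key", "value")
      df.registerTempTable(s"bucket_${lo}_$hi")
      // Stand-in for the real analytic SQL.
      sqlContext.sql(s"SELECT COUNT(*), AVG(value) FROM bucket_${lo}_$hi")
    }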


2) Create a DataFrame for the whole RDD and use a bunch of SQL queries to
do the job:

    SELECT * FROM XXXX WHERE key >= key1 AND key < key2
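
Sketched under the same assumptions as above (register the full data set
once, then issue one ranged query per bucket):

    // Register the full 2M-row data set once.
    fullRdd.toDF("key", "value").registerTempTable("full_data")

    val perBucketSql = buckets.map { case (lo, hi) =>
      // Half-open range [lo, hi + 1), matching the WHERE clause above.
      sqlContext.sql(
        s"SELECT COUNT(*), AVG(value) FROM full_data " +
          s"WHERE key >= $lo AND key < ${hi + 1}")
    }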

So my question is: which one is better from a performance perspective?

Thanks

-- 
--Anfernee
