As for 'rdd.zipWithIndex.partitionBy(YourCustomPartitioner)': can I just drop
some records inside my custom partitioner? Otherwise I still have to call
rdd.take() to get exactly 10 records.
And repartition is THE expensive operation that I want to work around.
Actually, what I expect the limit...
It can be easily done using an RDD.
rdd.zipWithIndex.partitionBy(YourCustomPartitioner) should give you your
items.
Here YourCustomPartitioner will know how to pick sample items from each
partition.
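To make that concrete, here is a minimal sketch of the idea, assuming the
policy is simply "the first n indices are the picked items". TakeNPartitioner
and takeN are illustrative names, not Spark APIs. Note that partitionBy is
only defined on key/value RDDs, so the (value, index) pairs produced by
zipWithIndex have to be swapped first:

  import org.apache.spark.Partitioner
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // Illustrative partitioner: route the first n indices to partition 0
  // and spread everything else over the remaining partitions.
  class TakeNPartitioner(n: Long, override val numPartitions: Int) extends Partitioner {
    require(numPartitions >= 2, "need at least one extra partition for the leftovers")
    def getPartition(key: Any): Int = {
      val idx = key.asInstanceOf[Long]
      if (idx < n) 0 else 1 + (idx % (numPartitions - 1)).toInt
    }
  }

  // partitionBy is only available on (key, value) RDDs, so swap the
  // (value, index) pairs that zipWithIndex produces before repartitioning.
  def takeN[T: ClassTag](rdd: RDD[T], n: Long, parts: Int): RDD[T] =
    rdd.zipWithIndex()                      // RDD[(T, Long)]
      .map(_.swap)                          // RDD[(Long, T)], index as key
      .partitionBy(new TakeNPartitioner(n, parts))
      .filter { case (idx, _) => idx < n }  // a partitioner can only route records, never drop them
      .values

The filter at the end also answers the question above: a Partitioner can only
assign a record to a partition, it cannot drop it, so the trimming still has
to happen in a filter (or a take) afterwards.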
If you want to stick to DataFrames, you can always repartition the data after
you apply the limit.
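For the DataFrame route, a minimal sketch (limitThenSpread is an illustrative
name, and the partition count of 8 is arbitrary):

  import org.apache.spark.sql.DataFrame

  // limit collapses its result into a single partition, so repartition
  // afterwards to restore parallelism for the downstream stages.
  def limitThenSpread(df: DataFrame, n: Int): DataFrame =
    df.limit(n).repartition(8)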
Suppose I have the above scenario without using the limit clause; will it
then work by checking across all the partitions?
On Dec 24, 2015 9:26 AM, "汪洋" wrote:
> I see.
>
> Thanks.
>
>
> On Dec 24, 2015, at 11:44 AM, Zhan Zhang wrote:
>
> There has to be a central point to collaboratively collect exactly 10
> records...
There has to be a central point to collaboratively collect exactly 10
records; currently the approach is to use one single partition, which is easy
to implement.
Otherwise, the driver has to count the number of records in each partition and
then decide how many records to materialize from each.
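For what it's worth, a rough sketch of that alternative (illustrative only,
this is not how Spark implements limit, and distributedLimit is a hypothetical
helper): one job counts each partition, the driver computes a per-partition
quota, and a second job materializes only the quota:

  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  def distributedLimit[T: ClassTag](rdd: RDD[T], n: Long): Array[T] = {
    // First pass: how many records live in each partition?
    val counts: Array[(Int, Long)] = rdd
      .mapPartitionsWithIndex { (i, it) => Iterator((i, it.size.toLong)) }
      .collect()
      .sortBy(_._1)

    // Driver decides how many records to materialize from each partition.
    var remaining = n
    val quota: Map[Int, Long] = counts.map { case (i, c) =>
      val take = math.min(c, remaining)
      remaining -= take
      i -> take
    }.toMap

    // Second pass: each partition emits only its quota.
    rdd.mapPartitionsWithIndex { (i, it) => it.take(quota(i).toInt) }.collect()
  }

Note this costs two jobs over the RDD (so the input would usually need to be
cached), which is exactly the extra coordination the single-partition approach
avoids.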
It is an application running as an HTTP server, so I collect the data as the
response.
> On Dec 24, 2015, at 8:22 AM, Hudong Wang wrote:
>
> When you call collect() it will bring all the data to the driver. Do you mean
> to call persist() instead?
>
> From: tiandiwo...@icloud.com
> Subject: Problem using...