Thank you all. Just changing the RDD to a Map structure saved me approx. 1
second.
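
For anyone finding this thread later, here is a minimal sketch of why the
change helps. It is plain Python rather than Spark code, and the names
(pairs, scan_lookup, map_lookup) and the dataset size are made up for
illustration. It only contrasts scanning every record, which is roughly what
a lookup over an unindexed partition has to do, with a hash lookup in an
in-memory map.

```python
import timeit

# Hypothetical stand-in for a small (Int, String) dataset; this is not Spark code.
pairs = [(i, f"value-{i}") for i in range(100_000)]


def scan_lookup(records, key):
    """Linear scan: inspects every record, so the cost grows with len(records)."""
    return [v for k, v in records if k == key]


# Build the hash map once; each later lookup is O(1) on average.
table = dict(pairs)


def map_lookup(mapping, key):
    """Hash lookup; returns a list so the result shape matches scan_lookup."""
    return [mapping[key]] if key in mapping else []


key = 99_999
# Both approaches must agree on the result.
assert scan_lookup(pairs, key) == map_lookup(table, key)

# Time 20 lookups with each approach (absolute numbers depend on the machine).
scan_time = timeit.timeit(lambda: scan_lookup(pairs, key), number=20)
map_time = timeit.timeit(lambda: map_lookup(table, key), number=20)
print(f"scan: {scan_time:.4f}s  map: {map_time:.6f}s")
```

In Spark itself, I would expect the map to be built on the driver (for
example with collectAsMap()) and shipped to the executors with sc.broadcast,
but that part is not shown here and only makes sense when the data fits
comfortably in memory.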

Yes, I will check out IndexedRDD to see if it has better performance.

best,
/Shahab

On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz <[email protected]> wrote:

> If your dataset is large, there is a Spark Package called IndexedRDD
> optimized for lookups. Feel free to check that out.
>
> Burak
> On Feb 19, 2015 7:37 AM, "Ilya Ganelin" <[email protected]> wrote:
>
>> Hi Shahab - if your data structures are small enough, a broadcast Map is
>> going to provide faster lookup. Lookup within an RDD is an O(m) operation,
>> where m is the size of the partition. For RDDs with multiple partitions,
>> executors can operate on them in parallel, so you get some improvement for
>> larger RDDs.
>> On Thu, Feb 19, 2015 at 7:31 AM shahab <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am doing lookups on cached RDDs [(Int,String)], and I noticed that the
>>> lookup is relatively slow (30-100 ms). I even tried this on one machine
>>> with a single partition, but it made no difference!
>>>
>>> The RDDs are not large at all, 3-30 MB.
>>>
>>> Is this expected behaviour? Should I use another data structure, such as
>>> a HashMap, to hold the data and look it up there, and use Broadcast to
>>> send a copy to all machines?
>>>
>>> best,
>>> /Shahab
>>>
>>>
>>>
