Yes, that is an option.

I started with a function of batch time and index to generate the id as a Long. This
may be faster than generating a UUID, with the added benefit of sorting based on time.
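
Something along these lines (a sketch, not my exact code; the 20-bit split and the
assumption of under ~1M records per batch are mine):

    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    // Pack the batch time into the high bits and the per-record index
    // into the low 20 bits, so ids sort by batch time. Assumes fewer
    // than 2^20 records per batch.
    def withTimeBasedIds(stream: DStream[String]): DStream[(Long, String)] =
      stream.transform { (rdd, time: Time) =>
        rdd.zipWithIndex().map { case (record, idx) =>
          ((time.milliseconds << 20) | idx, record)
        }
      }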

----- Original Message -----
From: "Tathagata Das" <[email protected]>
To: "Soumitra Kumar" <[email protected]>
Cc: "Xiangrui Meng" <[email protected]>, [email protected]
Sent: Thursday, August 28, 2014 2:19:38 AM
Subject: Re: Spark Streaming: DStream - zipWithIndex


If you just want an arbitrary unique id attached to each record in a DStream (no
ordering etc.), then why not generate and attach a UUID to each record?
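
For instance, a minimal sketch (assuming stream is a DStream[String]):

    import java.util.UUID

    // Attach a random UUID to every record: unique with overwhelming
    // probability, but carrying no ordering information.
    val withIds = stream.map(record => (UUID.randomUUID().toString, record))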

On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar <[email protected]> wrote:

I see an issue here.

If rdd.id is 1000, then rdd.id * 1e9.toLong would be BIG.

I wish there were a DStream mapPartitionsWithIndex.
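
Though transform can stand in for it, something like this (a sketch, with stream
as the input DStream):

    // Number records as (partitionIndex, offsetInPartition): unique
    // within one batch's RDD, but not across batches.
    val indexed = stream.transform { rdd =>
      rdd.mapPartitionsWithIndex { (partIdx, iter) =>
        iter.zipWithIndex.map { case (record, i) => ((partIdx, i), record) }
      }
    }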

On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng <[email protected]> wrote:

You can use the RDD id as the seed, which is unique within the same Spark
context. Suppose none of the RDDs contains more than 1 billion
records. Then you can use

rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) 

Just a hack .. 
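
On a DStream that would look something like this (hoisting rdd.id out of the
closure so the closure doesn't capture the RDD itself):

    val unique = stream.transform { rdd =>
      val base = rdd.id.toLong * 1000000000L // rdd.id is unique per SparkContext
      rdd.zipWithUniqueId().map { case (record, uid) => (base + uid, record) }
    }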

On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar <[email protected]> wrote:
> So, I guess zipWithUniqueId will be similar.
>
> Is there a way to get a unique index?
> 
> 
> On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng <[email protected]> wrote:
>> 
>> No. The indices start at 0 for every RDD. -Xiangrui 
>> 
>> On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar <[email protected]> wrote:
>> > Hello,
>> >
>> > If I do:
>> >
>> > dstream.transform { rdd =>
>> >   rdd.zipWithIndex.map { ... }
>> > }
>> >
>> > is the index guaranteed to be unique across all RDDs here?
>> >
>> > Thanks,
>> > -Soumitra.