SPARK-1153

kant kodali Sun, 23 Feb 2020 15:54:52 -0800

Hi All,

Is there any chance of fixing this issue, or could someone suggest a
workaround?
https://spark-project.atlassian.net/browse/SPARK-1153


Currently, I have a bunch of events streaming into Kafka across various
topics, and each event is stamped with a UUIDv1, so it is easy to construct
edges using the UUIDs. I am not sure how to generate a unique long-based ID
without synchronization in a distributed setting. I read this SO post
<https://stackoverflow.com/questions/15184820/how-to-generate-unique-positive-long-using-uuid>,
which shows two ways one might achieve this:

1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE

2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)

However, I am concerned about collisions and would like to know the
probability of collisions for the two approaches above. Any suggestions?
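To make the question concrete for the first approach, here is a rough sketch using the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^b)), for n IDs drawn uniformly from b random bits. It assumes b = 59, since the most significant half of a random (version 4) UUID carries 4 fixed version bits and the mask clears the sign bit. The class and method names are made up for illustration, and the second approach is not modeled here because its behavior depends on the clocks of the machines involved.

```java
import java.util.UUID;

public class LongIdCollisionSketch {

    // Approach 1 from the SO post: a non-negative long taken from a random UUID.
    static long randomPositiveLong() {
        return UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE;
    }

    // Birthday-bound approximation of the chance that at least two of n
    // uniformly random IDs collide when each ID has `randomBits` random bits.
    static double collisionProbability(double n, int randomBits) {
        double space = Math.pow(2.0, randomBits);
        // -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    public static void main(String[] args) {
        // Assumption: 59 effective random bits (64 - 4 version bits - 1 sign bit).
        int bits = 59;
        for (double n : new double[] {1e6, 1e8, 1e9}) {
            System.out.printf("n = %.0e -> p ~ %.3e%n", n, collisionProbability(n, bits));
        }
        System.out.println("sample id: " + randomPositiveLong());
    }
}
```

If this estimate is right, the risk is negligible for millions of events but no longer small once the number of vertices approaches a billion.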

I ran the Connected Components algorithm using GraphFrames. It runs well
when long-based IDs are used, but with string IDs the performance drops
significantly, as pointed out in the ticket. I understand that the algorithm
depends heavily on hashing integers, but I wonder: why not use a fixed-length
byte[]? That way we could convert any datatype to a sequence of bytes.
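As an illustration of what I mean by a fixed-length byte[], a UUID fits losslessly into 16 bytes. This is only a sketch with made-up class and method names, and it says nothing about how GraphFrames or GraphX would need to change to accept such IDs.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytesSketch {

    // Pack the two 64-bit halves of a UUID into a fixed 16-byte array.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Rebuild the UUID from the 16-byte form (most significant half first).
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        byte[] packed = toBytes(original);
        System.out.println(packed.length + " bytes, round trip ok: "
                + original.equals(fromBytes(packed)));
    }
}
```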

Thanks!

