Hi all, Any chance of fixing this one? https://spark-project.atlassian.net/browse/SPARK-1153 Or could someone suggest a workaround?
Currently I have a bunch of events streaming into Kafka across various topics, and each event is stamped with a UUIDv1, so it is easy to construct edges using the UUIDs. I am not sure how to generate a unique long-based ID without synchronization in a distributed setting. I read this SO post <https://stackoverflow.com/questions/15184820/how-to-generate-unique-positive-long-using-uuid>, which shows two ways one might achieve this:

1. UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE
2. (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L)

However, I am concerned about collisions and would like to know the probability of collisions for these two approaches. Any suggestions?

I ran the connected components algorithm using GraphFrames. It runs well when long-based IDs are used, but with string IDs the performance drops significantly, as pointed out in the ticket. I understand that the algorithm depends heavily on hashing integers, but I wonder why not a fixed-length byte[]? That way we could convert any data type to a sequence of bytes.

Thanks!
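
P.S. To make the collision question concrete, here is a rough sketch of how the probability for the first approach might be estimated with the usual birthday-bound approximation. It assumes UUID.randomUUID() returns a version-4 UUID, whose most significant 64 bits contain 4 fixed version bits, so that masking with Long.MAX_VALUE leaves about 59 random bits. The class name LongIdSketch and the helpers collisionProbability and toLongId are hypothetical names used only for illustration. The SHA-256 step is one possible workaround for mapping an existing UUID to a non-negative long; it is not taken from the SO post or the ticket.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

final class LongIdSketch {

    /**
     * Birthday-bound approximation for n ids drawn uniformly from 2^randomBits values:
     * p ~= 1 - exp(-n(n-1) / (2 * 2^randomBits)).
     */
    static double collisionProbability(double n, int randomBits) {
        double space = Math.pow(2.0, randomBits);
        // expm1 keeps precision when the probability is very small.
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    /**
     * Hypothetical helper: hash all 128 bits of a UUID and keep 63 bits of the digest,
     * so the result is a non-negative long that depends on the whole UUID.
     */
    static long toLongId(UUID uuid) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            ByteBuffer buf = ByteBuffer.allocate(16);
            buf.putLong(uuid.getMostSignificantBits());
            buf.putLong(uuid.getLeastSignificantBits());
            byte[] digest = sha.digest(buf.array());
            // First 8 bytes of the digest, sign bit cleared.
            return ByteBuffer.wrap(digest).getLong() & Long.MAX_VALUE;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        double n = 1e9; // assumed number of vertices, for illustration only
        // Approach 1: about 59 random bits (assumption: version-4 UUID layout).
        System.out.println("59 bits: " + collisionProbability(n, 59));
        // Hashed and truncated to 63 bits, assuming the digest behaves like a uniform hash.
        System.out.println("63 bits: " + collisionProbability(n, 63));
        System.out.println("example id: " + toLongId(UUID.randomUUID()));
    }
}
```

The estimate only covers ids that behave like uniform random values. The second approach depends on clock behaviour across machines, so this formula does not apply to it directly.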