[GitHub] flink pull request: [FLINK-2310] Add an Adamic Adar Similarity exa...

shghatge Thu, 09 Jul 2015 08:26:52 -0700

Github user shghatge commented on the pull request:

    https://github.com/apache/flink/pull/892#issuecomment-120033517
  
    Hello @vasia 
    I would like to work on both versions of Adamic Adar. Since the JIRA did not 
ask for an approximate version, it was suggested that I open a separate JIRA 
issue for a library method that computes an approximate Adamic Adar score 
using Bloom filters.
    I have a question about the Bloom filters. Since a Bloom filter only tells 
us whether an element belongs to the set or not, if both vertices have Bloom 
filters as their values, how will we know what to search for in the other set? 
For example, suppose '1,4,13' are set for Vertex 3 and '2,4,13' are set for 
Vertex 5. With the method you suggested, we will find that 4 and 13 are set 
for Vertex 5 too. What tuple should it emit then? Do you suggest that we keep 
another hashtable that tracks a value->vertex relation? Or do we just emit 
5,4,1/log(d3) and use an identity map as the hash function? That would mean 
each vertex has n bits as its value, where n is the number of vertices in the 
graph. I hope my question is clear.
    TL;DR: We would have to use an identity hash function, which implies that 
each vertex needs n bits of memory as its value. Is it okay to use this much 
memory? If there is some other approach, please let me know.
    Bloom filters seem more useful for finding the size of an intersection or 
union, but here we need to know which vertices are common. The only other way 
I can roughly imagine is to get the hashed edges in a dataset, just like 
5,4,1/log(d3), use the same hash function on all the graph edges, and then 
join the datasets obtained on fields 1 and 2.
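    To make the question concrete, here is a minimal sketch in plain Python 
(not Flink/Gelly code). It assumes that '1,4,13' and '2,4,13' are the neighbor 
IDs of Vertex 3 and Vertex 5, and that d3 = 3. The function names and the toy 
hash `v % 8` are made up for illustration only. With a real hash, the 
overlapping bits do not tell us which vertices are common; with the identity 
hash they do, but each value is then n bits wide.

```python
import math

def identity_filter(neighbors):
    """Identity 'hash': bit i is set iff vertex i is a neighbor (n bits wide)."""
    bits = 0
    for v in neighbors:
        bits |= 1 << v
    return bits

def hashed_filter(neighbors, num_bits=8):
    """Bloom-style filter with a single toy hash (v % num_bits).
    Several vertex IDs share one bit, so IDs cannot be recovered from it."""
    bits = 0
    for v in neighbors:
        bits |= 1 << (v % num_bits)
    return bits

def set_positions(bits):
    """Return the indices of the bits that are set, in ascending order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

def emit_from_identity(own_id, own_bits, other_bits, other_degree):
    """Emit (own_id, common_vertex, 1/log(other_degree)) for every bit set
    in both identity filters. Requires other_degree > 1."""
    weight = 1.0 / math.log(other_degree)
    return [(own_id, v, weight) for v in set_positions(own_bits & other_bits)]

v3_neighbors = [1, 4, 13]   # assumed neighbor IDs of Vertex 3
v5_neighbors = [2, 4, 13]   # assumed neighbor IDs of Vertex 5

# Identity hash: the shared bits are exactly the common vertex IDs.
common_ids = set_positions(identity_filter(v3_neighbors) & identity_filter(v5_neighbors))
print(common_ids)    # [4, 13]

# Toy hash: bit 5 is shared, but it could stand for 5, 13, 21, ...
shared_bits = set_positions(hashed_filter(v3_neighbors) & hashed_filter(v5_neighbors))
print(shared_bits)   # [4, 5]

# Tuples Vertex 5 would emit in the identity-hash variant, with d3 = 3 assumed.
print(emit_from_identity(5, identity_filter(v5_neighbors),
                         identity_filter(v3_neighbors), other_degree=3))
```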
    Please let me know if there is a more efficient way, or which of these 
two you would prefer.
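    For reference, a rough estimate of the memory the identity-hash variant 
would need, assuming every one of the n vertices stores a plain n-bit value 
with no compression. The graph size of 1,000,000 vertices is an arbitrary 
illustrative number, not taken from the PR.

```python
def identity_filter_bytes_per_vertex(num_vertices):
    """One bit per vertex in the graph, rounded up to whole bytes."""
    return (num_vertices + 7) // 8

def identity_filter_total_bytes(num_vertices):
    """n vertices, each holding an n-bit value: about n * n / 8 bytes."""
    return num_vertices * identity_filter_bytes_per_vertex(num_vertices)

n = 1_000_000  # hypothetical graph size
print(identity_filter_bytes_per_vertex(n))  # 125000 bytes, about 125 KB per vertex
print(identity_filter_total_bytes(n))       # 125000000000 bytes, about 125 GB in total
```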



