Sorry - the call at the end of my message should be: (some identity (apply pcalls (repeat f))).
I also realized that the random seed contention issues that were discussed on the list earlier might prevent very much gain from parallelizing the accept-reject. Garth On Sat, Jun 26, 2010 at 5:04 PM, Garth Sheldon-Coulson <g...@mit.edu> wrote: > Well, my idea involves rejection sampling. In order to sample an element > from the set in a mass-weighted fashion, you can sample an element in a > uniform fashion and then reject the sample with a certain probability (more > below). If you do reject the sample, you keep drawing from the uniform > proposal distribution until you get an acceptance. This can be quite > efficient if the most massive element in the set is not too much more > massive than the majority of the other elements. If, however, the elements > have very different masses, then it can be inefficient. Regardless, my > intuition is that you can't do better than rejection sampling in this case, > because the probability distribution from which you're sampling is > constantly changing and not of any particular functional form. > > In order to make rejection sampling work, I think you'll need to be able to > do two things: > > 1. Efficiently sample an element from the set in a uniform > (non-mass-weighted) fashion. > 2. Efficiently look up the mass of the most massive element currently in > the set. > > And you said you want set-like lookup by element, so here's a third > requirement: > > 3. Efficiently look up an element's mass by element. > > The hard part is actually (1). In order to get (1), you somewhere need an > index to your elements indexed by index, so that you can uniformly sample an > element efficiently by saying (nth index (rand-int (count index))) or just > (rand-nth index). For this you need the index to be a vector, and this makes > it hard to get efficient deletions. Below I use weak deletions, but this > isn't ideal. This is why I asked about deletions. AFAIK, there's no way to > efficiently sample a uniform random element from a hashmap, hashset, sorted > map, or sorted set... you would need to call seq on any of those to query it > by index, which would be really bad for lookup times. > > But here's the idea. I think it's not bad anyway. > > Let's say you have j objects object_i with masses mass_i, where 0 <= i < j. > Then the data structure I have in mind is composed of: > > 1. A vector v whose elements are object_i. > 2. A sorted set index ss whose elements are [mass_i, object_i] and whose > comparator is #(> (first %1) (first %2)) - so, sorted from largest to > smallest. > 3. A hashmap index h whose key->value pairs are object_i->[mass_i, > vindex_i], where vindex_i is the index of object_i in the vector v. This > allows O(log32 n) access to the mass of each element and also to the vector > index of each element. > 4. A scalar tm holding the total mass of all objects in the set. > > To insert an object, simply conj it onto each structure as appropriate and > update the total mass. > > To delete an element object_k, first look up its mass_k and vindex_k in the > hashmap, and then return a data structure with: > > 1. Vector (assoc v vindex_k ::unused), i.e. a weak deletion. This is > problematic because the length of the vector never decreases with deletions, > and this affects the efficiency of sampling later. I can't think of any way > around this, because we need an index indexed by numbers. > 2. Sorted set (disj ss [mass_k, object_k]). > 3. Hashmap (disj h object_k). > 4. Scalar (- tm mass_k). > > To sample in a mass-weighted manner, we do rejection sampling. Rejection > sampling has two steps: sampling from a uniform proposal distribution over > the elements in the set, and then accepting the sampled element with some > probability (and, if it is rejected, sampling another). The acceptance > probability depends on the mass of the sampled element and also on the mass > of the largest element in the set, because we need to make sure the proposal > distribution, which is a uniform times some factor, envelopes the > distribution from which we are trying to sample. > > As an algorithm over v, ss, h, and tm: > > 1. Sample an integer pindex ("proposal index") from a uniform distribution > over [0, (count v)), i.e. let pindex be (rand-int (count v)). > 2. Let p ("proposal") be (nth v pindex). This step is O(log32 n). If p is > ::unused, then go back to step 1 -- in other words, keep sampling until we > get a uniform random element of v. > 3. Look up the mass of p, i.e. let mass_p be (first (h p)). This step is > O(log32 n). > 4. Look up the mass of the most massive element in the set, i.e. let > max-mass be (first (first ss)). This step is O(1). > 5. Let n be (count v). Accept and return the proposal with probability > mass_p/max-mass = (mass_p/tm)/((1/n)*(max-mass*n/tm)), and otherwise reject > and go back to step 1. This step is fast if there aren't too many rejections > on average. > > On multicore hardware, instead of looping from step 5 back to step 1, you > could do things in parallel: put steps 1-5 in a function f yielding an > accepted value or a nil for rejection, and then do (some (apply pcalls > (repeat f))) to get an acceptance. > > I hope this helps. Definitely you should double-check my reasoning =). > > Garth > > > > On Sat, Jun 26, 2010 at 3:24 PM, Nicolas Oury <nicolas.o...@gmail.com>wrote: > >> Insertion and deletions. But I would like to hear your idea anyway. >> Always good to hear ideas :) >> >> >> On Sat, Jun 26, 2010 at 7:58 PM, Garth Sheldon-Coulson <g...@mit.edu>wrote: >> >>> Are there going to be a lot of deletions from the set? Or mostly >>> insertions? >>> >>> If it's mostly insertions (or if it's just a static data structure that >>> stays the same once built) then I think I can help. >>> >>> On Sat, Jun 26, 2010 at 9:01 AM, Nicolas Oury <nicolas.o...@gmail.com>wrote: >>> >>>> Dear all, >>>> >>>> for a project, I need a data structure to represent finite distribution, >>>> that is a set of elements, each of which having a mass in the set, and two >>>> functions: >>>> >>>> - total-mass : the sum of the masses of each element >>>> - draw : return an element with a probability proportional to its mass >>>> >>>> I currently use the simple idea : a Persistent Hash Map from elt to >>>> mass: >>>> - the total-mass is the reduce of (+ . second) >>>> - drawing is choosing a random-number between 0 and the total-mass and >>>> looping through the seq, to look for the corresponding element. >>>> >>>> Both operation are linear in the number of elements. This is a problem, >>>> because these sets tend to be quite big in my project. >>>> >>>> I think that if I used a HashTrie with total-mass cached in each node, >>>> total-mass would be in constant time and draw in log time. >>>> >>>> So here is my question : does anyone has, for example for the "clojure >>>> in clojure" project, a sketch of an implementation of PersistentHashMap >>>> in clojure itself? (I would be happy to contribute, if it needs more >>>> work). >>>> Or has anyone a better idea that mine to have drawable sets? >>>> >>>> Thanks for your help, >>>> >>>> Best regards, >>>> >>>> Nicolas. >>>> >>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "Clojure" group. >>>> To post to this group, send email to clojure@googlegroups.com >>>> Note that posts from new members are moderated - please be patient with >>>> your first post. >>>> To unsubscribe from this group, send email to >>>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com> >>>> For more options, visit this group at >>>> http://groups.google.com/group/clojure?hl=en >>>> >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Clojure" group. >>> To post to this group, send email to clojure@googlegroups.com >>> Note that posts from new members are moderated - please be patient with >>> your first post. >>> To unsubscribe from this group, send email to >>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com> >>> For more options, visit this group at >>> http://groups.google.com/group/clojure?hl=en >>> >> >> -- >> You received this message because you are subscribed to the Google >> Groups "Clojure" group. >> To post to this group, send email to clojure@googlegroups.com >> Note that posts from new members are moderated - please be patient with >> your first post. >> To unsubscribe from this group, send email to >> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com> >> For more options, visit this group at >> http://groups.google.com/group/clojure?hl=en >> > > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en