Well, my idea involves rejection sampling. In order to sample an element from the set in a mass-weighted fashion, you can sample an element in a uniform fashion and then reject the sample with a certain probability (more below). If you do reject the sample, you keep drawing from the uniform proposal distribution until you get an acceptance. This can be quite efficient if the most massive element in the set is not too much more massive than the majority of the other elements. If, however, the elements have very different masses, then it can be inefficient. Regardless, my intuition is that you can't do better than rejection sampling in this case, because the probability distribution from which you're sampling is constantly changing and not of any particular functional form.
In order to make rejection sampling work, I think you'll need to be able to do two things: 1. Efficiently sample an element from the set in a uniform (non-mass-weighted) fashion. 2. Efficiently look up the mass of the most massive element currently in the set. And you said you want set-like lookup by element, so here's a third requirement: 3. Efficiently look up an element's mass by element. The hard part is actually (1). In order to get (1), you somewhere need an index to your elements indexed by index, so that you can uniformly sample an element efficiently by saying (nth index (rand-int (count index))) or just (rand-nth index). For this you need the index to be a vector, and this makes it hard to get efficient deletions. Below I use weak deletions, but this isn't ideal. This is why I asked about deletions. AFAIK, there's no way to efficiently sample a uniform random element from a hashmap, hashset, sorted map, or sorted set... you would need to call seq on any of those to query it by index, which would be really bad for lookup times. But here's the idea. I think it's not bad anyway. Let's say you have j objects object_i with masses mass_i, where 0 <= i < j. Then the data structure I have in mind is composed of: 1. A vector v whose elements are object_i. 2. A sorted set index ss whose elements are [mass_i, object_i] and whose comparator is #(> (first %1) (first %2)) - so, sorted from largest to smallest. 3. A hashmap index h whose key->value pairs are object_i->[mass_i, vindex_i], where vindex_i is the index of object_i in the vector v. This allows O(log32 n) access to the mass of each element and also to the vector index of each element. 4. A scalar tm holding the total mass of all objects in the set. To insert an object, simply conj it onto each structure as appropriate and update the total mass. To delete an element object_k, first look up its mass_k and vindex_k in the hashmap, and then return a data structure with: 1. Vector (assoc v vindex_k ::unused), i.e. a weak deletion. This is problematic because the length of the vector never decreases with deletions, and this affects the efficiency of sampling later. I can't think of any way around this, because we need an index indexed by numbers. 2. Sorted set (disj ss [mass_k, object_k]). 3. Hashmap (disj h object_k). 4. Scalar (- tm mass_k). To sample in a mass-weighted manner, we do rejection sampling. Rejection sampling has two steps: sampling from a uniform proposal distribution over the elements in the set, and then accepting the sampled element with some probability (and, if it is rejected, sampling another). The acceptance probability depends on the mass of the sampled element and also on the mass of the largest element in the set, because we need to make sure the proposal distribution, which is a uniform times some factor, envelopes the distribution from which we are trying to sample. As an algorithm over v, ss, h, and tm: 1. Sample an integer pindex ("proposal index") from a uniform distribution over [0, (count v)), i.e. let pindex be (rand-int (count v)). 2. Let p ("proposal") be (nth v pindex). This step is O(log32 n). If p is ::unused, then go back to step 1 -- in other words, keep sampling until we get a uniform random element of v. 3. Look up the mass of p, i.e. let mass_p be (first (h p)). This step is O(log32 n). 4. Look up the mass of the most massive element in the set, i.e. let max-mass be (first (first ss)). This step is O(1). 5. Let n be (count v). Accept and return the proposal with probability mass_p/max-mass = (mass_p/tm)/((1/n)*(max-mass*n/tm)), and otherwise reject and go back to step 1. This step is fast if there aren't too many rejections on average. On multicore hardware, instead of looping from step 5 back to step 1, you could do things in parallel: put steps 1-5 in a function f yielding an accepted value or a nil for rejection, and then do (some (apply pcalls (repeat f))) to get an acceptance. I hope this helps. Definitely you should double-check my reasoning =). Garth On Sat, Jun 26, 2010 at 3:24 PM, Nicolas Oury <nicolas.o...@gmail.com>wrote: > Insertion and deletions. But I would like to hear your idea anyway. > Always good to hear ideas :) > > > On Sat, Jun 26, 2010 at 7:58 PM, Garth Sheldon-Coulson <g...@mit.edu>wrote: > >> Are there going to be a lot of deletions from the set? Or mostly >> insertions? >> >> If it's mostly insertions (or if it's just a static data structure that >> stays the same once built) then I think I can help. >> >> On Sat, Jun 26, 2010 at 9:01 AM, Nicolas Oury <nicolas.o...@gmail.com>wrote: >> >>> Dear all, >>> >>> for a project, I need a data structure to represent finite distribution, >>> that is a set of elements, each of which having a mass in the set, and two >>> functions: >>> >>> - total-mass : the sum of the masses of each element >>> - draw : return an element with a probability proportional to its mass >>> >>> I currently use the simple idea : a Persistent Hash Map from elt to >>> mass: >>> - the total-mass is the reduce of (+ . second) >>> - drawing is choosing a random-number between 0 and the total-mass and >>> looping through the seq, to look for the corresponding element. >>> >>> Both operation are linear in the number of elements. This is a problem, >>> because these sets tend to be quite big in my project. >>> >>> I think that if I used a HashTrie with total-mass cached in each node, >>> total-mass would be in constant time and draw in log time. >>> >>> So here is my question : does anyone has, for example for the "clojure in >>> clojure" project, a sketch of an implementation of PersistentHashMap >>> in clojure itself? (I would be happy to contribute, if it needs more >>> work). >>> Or has anyone a better idea that mine to have drawable sets? >>> >>> Thanks for your help, >>> >>> Best regards, >>> >>> Nicolas. >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Clojure" group. >>> To post to this group, send email to clojure@googlegroups.com >>> Note that posts from new members are moderated - please be patient with >>> your first post. >>> To unsubscribe from this group, send email to >>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com> >>> For more options, visit this group at >>> http://groups.google.com/group/clojure?hl=en >>> >> >> -- >> You received this message because you are subscribed to the Google >> Groups "Clojure" group. >> To post to this group, send email to clojure@googlegroups.com >> Note that posts from new members are moderated - please be patient with >> your first post. >> To unsubscribe from this group, send email to >> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com> >> For more options, visit this group at >> http://groups.google.com/group/clojure?hl=en >> > > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com> > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en