Sorry - the call at the end of my message should be: (some identity (apply
pcalls (repeat f))).

I also realized that the random seed contention issues that were discussed
on the list earlier might prevent very much gain from parallelizing the
accept-reject.

Garth

On Sat, Jun 26, 2010 at 5:04 PM, Garth Sheldon-Coulson <g...@mit.edu> wrote:

> Well, my idea involves rejection sampling. In order to sample an element
> from the set in a mass-weighted fashion, you can sample an element in a
> uniform fashion and then reject the sample with a certain probability (more
> below). If you do reject the sample, you keep drawing from the uniform
> proposal distribution until you get an acceptance. This can be quite
> efficient if the most massive element in the set is not too much more
> massive than the majority of the other elements. If, however, the elements
> have very different masses, then it can be inefficient. Regardless, my
> intuition is that you can't do better than rejection sampling in this case,
> because the probability distribution from which you're sampling is
> constantly changing and not of any particular functional form.
>
> In order to make rejection sampling work, I think you'll need to be able to
> do two things:
>
> 1. Efficiently sample an element from the set in a uniform
> (non-mass-weighted) fashion.
> 2. Efficiently look up the mass of the most massive element currently in
> the set.
>
> And you said you want set-like lookup by element, so here's a third
> requirement:
>
> 3. Efficiently look up an element's mass by element.
>
> The hard part is actually (1). In order to get (1), you somewhere need an
> index to your elements indexed by index, so that you can uniformly sample an
> element efficiently by saying (nth index (rand-int (count index))) or just
> (rand-nth index). For this you need the index to be a vector, and this makes
> it hard to get efficient deletions. Below I use weak deletions, but this
> isn't ideal. This is why I asked about deletions. AFAIK, there's no way to
> efficiently sample a uniform random element from a hashmap, hashset, sorted
> map, or sorted set... you would need to call seq on any of those to query it
> by index, which would be really bad for lookup times.
>
> But here's the idea. I think it's not bad anyway.
>
> Let's say you have j objects object_i with masses mass_i, where 0 <= i < j.
> Then the data structure I have in mind is composed of:
>
> 1. A vector v whose elements are object_i.
> 2. A sorted set index ss whose elements are [mass_i, object_i] and whose
> comparator is #(> (first %1) (first %2)) - so, sorted from largest to
> smallest.
> 3. A hashmap index h whose key->value pairs are object_i->[mass_i,
> vindex_i], where vindex_i is the index of object_i in the vector v. This
> allows O(log32 n) access to the mass of each element and also to the vector
> index of each element.
> 4. A scalar tm holding the total mass of all objects in the set.
>
> To insert an object, simply conj it onto each structure as appropriate and
> update the total mass.
>
> To delete an element object_k, first look up its mass_k and vindex_k in the
> hashmap, and then return a data structure with:
>
> 1. Vector (assoc v vindex_k ::unused), i.e. a weak deletion. This is
> problematic because the length of the vector never decreases with deletions,
> and this affects the efficiency of sampling later. I can't think of any way
> around this, because we need an index indexed by numbers.
> 2. Sorted set (disj ss [mass_k, object_k]).
> 3. Hashmap (disj h object_k).
> 4. Scalar (- tm mass_k).
>
> To sample in a mass-weighted manner, we do rejection sampling. Rejection
> sampling has two steps: sampling from a uniform proposal distribution over
> the elements in the set, and then accepting the sampled element with some
> probability (and, if it is rejected, sampling another). The acceptance
> probability depends on the mass of the sampled element and also on the mass
> of the largest element in the set, because we need to make sure the proposal
> distribution, which is a uniform times some factor, envelopes the
> distribution from which we are trying to sample.
>
> As an algorithm over v, ss, h, and tm:
>
> 1. Sample an integer pindex ("proposal index") from a uniform distribution
> over [0, (count v)), i.e. let pindex be (rand-int (count v)).
> 2. Let p ("proposal") be (nth v pindex). This step is O(log32 n). If p is
> ::unused, then go back to step 1 -- in other words, keep sampling until we
> get a uniform random element of v.
> 3. Look up the mass of p, i.e. let mass_p be (first (h p)). This step is
> O(log32 n).
> 4. Look up the mass of the most massive element in the set, i.e. let
> max-mass be (first (first ss)). This step is O(1).
> 5. Let n be (count v). Accept and return the proposal with probability
> mass_p/max-mass = (mass_p/tm)/((1/n)*(max-mass*n/tm)), and otherwise reject
> and go back to step 1. This step is fast if there aren't too many rejections
> on average.
>
> On multicore hardware, instead of looping from step 5 back to step 1, you
> could do things in parallel: put steps 1-5 in a function f yielding an
> accepted value or a nil for rejection, and then do (some (apply pcalls
> (repeat f))) to get an acceptance.
>
> I hope this helps. Definitely you should double-check my reasoning =).
>
> Garth
>
>
>
> On Sat, Jun 26, 2010 at 3:24 PM, Nicolas Oury <nicolas.o...@gmail.com>wrote:
>
>> Insertion and deletions. But I would like to hear your idea anyway.
>> Always good to hear ideas :)
>>
>>
>> On Sat, Jun 26, 2010 at 7:58 PM, Garth Sheldon-Coulson <g...@mit.edu>wrote:
>>
>>> Are there going to be a lot of deletions from the set? Or mostly
>>> insertions?
>>>
>>> If it's mostly insertions (or if it's just a static data structure that
>>> stays the same once built) then I think I can help.
>>>
>>> On Sat, Jun 26, 2010 at 9:01 AM, Nicolas Oury <nicolas.o...@gmail.com>wrote:
>>>
>>>> Dear all,
>>>>
>>>> for a project, I need a data structure to represent finite distribution,
>>>> that is a set of elements, each of which having a mass in the set, and two
>>>> functions:
>>>>
>>>> - total-mass : the sum of the masses of each element
>>>> - draw : return an element with a probability proportional to its mass
>>>>
>>>> I currently use the simple idea : a Persistent Hash Map from elt to
>>>> mass:
>>>> - the total-mass is the reduce of (+ . second)
>>>> - drawing is choosing a random-number between 0 and the total-mass and
>>>> looping through the seq, to look for the corresponding element.
>>>>
>>>> Both operation are linear in the number of elements. This is a problem,
>>>> because these sets tend to be quite big in my project.
>>>>
>>>> I think that if I used a HashTrie with total-mass cached in each node,
>>>> total-mass would be in constant time and draw in log time.
>>>>
>>>> So here is my question : does anyone has, for example for the "clojure
>>>> in clojure" project, a sketch of an implementation of PersistentHashMap
>>>> in clojure itself? (I would be happy to contribute, if it needs more
>>>> work).
>>>> Or has anyone a better idea that mine to have drawable sets?
>>>>
>>>> Thanks for your help,
>>>>
>>>> Best regards,
>>>>
>>>> Nicolas.
>>>>
>>>>  --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Clojure" group.
>>>> To post to this group, send email to clojure@googlegroups.com
>>>> Note that posts from new members are moderated - please be patient with
>>>> your first post.
>>>> To unsubscribe from this group, send email to
>>>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com>
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/clojure?hl=en
>>>>
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To post to this group, send email to clojure@googlegroups.com
>>> Note that posts from new members are moderated - please be patient with
>>> your first post.
>>> To unsubscribe from this group, send email to
>>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com>
>>> For more options, visit this group at
>>> http://groups.google.com/group/clojure?hl=en
>>>
>>
>>  --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com>
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>>
>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Reply via email to