Well, my idea involves rejection sampling. In order to sample an element
from the set in a mass-weighted fashion, you can sample an element in a
uniform fashion and then reject the sample with a certain probability (more
below). If you do reject the sample, you keep drawing from the uniform
proposal distribution until you get an acceptance. This can be quite
efficient if the most massive element in the set is not too much more
massive than the majority of the other elements. If, however, the elements
have very different masses, then it can be inefficient. Regardless, my
intuition is that you can't do better than rejection sampling in this case,
because the probability distribution from which you're sampling is
constantly changing and not of any particular functional form.

In order to make rejection sampling work, I think you'll need to be able to
do two things:

1. Efficiently sample an element from the set in a uniform
(non-mass-weighted) fashion.
2. Efficiently look up the mass of the most massive element currently in the
set.

And you said you want set-like lookup by element, so here's a third
requirement:

3. Efficiently look up an element's mass by element.

The hard part is actually (1). In order to get (1), you somewhere need an
index to your elements indexed by index, so that you can uniformly sample an
element efficiently by saying (nth index (rand-int (count index))) or just
(rand-nth index). For this you need the index to be a vector, and this makes
it hard to get efficient deletions. Below I use weak deletions, but this
isn't ideal. This is why I asked about deletions. AFAIK, there's no way to
efficiently sample a uniform random element from a hashmap, hashset, sorted
map, or sorted set... you would need to call seq on any of those to query it
by index, which would be really bad for lookup times.

But here's the idea. I think it's not bad anyway.

Let's say you have j objects object_i with masses mass_i, where 0 <= i < j.
Then the data structure I have in mind is composed of:

1. A vector v whose elements are object_i.
2. A sorted set index ss whose elements are [mass_i, object_i] and whose
comparator is #(> (first %1) (first %2)) - so, sorted from largest to
smallest.
3. A hashmap index h whose key->value pairs are object_i->[mass_i,
vindex_i], where vindex_i is the index of object_i in the vector v. This
allows O(log32 n) access to the mass of each element and also to the vector
index of each element.
4. A scalar tm holding the total mass of all objects in the set.

To insert an object, simply conj it onto each structure as appropriate and
update the total mass.

To delete an element object_k, first look up its mass_k and vindex_k in the
hashmap, and then return a data structure with:

1. Vector (assoc v vindex_k ::unused), i.e. a weak deletion. This is
problematic because the length of the vector never decreases with deletions,
and this affects the efficiency of sampling later. I can't think of any way
around this, because we need an index indexed by numbers.
2. Sorted set (disj ss [mass_k, object_k]).
3. Hashmap (disj h object_k).
4. Scalar (- tm mass_k).

To sample in a mass-weighted manner, we do rejection sampling. Rejection
sampling has two steps: sampling from a uniform proposal distribution over
the elements in the set, and then accepting the sampled element with some
probability (and, if it is rejected, sampling another). The acceptance
probability depends on the mass of the sampled element and also on the mass
of the largest element in the set, because we need to make sure the proposal
distribution, which is a uniform times some factor, envelopes the
distribution from which we are trying to sample.

As an algorithm over v, ss, h, and tm:

1. Sample an integer pindex ("proposal index") from a uniform distribution
over [0, (count v)), i.e. let pindex be (rand-int (count v)).
2. Let p ("proposal") be (nth v pindex). This step is O(log32 n). If p is
::unused, then go back to step 1 -- in other words, keep sampling until we
get a uniform random element of v.
3. Look up the mass of p, i.e. let mass_p be (first (h p)). This step is
O(log32 n).
4. Look up the mass of the most massive element in the set, i.e. let
max-mass be (first (first ss)). This step is O(1).
5. Let n be (count v). Accept and return the proposal with probability
mass_p/max-mass = (mass_p/tm)/((1/n)*(max-mass*n/tm)), and otherwise reject
and go back to step 1. This step is fast if there aren't too many rejections
on average.

On multicore hardware, instead of looping from step 5 back to step 1, you
could do things in parallel: put steps 1-5 in a function f yielding an
accepted value or a nil for rejection, and then do (some (apply pcalls
(repeat f))) to get an acceptance.

I hope this helps. Definitely you should double-check my reasoning =).

Garth


On Sat, Jun 26, 2010 at 3:24 PM, Nicolas Oury <nicolas.o...@gmail.com>wrote:

> Insertion and deletions. But I would like to hear your idea anyway.
> Always good to hear ideas :)
>
>
> On Sat, Jun 26, 2010 at 7:58 PM, Garth Sheldon-Coulson <g...@mit.edu>wrote:
>
>> Are there going to be a lot of deletions from the set? Or mostly
>> insertions?
>>
>> If it's mostly insertions (or if it's just a static data structure that
>> stays the same once built) then I think I can help.
>>
>> On Sat, Jun 26, 2010 at 9:01 AM, Nicolas Oury <nicolas.o...@gmail.com>wrote:
>>
>>> Dear all,
>>>
>>> for a project, I need a data structure to represent finite distribution,
>>> that is a set of elements, each of which having a mass in the set, and two
>>> functions:
>>>
>>> - total-mass : the sum of the masses of each element
>>> - draw : return an element with a probability proportional to its mass
>>>
>>> I currently use the simple idea : a Persistent Hash Map from elt to
>>> mass:
>>> - the total-mass is the reduce of (+ . second)
>>> - drawing is choosing a random-number between 0 and the total-mass and
>>> looping through the seq, to look for the corresponding element.
>>>
>>> Both operation are linear in the number of elements. This is a problem,
>>> because these sets tend to be quite big in my project.
>>>
>>> I think that if I used a HashTrie with total-mass cached in each node,
>>> total-mass would be in constant time and draw in log time.
>>>
>>> So here is my question : does anyone has, for example for the "clojure in
>>> clojure" project, a sketch of an implementation of PersistentHashMap
>>> in clojure itself? (I would be happy to contribute, if it needs more
>>> work).
>>> Or has anyone a better idea that mine to have drawable sets?
>>>
>>> Thanks for your help,
>>>
>>> Best regards,
>>>
>>> Nicolas.
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To post to this group, send email to clojure@googlegroups.com
>>> Note that posts from new members are moderated - please be patient with
>>> your first post.
>>> To unsubscribe from this group, send email to
>>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com>
>>> For more options, visit this group at
>>> http://groups.google.com/group/clojure?hl=en
>>>
>>
>>  --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com>
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>>
>
>  --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com<clojure%2bunsubscr...@googlegroups.com>
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Reply via email to