Hi Pontus,

I have not seen this approach before, but it is certainly interesting and 
potentially useful.  Without answering your questions directly, here are a 
few things to consider:

1. Small probabilities will be the hardest to represent.

2. An exact method may be possible by storing in each bin not just the 
outcome but also a real value (an acceptance probability), and adding a 
rejection-sampling step: draw a bin uniformly, accept its outcome with 
that probability, and otherwise try again.  By adjusting these acceptance 
probabilities you can make the sampler exact, leaving the total number of 
bins as a free choice (which determines efficiency), as long as it is at 
least as large as the number of outcomes.  A rough sketch follows after 
this list.

3. The key reference on random sampling is Marsaglia's book on the topic, 
which may be worth checking out, 
https://books.google.co.uk/books?id=KJzlBwAAQBAJ

4. If your goal is to generate multiple random draws (instead of just one), 
a very interesting idea for efficiency comes from the literature on 
resampling methods in particle filtering.  There the so-called systematic 
resampling method (e.g. Section 2.4 in http://arxiv.org/pdf/cs/0507025.pdf) 
manages to draw N "samples" in O(N) time and O(1) space from an arbitrary 
probability distribution over N outcomes.  It is probably the most commonly 
used resampling method in particle filtering.  These samples are not iid, 
but, much like quasi-Monte Carlo samples, they still guarantee unbiased 
expectations; see the above paper for some further pointers, and the second 
sketch after this list.
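
For point 2, a rough sketch of what I have in mind, assuming `probs` sums 
to one and has strictly positive entries (the function names and the rule 
for assigning bins to outcomes are just one possible choice, not anything 
standard):

    # Build a table of roughly `m` bins; each bin stores an outcome and
    # an acceptance probability.  Accepted draws follow `probs` exactly,
    # whatever the table size.
    function build_table(probs, m)
        n = length(probs)
        # bin counts roughly proportional to the probabilities, with at
        # least one bin per outcome
        counts = [max(1, round(Int, m * probs[k])) for k in 1:n]
        outcome = [k for k in 1:n for _ in 1:counts[k]]
        m = length(outcome)            # actual table size after rounding
        # scale factor chosen so every acceptance probability is <= 1
        c = minimum(counts[k] / (m * probs[k]) for k in 1:n)
        accept = [c * m * probs[k] / counts[k] for k in outcome]
        return outcome, accept
    end

    function draw(outcome, accept)
        while true
            j = rand(1:length(outcome))              # pick a bin uniformly
            rand() < accept[j] && return outcome[j]  # rejection step
        end
    end

Each pass through the loop returns outcome k with probability 
(counts[k]/m) * accept_k = c * probs[k], so conditional on acceptance the 
draws follow `probs` exactly; making the table larger just pushes the 
overall acceptance rate c towards one.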
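
And for point 4, a minimal sketch of systematic resampling, drawing n 
values from `probs` in one O(n + length(probs)) pass (the function name 
and the guard against floating-point overshoot are my own additions; see 
the paper above for the proper description):

    # A single uniform offset is shared by n equispaced points on (0, 1);
    # each point is mapped to the outcome whose cumulative-probability
    # interval contains it.
    function systematic_resample(probs, n)
        out = zeros(Int, n)
        u = rand() / n        # one random offset for all n points
        cum = probs[1]        # running cumulative sum of probs
        k = 1
        for j in 1:n
            while u + (j - 1) / n > cum && k < length(probs)
                k += 1
                cum += probs[k]
            end
            out[j] = k
        end
        return out
    end

The draws come out sorted and are not independent; the O(1)-space version 
would emit them one at a time instead of storing them in `out`.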

In any case, an interesting idea (trading memory for runtime); let us know 
if you find literature on this topic.

Best,
Sebastian

On Tuesday, 26 May 2015 16:42:20 UTC+1, Pontus Stenetorp wrote:
>
> Everyone, 
>
> tl;dr: When trading memory for computation time by sampling from a 
> discrete probability distribution via uniform draws from a pre-computed 
> vector, is there a good way to decide on the size of the vector?  Code 
> attached to the e-mail. 
>
> First of all, my apologies if this has been answered previously.  I 
> seem to remember it being at least mentioned, but I cannot seem to 
> find anything on it. 
>
> Given that we have a discrete distribution defined by a vector of 
> probabilities, say: 
>
>     probs = [0.1, 0.2, 0.5, 0.1, 0.1] 
>
> A fairly common and straightforward way to sample from this 
> distribution, while trading memory for computation time, is to 
> allocate a vector from which we can then sample uniformly using 
> `rand(1:end)`.  For the above example: 
>
>     tbl = [1, 2, 2, 3, 3, 3, 3, 3, 4, 5] 
>
> Which we can generate as follows: 
>
>     tbl = zeros(Int, 10)  # the table size is the free parameter below 
>     i = 1                 # index of the current outcome in probs 
>     for j in 1:length(tbl) 
>         tbl[j] = i 
>         # move on to the next outcome once its cumulative probability 
>         # has been covered by the bins filled so far 
>         if 1/length(tbl)*j >= sum(probs[1:i]) 
>             i += 1 
>         end 
>     end 
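>
> Drawing a sample is then just a uniform lookup into the table, e.g.: 
>
>     x = tbl[rand(1:end)]  # outcome i comes up with probability ~probs[i] 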
>
> What struck me, however, was that deciding on the size of `tbl` can be 
> a lot trickier for real examples.  If the table is too small, the 
> granularity will be too coarse and we will be unable to represent all 
> the entries in `probs`.  If the table is too large, we will be wasting 
> memory. 
>
> It seems like there should be some sort of sweet spot, and at that 
> point I usually reach for a copy of "Numerical Recipes"; however, 
> either I skimmed it very poorly or this approach is not covered.  Even 
> worse, I have been unable to search for what the approach above is 
> even called.  Does anyone know what it would be referred to as?  Or, 
> even better, how would one best decide on the size of `tbl`?  I am 
> currently using `max(1/(minimum(probs)), 2_000)`, which guarantees a 
> granularity of at least 0.05% and that the smallest probability in 
> `probs` will be covered.  This works fine for my application, but the 
> perfectionist in me squirms whenever I think that I cannot come up 
> with something more elegant. 
>
> Thank you in advance for any and all responses, code attached to the 
> e-mail. 
>
>     Pontus 
>
