Hi Duncan, Martin,
Thanks for your answers.
For my real case I was generating millions of random positions
on a genome.
I compared sample.int() performance between R-2.15.1 and R-devel,
and, for me, it performs better in R-2.15.1 (almost 3x faster and
also uses slightly less memory):
With R-2.15.1:
> set.seed(33)
> system.time(random_chrom_pos <- sample(199000666L, 95000777L))
   user  system elapsed
  4.964   0.268   5.242
> gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   137285   7.4     350000   18.7    350000   18.7
Vcells 47633785 363.5  154735917 1180.6 147135703 1122.6
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
With R-devel:
> set.seed(33)
> system.time(random_chrom_pos <- sample(199000666L, 95000777L))
   user  system elapsed
 14.532   0.296  14.854
> gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   145525   7.8     350000   18.7    350000   18.7
Vcells 47644082 363.5  152959996 1167.0 182023372 1388.8
> sessionInfo()
R Under development (unstable) (2012-10-02 r60861)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
FWIW my R-2.15.1 and R-devel were configured with
--disable-byte-compiled-packages, otherwise, I use all the
defaults. Also my system is a standard Ubuntu 12.04 installation
with no fancy settings/tweakings/customizations.
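
A minimal sketch for repeating this comparison at a smaller scale, to be sourced unchanged in both R versions. The helper name time_sample and the sizes are illustrative assumptions, not the values timed above; n is kept above 1e7 and size below n/2 so that the case described in the NEWS entry is the one being exercised.

```r
# Hypothetical helper: time one call to sample.int() from a fixed seed.
time_sample <- function(n, size, seed = 33) {
  set.seed(seed)
  timing <- system.time(x <- sample.int(n, size))
  list(elapsed = timing[["elapsed"]],   # wall-clock seconds
       n_drawn = length(x),             # number of values returned
       dup     = anyDuplicated(x))      # 0 when all values are distinct
}

# Scaled-down sizes (assumed): n > 1e7 and size <= n/2.
res <- time_sample(20000000L, 1000000L)
print(res$elapsed)
print(gc())
```

Running the same file under each R build and comparing res$elapsed and the gc() output gives a like-for-like comparison; absolute timings will of course depend on the machine.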
Thanks,
H.
On 10/20/2012 12:50 PM, Martin Maechler wrote:
Duncan Murdoch <murdoch.dun...@gmail.com>
on Fri, 19 Oct 2012 19:26:39 -0400 writes:
> On 12-10-19 7:04 PM, Hervé Pagès wrote:
>> Hi,
>>
>> Looks like the implementation of random number generation changed in
>> R-devel with respect to R-2.15.1.
>>
>> With R-2.15.1:
>>
>> > set.seed(33)
>> > sample(49821115, 10)
>> [1] 22217252 19661919 24099911 45779422 42043111 25774933 21778053 17098516
>> [9]   773073  5878451
>>
>> With recent R-devel:
>>
>> > set.seed(33)
>> > sample(49821115, 10)
>> [1] 22217252 19661919 24099912 45779425 42043115 25774935 21778056 17098518
>> [9]   773073  5878452
>>
>> This is on a 64-bit Ubuntu system.
>>
>> Is this change intended? I didn't see anything in the NEWS file.
>>
>> A potential problem with this is that it will break unit tests
>> for algorithms that make use of RNG.
>>
>> Another more practical problem (at least for me) is the following:
>> Bioconductor package maintainers are sometimes working hard on the
>> development version of their package to improve the performance of
>> some key functions. Comparing performance between BioC release
>> (based on R-2.15) and devel (based on R-devel) often requires big
>> input data that is randomly generated, because it's easier than
>> working with real data. Typically a small script is written that
>> takes care of loading the required packages, generating the input
>> data, and running a simple analysis. The same script is sourced in
>> R-2.15 and R-devel, and performance and results are compared.
>>
>> Not being able to generate exactly the same input in the script is
>> a problem. It can be worked around by generating the input once,
>> serializing it, and using load() in the script, but that makes things
>> more complicated and the script is not a standalone script anymore
>> (cannot be passed around without also passing around the big .rda
>> file).
>>
>> Thanks,
>> H.
>>
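
The serialize-once workaround described in the quoted message could look roughly like the sketch below. The file location and the small sample size are assumptions made for illustration; a real benchmark input would be far larger, which is exactly why the .rda file becomes awkward to pass around.

```r
# One-time step, run in a single R version: generate the input and save it.
input_file <- file.path(tempdir(), "random_chrom_pos.rda")  # hypothetical path
set.seed(33)
random_chrom_pos <- sample(49821115, 10)
save(random_chrom_pos, file = input_file)

# In the benchmark script, sourced in every R version being compared:
rm(random_chrom_pos)
loaded_names <- load(input_file)  # load() returns the names of restored objects
print(loaded_names)
print(random_chrom_pos)
```

Every run then sees identical input regardless of how sample() behaves, at the cost of the script no longer being standalone.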
> I think it was mentioned in the NEWS:
> \code{sample.int()} has some support for \eqn{n \ge
> 2^{31}}{n >= 2^31}: see its help for the limitations.
> A different algorithm is used for \code{(n, size, replace = FALSE,
> prob = NULL)} for \code{n > 1e7} and \code{size <= n/2}. This
> is much faster and uses less memory, but does give different results.
So, to reiterate: the RNG has not been changed at all,
but sample() has, for extreme cases (large n) like yours.
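
Since only sample() changed and not the underlying generator, one way to get identical input in both versions is to draw the positions directly from runif(). The following is only a sketch: stable_sample is a hypothetical helper, not part of R, and it assumes that runif() produces the same stream in the versions being compared. It keeps the first `size` distinct values of a uniform stream on 1..n, which amounts to sampling without replacement, and it is only practical when size is well below n.

```r
# Hypothetical helper: `size` distinct integers from 1..n built on runif() only,
# so the result depends on the RNG stream and not on sample()'s internals.
stable_sample <- function(n, size) {
  stopifnot(size <= n)
  out <- numeric(0)
  while (length(out) < size) {
    # Oversample a little, then drop duplicates while keeping draw order.
    k <- ceiling(1.1 * (size - length(out))) + 10
    draws <- floor(runif(k) * n) + 1   # runif() excludes 0 and 1, so values lie in 1..n
    out <- unique(c(out, draws))
  }
  out[seq_len(size)]
}

set.seed(33)
x <- stable_sample(49821115, 10)
set.seed(33)
y <- stable_sample(49821115, 10)
print(identical(x, y))
```

The values will not match those of sample() in either version; the point is only that they should match each other across versions.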
> I don't think the old algorithm is available, but perhaps it could be
> made available by an optional parameter.
I do think we should ideally add such an option or probably
rather allow the more thorough way of either using
RNGversion(..) or something similar to set sample()'s behavior
to exactly as previously.
Doing this "globally" is really needed, as sample() may be called from a
function (from a function from a function) that is not in the
programmer's hand, and so the programmeR could not even
set the new optional argument if he found out that he had to.
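
To illustrate what such a global switch looks like from the caller's side, here is a minimal sketch using the existing RNGversion() and RNGkind() functions. Whether RNGversion() also controls sample()'s choice of algorithm depends on the R version in use and is not assumed here; the sketch only shows the pattern of setting a session-wide state, running code that calls sample() somewhere inside, and restoring the previous state.

```r
old_kind <- RNGkind()                    # remember the current generator settings
suppressWarnings(RNGversion("2.15.1"))   # request the defaults of an older release
set.seed(33)
x <- sample(10, 3)                       # stands in for code that calls sample() deep inside
do.call(RNGkind, as.list(old_kind))      # restore the previous settings
print(x)
```

Because the setting is global, it reaches calls to sample() made by functions the programmer does not control, which an optional argument cannot do.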
Honestly, I'm surprised Hervé found a real case where the
difference is visible.
Martin
> Duncan Murdoch
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel