Re: [R] Random Seed Location

Duncan Murdoch Sun, 04 Mar 2018 15:24:06 -0800

On 04/03/2018 5:54 PM, Henrik Bengtsson wrote:

The following helps identify when .GlobalEnv$.Random.seed has changed:


rng_tracker <- local({
   last <- .GlobalEnv$.Random.seed
   function(...) {
     curr <- .GlobalEnv$.Random.seed
     if (!identical(curr, last)) {
       warning(".Random.seed changed")
       last <<- curr
     }
     TRUE
   }
})

addTaskCallback(rng_tracker, name = "RNG tracker")


EXAMPLE:

sample.int(1L)

[1] 1
Warning: .Random.seed changed

This will help you find for instance:

## Loading ggplot2 does not affect the RNG seed

loadNamespace("ggplot2")

<environment: namespace:ggplot2>

## But attaching it does

library("ggplot2")

Warning: .Random.seed changed

which reveals:

ggplot2:::.onAttach

function (...)
{
     if (!interactive() || stats::runif(1) > 0.1)
         return()
     tips <- c("Need help? Try the ggplot2 mailing list:
http://groups.google.com/group/ggplot2.";,
         "Find out what's changed in ggplot2 at
http://github.com/tidyverse/ggplot2/releases.";,
         "Use suppressPackageStartupMessages() to eliminate package
startup messages.",
         "Stackoverflow is a great place to get help:
http://stackoverflow.com/tags/ggplot2.";,
         "Need help getting started? Try the cookbook for R:
http://www.cookbook-r.com/Graphs/";,
         "Want to understand how all the pieces fit together? Buy the
ggplot2 book: http://ggplot2.org/book/";)
     tip <- sample(tips, 1)
     packageStartupMessage(paste(strwrap(tip), collapse = "\n"))
}
<environment: namespace:ggplot2>

There are probably many case of this in different R packages.


R WISH:

There could be a

preserveRandomSeed({
   tip <- sample(tips, 1)
})

function in R for these type of random needs where true random
properties are non-critical.  This type of
"draw-a-random-number-and-reset-the-seed" is for instance used in
parallel:::initDefaultClusterOptions() which is called when the
'parallel' package is loaded:

seed <- .GlobalEnv$.Random.seed
ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max
port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1)
if(is.null(seed)) ## there was none, initially
rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
else # reset
assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE)

An issue is that .Random.seed doesn't contain the full state of the RNGsystem, so restoring it doesn't necessarily lead to an identicalsequence of output. The only way to guarantee the sequence will repeatis to call set.seed(n), and that only leads to a tiny fraction ofpossible states.

Expanding .Random.seed so that it does contain the full state would be agood idea, and that would make your preserveRandomSeed really easy to write.


Here's a demo that .Random.seed is not enough:

> set.seed(123, normal.kind = "Box-Muller")
> rnorm(1)
[1] -0.1613431
> save <- .Random.seed
> rnorm(1)
[1] 0.6706031
> .Random.seed <- save
> rnorm(1)
[1] -0.4194403

If .Random.seed were the complete state, the 2nd and 3rd rnorm resultswould be the same.


Duncan Murdoch


/Henrik

On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <gwblack...@sbcglobal.net> wrote:

Thank you, everybody, who replied!  I appreciate your valuable advise!  I will 
move the location of the set.seed() command to after all packages have been 
installed and loaded.

Best regards,
Gary

Sent from my iPad

On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert...@gmail.com> wrote:

On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack...@sbcglobal.net>
wrote:

(Sorry to be a bit slow responding.)

You have not supplied a complete example, which would be good in this case 
because what you are suggesting could be a serious bug in R or a package. 
Serious journals require reproducibility these days. For example, JSS is very 
clear on this point.

To your question

My question simply is:  should the location of the set.seed command matter,
provided that it is applied before any commands which involve randomness
(such as partitioning)?


the answer is no, it should not matter. But the proviso is important.

You can determine where things are messing up using something like

set.seed(654321)
zk <- RNGkind()        # [1] "Mersenne-Twister" "Inversion"
zk
z <- runif(2)
z
set.seed(654321)

#  install.packages(c('caret', 'ggplot2', 'e1071'))
library(caret)
all(runif(2)  == z)   # should be true but it is not always

set.seed(654321)
library(ggplot2)
all(runif(2)  == z)   # should be true

set.seed(654321)
library(e1071)
all(runif(2)  == z)   # should be true

all(RNGkind() == zk)  # should be true

On my computer package caret seems to sometimes, but not always, do something 
that advances or changes the RNG. So you will need to set the seed after that 
package is loaded if you want reproducibility.

As Bill Dunlap points out, parallel can introduce much more complicated issues. 
If you are in fact using parallel then we really need a new thread with a 
better subject line, and the discussion will get much messier.

The short answer is that, yes you should be able to get reproducible results 
with parallel computing. If you cannot then you are almost certainly doing 
something wrong. To publish you really must have reproducible results.

In the example that Bill gave, I think the problem is that set.seed() only 
resets the seed in the main thread, the nodes continue to operate with unreset 
RNG. To demonstrate this to yourself you can do

library(parallel)
cl <- parallel::makeCluster(3)
parallel::clusterCall(cl, function()set.seed(100))
parallel::clusterCall(cl, function()RNGkind())
parallel::clusterCall(cl, function()runif(2)) # similar result from all nodes
                                               # [1] 0.3077661 0.2576725

However, do *NOT* do that in real work. You will be getting the same RNG stream from each 
node. If you are using random numbers and parallel you need to read a lot more, and 
probably consider a variant of the "L'Ecuyer" generator or something designed 
for parallel computing.

One special point I will mention because it does not seem to be widely
appreciated: the number of nodes affects the random stream, so recording the 
number of compute nodes along with the RNG and seed information is important 
for reproducible results. This has the unfortunate consequence that an 
experiment cannot be reproduced on a larger cluster. (If anyone knows 
differently I would very much like to hear.)

Paul Gilbert

Hi all,

For some odd reason when running naïve bayes, k-NN, etc., I get slightly
different results (e.g., error rates, classification probabilities) from
run
to run even though I am using the same random seed.

Nothing else (input-wise) is changing, but my results are somewhat
different
from run to run.  The only randomness should be in the partitioning, and I
have set the seed before this point.

My question simply is:  should the location of the set.seed command matter,
provided that it is applied before any commands which involve randomness
(such as partitioning)?

If you need to see the code, it is below:

Thank you,
Gary


A.      Separate the original (in-sample) data from the new (out-of-sample)
data.  Set a random seed.

InvestTech <- as.data.frame(InvestTechRevised)
outOfSample <- InvestTech[5001:nrow(InvestTech), ]
InvestTech <- InvestTech[1:5000, ]
set.seed(654321)


B.      Install and load the caret, ggplot2 and e1071 packages.

install.packages(“caret”)
install.packages(“ggplot2”)
install.packages(“e1071”)
library(caret)
library(ggplot2)
library(e1071)


C.      Bin the predictor variables with approximately equal counts using
the cut_number function from the ggplot2 package.  We will use 20 bins.

InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)


D.      Partition the original (in-sample) data into 60% training and 40%
validation sets.

n <- nrow(InvestTech)
train <- sample(1:n, size = 0.6 * n, replace = FALSE)
InvestTechTrain <- InvestTech[train, ]
InvestTechVal <- InvestTech[-train, ]


E.      Use the naiveBayes function in the e1071 package to fit the model.

model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
prob <- predict(model, newdata = InvestTechVal, type = “raw”)
pred <- ifelse(prob[, 2] >= 0.3, 1, 0)


F.      Use the confusionMatrix function in the caret package to output the
confusion matrix.

confMtr <- confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode =

“everything”, positive = “1”)

accuracy <- confMtr$overall[1]
valError <- 1 – accuracy
confMtr


G.      Classify the 18 new (out-of-sample) readers using the following
code.

prob <- predict(model, newdata = outOfSample, type = “raw”)
pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
cbind(pred, prob, outOfSample[, -3])



If your computations involve the parallel package then set.seed(seed)
may not produce repeatable results.  E.g.,

cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local

host

set.seed(100); runif(2)

[1] 0.3077661 0.2576725

parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))

         [,1]     [,2]     [,3]
[1,] 101.7779 102.5308 103.3459
[2,] 101.8128 102.6114 103.9102


set.seed(100); runif(2)

[1] 0.3077661 0.2576725

parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))

         [,1]     [,2]     [,3]
[1,] 101.1628 102.9643 103.2684
[2,] 101.9205 102.6937 103.7907


Bill Dunlap
TIBCO Software
wdunlap tibco.com


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Random Seed Location

Reply via email to