Thank you, everybody, who replied!  I appreciate your valuable advise!  I will 
move the location of the set.seed() command to after all packages have been 
installed and loaded.

Best regards,
Gary

Sent from my iPad

> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert...@gmail.com> wrote:
> 
> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack...@sbcglobal.net>
> wrote:
> 
> (Sorry to be a bit slow responding.)
> 
> You have not supplied a complete example, which would be good in this case 
> because what you are suggesting could be a serious bug in R or a package. 
> Serious journals require reproducibility these days. For example, JSS is very 
> clear on this point.
> 
> To your question
> > My question simply is:  should the location of the set.seed command matter,
> > provided that it is applied before any commands which involve randomness
> > (such as partitioning)?
> 
> the answer is no, it should not matter. But the proviso is important.
> 
> You can determine where things are messing up using something like
> 
> set.seed(654321)
> zk <- RNGkind()        # [1] "Mersenne-Twister" "Inversion"
> zk
> z <- runif(2)
> z
> set.seed(654321)
> 
> #  install.packages(c('caret', 'ggplot2', 'e1071'))
> library(caret)
> all(runif(2)  == z)   # should be true but it is not always
> 
> set.seed(654321)
> library(ggplot2)
> all(runif(2)  == z)   # should be true
> 
> set.seed(654321)
> library(e1071)
> all(runif(2)  == z)   # should be true
> 
> all(RNGkind() == zk)  # should be true
> 
> On my computer package caret seems to sometimes, but not always, do something 
> that advances or changes the RNG. So you will need to set the seed after that 
> package is loaded if you want reproducibility.
> 
> As Bill Dunlap points out, parallel can introduce much more complicated 
> issues. If you are in fact using parallel then we really need a new thread 
> with a better subject line, and the discussion will get much messier.
> 
> The short answer is that, yes you should be able to get reproducible results 
> with parallel computing. If you cannot then you are almost certainly doing 
> something wrong. To publish you really must have reproducible results.
> 
> In the example that Bill gave, I think the problem is that set.seed() only 
> resets the seed in the main thread, the nodes continue to operate with 
> unreset RNG. To demonstrate this to yourself you can do
> 
> library(parallel)
> cl <- parallel::makeCluster(3)
> parallel::clusterCall(cl, function()set.seed(100))
> parallel::clusterCall(cl, function()RNGkind())
> parallel::clusterCall(cl, function()runif(2)) # similar result from all nodes
>                                               # [1] 0.3077661 0.2576725
> 
> However, do *NOT* do that in real work. You will be getting the same RNG 
> stream from each node. If you are using random numbers and parallel you need 
> to read a lot more, and probably consider a variant of the "L'Ecuyer" 
> generator or something designed for parallel computing.
> 
> One special point I will mention because it does not seem to be widely
> appreciated: the number of nodes affects the random stream, so recording the 
> number of compute nodes along with the RNG and seed information is important 
> for reproducible results. This has the unfortunate consequence that an 
> experiment cannot be reproduced on a larger cluster. (If anyone knows 
> differently I would very much like to hear.)
> 
> Paul Gilbert
> 
> 
> > Hi all,
> >
> > For some odd reason when running naïve bayes, k-NN, etc., I get slightly
> > different results (e.g., error rates, classification probabilities) from
> > run
> > to run even though I am using the same random seed.
> >
> > Nothing else (input-wise) is changing, but my results are somewhat
> > different
> > from run to run.  The only randomness should be in the partitioning, and I
> > have set the seed before this point.
> >
> > My question simply is:  should the location of the set.seed command matter,
> > provided that it is applied before any commands which involve randomness
> > (such as partitioning)?
> >
> > If you need to see the code, it is below:
> >
> > Thank you,
> > Gary
> >
> >
> > A.      Separate the original (in-sample) data from the new (out-of-sample)
> > data.  Set a random seed.
> >
> >> InvestTech <- as.data.frame(InvestTechRevised)
> >> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
> >> InvestTech <- InvestTech[1:5000, ]
> >> set.seed(654321)
> >
> > B.      Install and load the caret, ggplot2 and e1071 packages.
> >
> >> install.packages(“caret”)
> >> install.packages(“ggplot2”)
> >> install.packages(“e1071”)
> >> library(caret)
> >> library(ggplot2)
> >> library(e1071)
> >
> > C.      Bin the predictor variables with approximately equal counts using
> > the cut_number function from the ggplot2 package.  We will use 20 bins.
> >
> >> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
> >> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
> >> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
> >> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
> >
> > D.      Partition the original (in-sample) data into 60% training and 40%
> > validation sets.
> >
> >> n <- nrow(InvestTech)
> >> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
> >> InvestTechTrain <- InvestTech[train, ]
> >> InvestTechVal <- InvestTech[-train, ]
> >
> > E.      Use the naiveBayes function in the e1071 package to fit the model.
> >
> >> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
> >> prob <- predict(model, newdata = InvestTechVal, type = “raw”)
> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
> >
> > F.      Use the confusionMatrix function in the caret package to output the
> > confusion matrix.
> >
> >> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode =
> > “everything”, positive = “1”)
> >> accuracy <- confMtr$overall[1]
> >> valError <- 1 – accuracy
> >> confMtr
> >
> > G.      Classify the 18 new (out-of-sample) readers using the following
> > code.
> >> prob <- predict(model, newdata = outOfSample, type = “raw”)
> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
> >> cbind(pred, prob, outOfSample[, -3])
> >
> >
> >
> 
> 
> If your computations involve the parallel package then set.seed(seed)
> may not produce repeatable results.  E.g.,
> 
> > cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local
> host
> > set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>         [,1]     [,2]     [,3]
> [1,] 101.7779 102.5308 103.3459
> [2,] 101.8128 102.6114 103.9102
> >
> > set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>         [,1]     [,2]     [,3]
> [1,] 101.1628 102.9643 103.2684
> [2,] 101.9205 102.6937 103.7907
> 
> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to