Thank you to everybody who replied! I appreciate your valuable advice! I will move the set.seed() call to after all packages have been installed and loaded, as in the sketch just below.
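For reference, here is a minimal sketch of the reordered setup. It is only an illustration, assuming the same InvestTechRevised data frame and the same packages as in the code quoted further down; the one change is that set.seed() now comes after the library() calls, immediately before the partitioning step.

# install.packages(c("caret", "ggplot2", "e1071"))  # install once, outside the script
library(caret)     # load packages first; loading caret can advance or change the RNG
library(ggplot2)
library(e1071)

InvestTech  <- as.data.frame(InvestTechRevised)
outOfSample <- InvestTech[5001:nrow(InvestTech), ]
InvestTech  <- InvestTech[1:5000, ]

set.seed(654321)   # seed set only after all packages are loaded,
                   # right before the first command that uses randomness
n     <- nrow(InvestTech)
train <- sample(1:n, size = 0.6 * n, replace = FALSE)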
Best regards,
Gary

Sent from my iPad

> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert...@gmail.com> wrote:
>
> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack...@sbcglobal.net> wrote:
>
> (Sorry to be a bit slow responding.)
>
> You have not supplied a complete example, which would be good in this case
> because what you are suggesting could be a serious bug in R or a package.
> Serious journals require reproducibility these days. For example, JSS is
> very clear on this point.
>
> To your question
>
> > My question simply is: should the location of the set.seed command
> > matter, provided that it is applied before any commands which involve
> > randomness (such as partitioning)?
>
> the answer is no, it should not matter. But the proviso is important.
>
> You can determine where things are messing up using something like
>
> set.seed(654321)
> zk <- RNGkind()  # [1] "Mersenne-Twister" "Inversion"
> zk
> z <- runif(2)
> z
> set.seed(654321)
>
> # install.packages(c('caret', 'ggplot2', 'e1071'))
> library(caret)
> all(runif(2) == z)  # should be true but it is not always
>
> set.seed(654321)
> library(ggplot2)
> all(runif(2) == z)  # should be true
>
> set.seed(654321)
> library(e1071)
> all(runif(2) == z)  # should be true
>
> all(RNGkind() == zk)  # should be true
>
> On my computer, package caret seems to sometimes, but not always, do
> something that advances or changes the RNG. So you will need to set the
> seed after that package is loaded if you want reproducibility.
>
> As Bill Dunlap points out, parallel can introduce much more complicated
> issues. If you are in fact using parallel then we really need a new thread
> with a better subject line, and the discussion will get much messier.
>
> The short answer is that, yes, you should be able to get reproducible
> results with parallel computing. If you cannot, then you are almost
> certainly doing something wrong. To publish you really must have
> reproducible results.
>
> In the example that Bill gave, I think the problem is that set.seed() only
> resets the seed in the main thread; the nodes continue to operate with an
> unreset RNG. To demonstrate this to yourself you can do
>
> library(parallel)
> cl <- parallel::makeCluster(3)
> parallel::clusterCall(cl, function() set.seed(100))
> parallel::clusterCall(cl, function() RNGkind())
> parallel::clusterCall(cl, function() runif(2))  # similar result from all nodes
> # [1] 0.3077661 0.2576725
>
> However, do *NOT* do that in real work. You will be getting the same RNG
> stream from each node. If you are using random numbers and parallel you
> need to read a lot more, and probably consider a variant of the "L'Ecuyer"
> generator or something designed for parallel computing. [A minimal sketch
> along these lines appears at the end of this thread.]
>
> One special point I will mention because it does not seem to be widely
> appreciated: the number of nodes affects the random stream, so recording
> the number of compute nodes along with the RNG and seed information is
> important for reproducible results. This has the unfortunate consequence
> that an experiment cannot be reproduced on a larger cluster. (If anyone
> knows differently I would very much like to hear.)
>
> Paul Gilbert
>
> > Hi all,
> >
> > For some odd reason when running naïve Bayes, k-NN, etc., I get slightly
> > different results (e.g., error rates, classification probabilities) from
> > run to run even though I am using the same random seed.
> >
> > Nothing else (input-wise) is changing, but my results are somewhat
> > different from run to run.
> > The only randomness should be in the partitioning, and I have set the
> > seed before this point.
> >
> > My question simply is: should the location of the set.seed command
> > matter, provided that it is applied before any commands which involve
> > randomness (such as partitioning)?
> >
> > If you need to see the code, it is below:
> >
> > Thank you,
> > Gary
> >
> > A. Separate the original (in-sample) data from the new (out-of-sample)
> > data. Set a random seed.
> >
> >> InvestTech <- as.data.frame(InvestTechRevised)
> >> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
> >> InvestTech <- InvestTech[1:5000, ]
> >> set.seed(654321)
> >
> > B. Install and load the caret, ggplot2 and e1071 packages.
> >
> >> install.packages("caret")
> >> install.packages("ggplot2")
> >> install.packages("e1071")
> >> library(caret)
> >> library(ggplot2)
> >> library(e1071)
> >
> > C. Bin the predictor variables with approximately equal counts using
> > the cut_number function from the ggplot2 package. We will use 20 bins.
> >
> >> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
> >> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
> >> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
> >> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
> >
> > D. Partition the original (in-sample) data into 60% training and 40%
> > validation sets.
> >
> >> n <- nrow(InvestTech)
> >> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
> >> InvestTechTrain <- InvestTech[train, ]
> >> InvestTechVal <- InvestTech[-train, ]
> >
> > E. Use the naiveBayes function in the e1071 package to fit the model.
> >
> >> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
> >> prob <- predict(model, newdata = InvestTechVal, type = "raw")
> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
> >
> > F. Use the confusionMatrix function in the caret package to output the
> > confusion matrix.
> >
> >> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]),
> >>                            mode = "everything", positive = "1")
> >> accuracy <- confMtr$overall[1]
> >> valError <- 1 - accuracy
> >> confMtr
> >
> > G. Classify the 18 new (out-of-sample) readers using the following code.
> >
> >> prob <- predict(model, newdata = outOfSample, type = "raw")
> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
> >> cbind(pred, prob, outOfSample[, -3])
>
> If your computations involve the parallel package then set.seed(seed)
> may not produce repeatable results. E.g.,
>
> > cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local host
> > set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>          [,1]     [,2]     [,3]
> [1,] 101.7779 102.5308 103.3459
> [2,] 101.8128 102.6114 103.9102
>
> > set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>          [,1]     [,2]     [,3]
> [1,] 101.1628 102.9643 103.2684
> [2,] 101.9205 102.6937 103.7907
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
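Following up on Paul's pointer to the "L'Ecuyer" generator and Bill's example above, here is a minimal sketch (not from the original messages) of one way to get repeatable results from the base parallel package: clusterSetRNGStream() gives each node its own "L'Ecuyer-CMRG" stream derived from a single seed.

library(parallel)

cl <- makeCluster(3)                      # local cluster with 3 nodes
clusterSetRNGStream(cl, iseed = 654321)   # one L'Ecuyer-CMRG stream per node
res1 <- parSapply(cl, 101:103, function(i) runif(2, i, i + 1))

clusterSetRNGStream(cl, iseed = 654321)   # reset the node streams from the same seed ...
res2 <- parSapply(cl, 101:103, function(i) runif(2, i, i + 1))
identical(res1, res2)                     # ... and the same numbers come back: TRUE

stopCluster(cl)

As Paul notes, the streams still depend on the number of nodes, so the node count needs to be recorded along with the seed and RNG kind for the run to be reproducible.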