I am trying to create a dissimilarity matrix with the daisy function from the cluster package in R, before applying k-medoid clustering for customer segmentation. The dataset is a data.frame with 133,153 observations of 35 variables. It contains numerical and categorical variables, as well as blank cells and missing values. Missing values are NA, while a blank cell means that nothing is present in that cell of the data.frame.
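For a sense of scale, here is a rough back-of-the-envelope sketch of how large a full pairwise dissimilarity object would be for this many rows. It assumes one 8-byte double per pair of observations (lower triangle only), which is my understanding of how a "dissimilarity" object is stored; the variable names are only illustrative.

```r
# Approximate size of a lower-triangle dissimilarity object for n rows.
n       <- 133153              # number of observations, from the post
n_pairs <- n * (n - 1) / 2     # one dissimilarity per pair of rows
bytes   <- n_pairs * 8         # assumption: each value is an 8-byte double
gib     <- bytes / 1024^3      # convert bytes to GiB

n_pairs                        # about 8.86 billion pairs
gib                            # roughly 66 GiB
```

Under these assumptions the result is in line with the 66.0 Gb allocation reported in the error message, so the size appears to come from the number of rows rather than from the 35 variables.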
Here's my session info:

    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform x86_64-w64-mingw32/x64 (64-bit)

I have 35 variables, but here are the first 5:

    > head(df)
      user_id   Age Gender Household.Income Marital.Status
    1   12945         Male
    2   12947         Male
    3   12990
    4   13160 25-34   Male        100k-125k         Single
    5   13195         Male         75k-100k         Single
    6   13286

Since the Windows computer has 3 GB of RAM, I increased the virtual memory to 100 GB, hoping that would be enough to create the matrix. It didn't work.

I've looked into other R packages for solving the memory problem, but they don't work for this data. I cannot use `bigmemory` with the `biganalytics` package because it only accepts numeric matrices. The `clara` function and the `ff` package also accept only numeric matrices.

Here's the daisy script:

    # Load csv file
    > Store1 <- read.csv("/Users/name/Client1.csv", head = TRUE)

    # Convert csv to data.frame
    > df <- as.data.frame(Store1)

    # Increase memory allocation in R to 70 GB
    > memory.limit(size = 70000)
    [1] 70000

    # Load cluster package
    > library(cluster)

    # Create daisy dissimilarity matrix
    # Use Gower distance coefficient for mixed variables
    # Set type as ratio scaled variable
    > daisy1 <- daisy(df, metric = "gower", type = list(ordratio = c(1:35)))
    Error: cannot allocate vector of size 66.0 Gb

How can I fix the error?

-- 
Scott Davis
San Jose, CA
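P.S. For reference, the same daisy call does run on a small subset. Below is a minimal sketch of that, assuming a random subsample is acceptable as a first look; it uses made-up toy data rather than the real file, and the subsample size (50), the number of clusters (k = 3), the seed, and all column values are arbitrary placeholders rather than recommendations.

```r
library(cluster)

set.seed(1)  # arbitrary seed, only so the toy example is repeatable

# Hypothetical toy data standing in for the real mixed-type table.
toy <- data.frame(
  Age    = factor(sample(c("18-24", "25-34", NA), 200, replace = TRUE)),
  Gender = factor(sample(c("Male", "Female"), 200, replace = TRUE)),
  Spend  = runif(200, 0, 500)
)

# Draw a random subsample of rows (the size is an assumption).
idx <- sample(nrow(toy), 50)

# Gower dissimilarities on the subsample only: 50 * 49 / 2 values.
d <- daisy(toy[idx, ], metric = "gower")

# k-medoid clustering on the dissimilarity object (k = 3 is arbitrary).
fit <- pam(d, k = 3)
table(fit$clustering)
```

This only clusters the sampled rows, so it does not by itself assign the remaining customers to segments.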