My students are working with several SPSS datasets provided by the
European Social Survey. If you register your name, you can download them
too. This is the 2004 data, for example:
http://ess.nsd.uib.no/ess/round2/
I cannot give you the European Social Survey dataset, but you can download it
for free if you like, and then you could run these commands to
reproduce the weird pattern described below.
library(foreign)
d2 <- read.spss("ESS3e03_2.por")
warnings()
str(d2$HAPPY)
d2 <- as.data.frame(d2)
str(d2$HAPPY)
d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
warnings()
str(d2$HAPPY)
Here's my info for this example:
sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-pc-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreign_0.8-38
The weirdness that follows is the difference between
d2 <- read.spss( ... , to.data.frame=T)
and
d2 <- read.spss ()
d2 <- as.data.frame(d2)
The former causes all data to become <NA> but the latter seems mostly OK.
library(foreign)
d2 <- read.spss("ESS3e03_2.por")
warnings()
There were 12 warnings (use warnings() to see them)
Warning messages:
1: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
2: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
3: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know", ... :
duplicated levels will not be allowed in factors anymore
4: In `levels<-`(`*tmp*`, value = c("No second language mentioned", ... :
duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl", ... :
duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
folkskola/grundskola\"", ... :
duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt", ... :
duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti", ... :
duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias", ... :
duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige", ... :
duplicated levels will not be allowed in factors anymore
str(d2$HAPPY)
Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...
d2 <- as.data.frame(d2)
str(d2$HAPPY)
Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...
That appears valid. On my first effort, I had tried to get the data
frame in a single shot with read.spss
d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
There were 15 warnings (use warnings() to see them)
warnings()
Warning messages:
1: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
longer object length is not a multiple of shorter object length
2: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
longer object length is not a multiple of shorter object length
3: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
longer object length is not a multiple of shorter object length
4: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know", ... :
duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("No second language mentioned", ... :
duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl", ... :
duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
folkskola/grundskola\"", ... :
duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt", ... :
duplicated levels will not be allowed in factors anymore
13: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti", ... :
duplicated levels will not be allowed in factors anymore
14: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias", ... :
duplicated levels will not be allowed in factors anymore
15: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige", ... :
duplicated levels will not be allowed in factors anymore
str(d2$HAPPY)
Factor w/ 13 levels "Extremely unhappy",..: NA NA NA NA NA NA NA NA NA NA ...
Oh, heck, all the values are missing!! Somehow, putting
"to.data.frame" inside the read.spss causes a different outcome than
using as.data.frame after reading in the data.
The symptoms of this in R-2.9 are a little different, but the
conclusion is the same. Help?
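For anyone who wants to narrow this down, a check along these lines might help compare the two import routes column by column. It is only a sketch: naShare is a made-up helper name, and the two small data frames are invented stand-ins for the results of the two read.spss routes, not ESS data.

```r
## Hypothetical helper: share of missing values in each column
naShare <- function(df) {
    sapply(df, function(x) mean(is.na(x)))
}

## Invented stand-ins for the two import routes (not ESS data)
viaList <- data.frame(HAPPY = factor(c("1", "5", "9", "5")),
                      AGE = c(34, 51, NA, 28))
oneShot <- data.frame(HAPPY = factor(c(NA, NA, NA, NA),
                                     levels = c("1", "5", "9")),
                      AGE = c(34, 51, NA, 28))

## Names of the columns whose NA share differs between the routes
changed <- names(which(naShare(viaList) != naShare(oneShot)))
changed
```

With the real file, viaList would be the result of read.spss() followed by as.data.frame(), and oneShot the result of read.spss() with to.data.frame=T.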
In case you are a student who wants to work with this data, I can
share with you the large script that I have been accumulating so that
you might "play along". It turns out to be surprisingly difficult to
"recode" these factor variables that have levels like "none", "1",
"2",..."9", "total".
## Paul Johnson
## November 13, 2009
## A question arose in the lab. A student asks "I want
## to compare the answers from two different editions
## of the European Social Survey."
## I will add this to Stuff Worth Knowing later, but
## I can share this tutorial with you right now.
## From this website:
## http://ess.nsd.uib.no/ess
## Download those European Social Survey Datasets into a directory.
## In a terminal, use the unzip command:
## unzip ESS3e03_2.spss.zip
## unzip ESS2e03_1.spss.zip
## Then run the following in R.
library(foreign)
### You can try to go into a data frame in one
### step, that's an option in read.spss:
### d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
### But we saw warnings (and all-NA columns), and wanted
### to be careful. So read the list first, then convert.
d2 <- read.spss("ESS3e03_2.por")
warnings()
d2 <- as.data.frame(d2)
d2$whichSurvey <- 2
d3 <- read.spss("ESS2e03_1.por")
d3 <- as.data.frame(d3)
d3$whichSurvey <- 3
namesd2 <- names(d2)
namesd3 <- names(d3)
commonNames <- intersect( namesd3, namesd2)
combod23 <- rbind(d2[ , commonNames], d3[, commonNames])
save(combod23, file="combod23.Rda")
## No error, but rbind gives warnings:
##Warning messages:
##1: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
## invalid factor level, NAs generated
##2: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
## invalid factor level, NAs generated
##3: In `[<-.factor`(`*tmp*`, ri, value = c(1, 1, 1, 1, 1, 1, 1, 1, 1, :
## invalid factor level, NAs generated
## That worries me a little bit. The warnings did too.
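## One possible way to hunt for the cause of those warnings. This
## is a sketch only, and the helper name classMismatch is made up.
## It lists the common columns whose class differs between two
## data frames, for instance a factor in one and numeric in the
## other. Whether that is what happened here is a guess.
## The toy data frames are invented, not ESS data.
classMismatch <- function(a, b) {
    common <- intersect(names(a), names(b))
    ca <- sapply(a[common], function(x) class(x)[1])
    cb <- sapply(b[common], function(x) class(x)[1])
    common[ca != cb]
}
toyA <- data.frame(HAPPY = factor(c("1", "2")), AGE = c(30, 40))
toyB <- data.frame(HAPPY = c(1, 2), AGE = c(50, 60))
classMismatch(toyA, toyB)
## With the real data, while d2 and d3 still exist:
## classMismatch(d2, d3)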
## Inspect a few lines in the result.
combod23[1:4, ]
## fix doesn't work for me, did not bother to investigate.
##> fix(combod23)
##Error in edit.data.frame(get(subx, envir = parent), title = subx, ...) :
## can only handle vector and factor elements
## That means some data from hell came into this thing.
## I suspect that combod23 is OK.
## The memory use on this exercise is huge! Try to help it
rm (d2)
rm (d3)
## But I worry. I have 2 ways that I use to try to figure this
## out. One is to open the dataset in a clone of SPSS called
## "PSPP". Actually, the executable is "psppire".
##
## The other thing I do is open the same data again in
## a numeric format, and compare the 2 combined data frames
## This is also a useful exercise because it helps you
## understand what a "factor" is in R.
dn2 <- read.spss("ESS3e03_2.por", use.value.labels = F)
dn2 <- as.data.frame(dn2)
dn2$whichSurvey <- 2
dn3 <- read.spss("ESS2e03_1.por", use.value.labels = F)
dn3 <- as.data.frame(dn3)
dn3$whichSurvey <- 3
## Might be smart to compare
# dn2$HAPPY[1:50]
# d2$HAPPY[1:50]
namesdn2 <- names(dn2)
namesdn3 <- names(dn3)
commonNNames <- intersect( namesdn3, namesdn2 )
combodn23 <- rbind(dn2[ , commonNNames], dn3[, commonNNames])
save(combodn23, file="combodn23.Rda")
table( combod23$HAPPY, combodn23$HAPPY)
## In summary, whenever I want to use a variable from
## the combined data frame, I would probably compare
## against combodn23 just to feel safe.
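## A sketch of how that comparison could be automated. The helper
## name oneToOne is made up. It checks that every factor level
## that actually occurs lines up with exactly one numeric code
## in the cross-tabulation. Toy vectors, not ESS data:
oneToOne <- function(f, n) {
    tab <- table(f, n)
    used <- rowSums(tab) > 0
    all(rowSums(tab[used, , drop = FALSE] > 0) == 1)
}
toyF <- factor(c("low", "mid", "high", "mid"))
toyN <- c(0, 1, 2, 1)
oneToOne(toyF, toyN)
## With the real data:
## oneToOne(combod23$HAPPY, combodn23$HAPPY)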
## Note, when you come back to work on this project again, you
## might as well just reload the saved copies of combod23 and
## combodn23.
## load("combod23.Rda")
## load("combodn23.Rda")
## That will put you at the current spot, no need to redo the merge
## Now, about "recoding". If you just want numerical
## data, you might consider using combodn23.
## But if you want some factors and some numerical
## variables, then you might need to recode to reclaim
## values.
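## A toy illustration of why this takes care (invented values,
## not ESS data). as.numeric() on a factor returns the internal
## integer codes, not the numbers printed in the labels.
toyLev <- c("none", "1", "2", "3", "total")
toyF2 <- factor(c("none", "2", "3", "total", "1"), levels = toyLev)
toyCodes <- as.numeric(toyF2)   # positions in toyLev, not values
## Treating "none" as 0 and "total" as missing is an assumption
## made only for this example.
toyF3 <- toyF2
levels(toyF3)[levels(toyF3) == "none"] <- "0"
toyF3[toyF3 == "total"] <- NA
toyF3 <- factor(toyF3)          # drop the unused "total" level
## Convert the level labels to numbers, then index by the factor
toyVals <- as.numeric(levels(toyF3))[toyF3]
cbind(toyCodes, toyVals)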
## HAPPY turns out to be an interesting example of a
## PAIN IN THE ASS because in SPSS, it is scored from
## 0 to 10, but they give value labels only for scores
## 0= Extremely unhappy
## and
## 10= Extremely happy
##
## The SPSS column has no labels for values 1-9.
## If SPSS gave NO labels at all, then this would come
## into R as a numeric variable. BUT, because 2 values
## are labeled, R makes a factor out of it.
## When R turns it into a factor, you
## end up with a nutty looking factor, which has
## levels you don't really appreciate.
levels(combod23$HAPPY)
# [1] "Extremely unhappy" "1" "2"
# [4] "3" "4" "5"
# [7] "6" "7" "8"
#[10] "9" "Extremely happy" "Refusal"
#[13] "Don't know" "No answer"
## Create a new variable to play with
combod23$HAPPY2 <- combod23$HAPPY
## Change Extremely Unhappy to text "0"
levels(combod23$HAPPY)[1] <- "0"
## Change Extremely Happy to "10"
levels(combod23$HAPPY)[11] <- "10"
HELL <- levels(combod23$HAPPY)
### Look at HELL
HELL
combod23$HAPPY2[combod23$HAPPY %in% HELL[12:14] ] <- NA
##CHECK RESULT
table(combod23$HAPPY, combod23$HAPPY2)
## Eliminate the unused levels from HAPPY2
combod23$HAPPY2 <- factor(combod23$HAPPY2)
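### The same result is obtained with
## combod23$HAPPY2 <- combod23$HAPPY2[ , drop=T]
## Use the "factor trick" to
## reset the variable back to numeric. The index has to be the
## factor HAPPY2 (HAPPYN does not exist yet). as.numeric(HELL)
## warns about NAs because HELL[12:14] are not numbers.
combod23$HAPPYN <- as.numeric(HELL)[combod23$HAPPY2]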
##CHECK RESULT
table(combod23$HAPPY, combod23$HAPPY2)
## CHECK by comparing against numeric data from spss
table(combodn23$HAPPY, combod23$HAPPYN)
## Next, a student asks "how can I make that same recode
## on a lot of variables?" I'm going to have to leave
## that one unanswered. I think the answer will probably
## be to get a list of variables, then use "lapply" to
## do the same thing to each variable in turn. But
## I have not written up a simple, understandable example
## yet
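## Here is one rough sketch of that lapply idea, not a tested
## recipe for the ESS data. The helper name recodeScale and the
## toy data frame are made up. It assumes every variable in the
## list has the same layout as HAPPY: 11 scale levels in order
## (0 to 10), followed by the missing-value levels.
recodeScale <- function(f, nScale = 11) {
    ## integer codes 1..nScale become 0..(nScale-1), others NA
    code <- as.integer(f)
    ifelse(code <= nScale, code - 1, NA)
}
toyLevels <- c("Extremely unhappy", 1:9, "Extremely happy", "Refusal")
toy <- data.frame(
    A = factor(c("Extremely unhappy", "5", "Refusal"), levels = toyLevels),
    B = factor(c("Extremely happy", "Refusal", "1"), levels = toyLevels))
vars <- c("A", "B")
## Add numeric copies named AN and BN next to the factors
toy[paste(vars, "N", sep = "")] <- lapply(toy[vars], recodeScale)
toy
## Always check each new variable against the numeric import,
## as was done for HAPPYN above.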
## After the data is all recoded and homogenized, then we
## could run any analysis we want, and throw in the variable
## "whichSurvey" to see if there is a difference between the
## two models.
## Example, choose your y and x1 and x2, then
## mod <- lm(y~ (x1+x2)*whichSurvey, data=combod23)
## or if you think the difference is just in the intercept:
## mod <- lm(y~ x1+x2 + whichSurvey, data=combod23)
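## A made-up illustration with simulated data (not ESS results),
## just to show the mechanics. Turning whichSurvey into a factor
## so that R treats it as a group label is my own choice for
## this example.
set.seed(1234)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                  whichSurvey = factor(rep(c(2, 3), each = 100)))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(200)
modInt <- lm(y ~ (x1 + x2) * whichSurvey, data = sim)
modAdd <- lm(y ~ x1 + x2 + whichSurvey, data = sim)
## F test of whether the interaction terms are needed
anova(modAdd, modInt)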