Re: [R] spss imports--trouble with to.data.frame

Peter Ehlers Fri, 13 Nov 2009 15:39:21 -0800

I can't really help you with your problem, but maybe
importing with use.value.labels=FALSE will at least
get rid of the 'duplicated levels' warnings.


 -Peter Ehlers

Paul Johnson wrote:

My students are working with several SPSS datasets provided by the
European Social Survey. If you register your name, you can download them
too. This is the 2004 data, for example:

http://ess.nsd.uib.no/ess/round2/

I cannot give you the European Social Survey dataset, but you can download it
for free if you like, and then you could run these commands to
reproduce the weird pattern described below.

library(foreign)
d2 <- read.spss("ESS3e03_2.por")
warnings()

str(d2$HAPPY)
d2 <- as.data.frame(d2)
str(d2$HAPPY)

d2 <- read.spss("ESS3e03_2.por", to.data.frame=TRUE)
warnings()
str(d2$HAPPY)

Here's my info for this example:

sessionInfo()

R version 2.10.0 (2009-10-26)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] foreign_0.8-38


The weirdness that follows is the difference between

d2 <- read.spss( ... , to.data.frame=TRUE)

and

d2 <- read.spss( ... )
d2 <- as.data.frame(d2)

The former causes all data to become <NA> but the latter seems mostly OK.

library(foreign)
d2 <- read.spss("ESS3e03_2.por")

warnings()
There were 12 warnings (use warnings() to see them)

Warning messages:

1: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
2: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
3: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know",  ... :
  duplicated levels will not be allowed in factors anymore
4: In `levels<-`(`*tmp*`, value = c("No second language mentioned",  ... :
  duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl",  ... :
  duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
folkskola/grundskola\"",  ... :
  duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt",  ... :
  duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti",  ... :
  duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias",  ... :
  duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige",  ... :
  duplicated levels will not be allowed in factors anymore

str(d2$HAPPY)

 Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...

d2 <- as.data.frame(d2)
str(d2$HAPPY)

 Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...

That appears valid.  On my first effort, I had tried to get the data
frame in a single shot with read.spss

d2 <- read.spss("ESS3e03_2.por", to.data.frame=TRUE)

There were 15 warnings (use warnings() to see them)

warnings()

Warning messages:
1: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
2: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
3: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
4: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know",  ... :
  duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("No second language mentioned",  ... :
  duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl",  ... :
  duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
folkskola/grundskola\"",  ... :
  duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt",  ... :
  duplicated levels will not be allowed in factors anymore
13: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti",  ... :
  duplicated levels will not be allowed in factors anymore
14: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias",  ... :
  duplicated levels will not be allowed in factors anymore
15: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige",  ... :
  duplicated levels will not be allowed in factors anymore

str(d2$HAPPY)
 Factor w/ 13 levels "Extremely unhappy",..: NA NA NA NA NA NA NA NA NA NA ...

Oh, heck, all the values are missing!! Somehow, putting
"to.data.frame" inside the read.spss causes a different outcome than
using as.data.frame after reading in the data.

The symptoms of this in R-2.9 are a little different, but the
conclusion is the same.  Help?
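
One possible way to narrow this down (a sketch, not a confirmed
diagnosis): in the read.spss documentation the use.missings argument
defaults to the value of to.data.frame, so the one-step call also
switches on handling of user-defined missing values, which the two-step
route leaves off. The extra "xi >= z[1L] | ..." warnings look as if they
come from that step, but that is only an inference from the warning
text. The helper name read_ess_one_step is made up for illustration.

```r
library(foreign)

# Hypothetical helper: read straight into a data frame, but with
# user-defined missing-value handling switched off, so the call matches
# what read.spss() followed by as.data.frame() does by default.
read_ess_one_step <- function(path) {
  read.spss(path, to.data.frame = TRUE, use.missings = FALSE)
}

# Usage, assuming the file is in the working directory:
# d2 <- read_ess_one_step("ESS3e03_2.por")
# str(d2$HAPPY)
```

If HAPPY comes back with real values under this call, the missing-value
step is the likely culprit; if it is still all NA, the cause is elsewhere.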

In case you are a student who wants to work with this data, I can
share with you the large script that I have been accumulating so that
you might "play along".  It turns out to be surprisingly difficult to
"recode" these factor variables that have levels like "none", "1",
"2",..."9", "total".



## Paul Johnson
## November 13, 2009

## A question arose in the lab. A student asks "I want
## to compare the answers from two different editions
## of the European Social Survey."

## I will add this to Stuff Worth Knowing later, but
## I can share this tutorial with you right now.

## From this website:

## http://ess.nsd.uib.no/ess

## Download those European Social Survey Datasets into a directory.

## In a terminal, use the unzip command:
## unzip ESS3e03_2.spss.zip

## unzip ESS2e03_1.spss.zip

## Then run the following in R.


library(foreign)

d2 <- read.spss("ESS3e03_2.por", to.data.frame=TRUE)


d2 <- read.spss("ESS3e03_2.por")
warnings()

### You can try to go into a data frame in one
### step, that's an option in read.spss. But
### we saw warnings, and wanted to be careful.

d2 <- as.data.frame(d2)
d2$whichSurvey <- 2

d3 <- read.spss("ESS2e03_1.por")

d3 <- as.data.frame(d3)
d3$whichSurvey <- 3

namesd2 <- names(d2)
namesd3 <- names(d3)

commonNames <- intersect( namesd3, namesd2)

combod23 <- rbind(d2[ , commonNames], d3[, commonNames])

save(combod23, file="combod23.Rda")


## Error
##Warning messages:
##1: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
##  invalid factor level, NAs generated
##2: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
##  invalid factor level, NAs generated
##3: In `[<-.factor`(`*tmp*`, ri, value = c(1, 1, 1, 1, 1, 1, 1, 1, 1,  :
##  invalid factor level, NAs generated

## That worries me a little bit. The warnings did too.
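
## A small sketch on made-up data (the toy frames a and b and their
## columns are invented for illustration, they are not the ESS files).
## It shows the same intersect()/rbind() pattern, plus one way to list
## factor levels that exist in only one of the two frames before
## stacking them, which may help explain level-related warnings.

```r
# Two toy data frames that share only some columns.
a <- data.frame(id = 1:2,
                HAPPY = factor(c("1", "Refusal")),
                onlyA = c(TRUE, FALSE))
b <- data.frame(id = 3:4,
                HAPPY = factor(c("2", "Don't know")),
                onlyB = c("x", "y"))

# Keep only the columns present in both, then stack the rows.
common <- intersect(names(b), names(a))
combo <- rbind(a[, common], b[, common])

# Levels of HAPPY that appear in b but not in a.
extra_levels <- setdiff(levels(b$HAPPY), levels(a$HAPPY))
```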

## Inspect a few lines in the result.

combod23[1:4, ]

## fix doesn't work for me, did not bother to investigate.

##> fix(combod23)
##Error in edit.data.frame(get(subx, envir = parent), title = subx, ...) :
##  can only handle vector and factor elements
## That means some data from hell came into this thing.

## I suspect that combod23 is OK.

## The memory use on this exercise is huge! Try to help it

rm (d2)
rm (d3)


## But I worry. I have 2 ways that I use to try to figure this
## out. One is to open the dataset in a clone of SPSS called
## "PSPP". Actually, the executable is "psppire".
##
## The other thing I do is open the same data again in
## a numeric format, and compare the 2 combined data frames

## This is also a useful exercise because it helps you
## understand what a "factor" is in R.

dn2 <- read.spss("ESS3e03_2.por", use.value.labels = FALSE)


dn2 <- as.data.frame(dn2)
dn2$whichSurvey <- 2

dn3 <- read.spss("ESS2e03_1.por", use.value.labels = FALSE)

dn3 <- as.data.frame(dn3)
dn3$whichSurvey <- 3

## Might be smart to compare
# dn2$HAPPY[1:50]
# d2$HAPPY[1:50]

namesdn2 <- names(dn2)
namesdn3 <- names(dn3)

commonNNames <- intersect( namesdn3, namesdn2 )

combodn23 <- rbind(dn2[ , commonNNames], dn3[, commonNNames])

save(combodn23, file="combodn23.Rda")

table( combod23$HAPPY, combodn23$HAPPY)

## In summary, whenever I want to use a variable from
## the combined data frame, I would probably compare
## against combodn23 just to feel safe.




## Note: when you come back to work on this project again, you
## might as well just reload the saved copies of combod23 and
## combodn23.

## load("combod23.Rda")

## load("combodn23.Rda")

## That will put you at the current spot, no need to redo the merge


## Now, about "recoding". If you just want numerical
## data, you might consider using combodn23.

## But if you want some factors and some numerical
## variables, then you might need to recode to reclaim
## values.

## HAPPY turns out to be an interesting example of a
## PAIN IN THE ASS because in SPSS, it is scored from
## 0 to 10, but they give value labels only for scores
## 0 = Extremely unhappy
## and
## 10 = Extremely happy
##
## And the SPSS column has no labels for values 1-9.
## If SPSS gave NO labels at all, then this would come
## into R as a numeric variable. BUT, because there are
## 2 levels named, then R makes a factor out of it.

## When R turns it into a factor, you
## end up with a nutty looking factor, which has
## levels you don't really appreciate.

levels(combod23$HAPPY)
# [1] "Extremely unhappy" "1"                 "2"
# [4] "3"                 "4"                 "5"
# [7] "6"                 "7"                 "8"
#[10] "9"                 "Extremely happy"   "Refusal"
#[13] "Don't know"        "No answer"



## Create a new variable to play with
combod23$HAPPY2 <- combod23$HAPPY

## Change Extremely Unhappy to text "0"
levels(combod23$HAPPY)[1] <- "0"
## Change Extremely Happy to "10"
levels(combod23$HAPPY)[11] <- "10"

HELL <- levels(combod23$HAPPY)

### Look at HELL

HELL

combod23$HAPPY2[combod23$HAPPY %in% HELL[12:14] ] <- NA

##CHECK RESULT
table(combod23$HAPPY, combod23$HAPPY2)


## Eliminate the unused levels from HAPPY2
combod23$HAPPY2 <- factor(combod23$HAPPY2)
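### The same result is found with
## combod23$HAPPY2 <- combod23$HAPPY2[ , drop=TRUE]

## Use the "factor trick" to
## reset the variable back to numeric. HELL holds the relabeled
## levels ("0", ..., "10", "Refusal", ...); as.numeric() turns the
## non-numeric ones into NA (with a warning), and the integer codes
## of HAPPY2 pick out the number that belongs to each level.

combod23$HAPPYN <- as.numeric(HELL)[as.integer(combod23$HAPPY2)]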

##CHECK RESULT
table(combod23$HAPPY2, combod23$HAPPYN)

## CHECK by comparing against numeric data from spss
 table(combodn23$HAPPY, combod23$HAPPYN)
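
## The same idea on a tiny made-up factor (the object names happy,
## scores and happy_num are invented for illustration). This variant
## supplies the numeric score for each level directly instead of
## calling as.numeric() on the labels, so it avoids the coercion warning.

```r
# Toy factor with the same 14-level layout as HAPPY.
happy <- factor(c("Extremely unhappy", "3", "Extremely happy", "Refusal", "7"),
                levels = c("Extremely unhappy", 1:9, "Extremely happy",
                           "Refusal", "Don't know", "No answer"))

# One score per level, in level order; the last three levels become NA.
scores <- c(0:10, NA, NA, NA)

# The integer codes of the factor index into the score vector.
happy_num <- scores[as.integer(happy)]
```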




## Next, a student asks "how can I make that same recode
## on a lot of variables?" I'm going to have to leave
## that one unanswered.  I think the answer will probably
## be to get a list of variables, then use "lapply" to
## do the same thing to each variable in turn.  But
## I have not written up a simple, understandable example
## yet
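
## A possible starting point, sketched on made-up data. It assumes
## every variable to be recoded shares the same level layout as HAPPY;
## the helper recode_by_position, the toy data frame and the column
## OTHER are all hypothetical.

```r
# Hypothetical helper: map a factor to numbers by level position.
recode_by_position <- function(f, scores) {
  stopifnot(is.factor(f), length(scores) == nlevels(f))
  scores[as.integer(f)]
}

lev <- c("Extremely unhappy", 1:9, "Extremely happy",
         "Refusal", "Don't know", "No answer")

toy <- data.frame(
  HAPPY = factor(c("Extremely unhappy", "5", "Refusal"), levels = lev),
  OTHER = factor(c("9", "Extremely happy", "No answer"), levels = lev)
)

# Apply the same recode to each listed variable and store the results
# in new columns named HAPPYN and OTHERN.
vars <- c("HAPPY", "OTHER")
toy[paste0(vars, "N")] <- lapply(toy[vars], recode_by_position,
                                 scores = c(0:10, NA, NA, NA))
```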



## After the data is all recoded and homogenized, then we
## could run any analysis we want, and throw in the variable
## "whichSurvey" to see if there is a difference between the
## two surveys.

## Example, choose your y and x1 and x2, then

## mod <- lm(y~ (x1+x2)*whichSurvey, data=combod23)

## or if you think the difference is just in the intercept:

## mod <- lm(y~ x1+x2 + whichSurvey, data=combod23)
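
## A sketch of that comparison on simulated data (the data frame toy
## and its coefficients are invented; whichSurvey is stored as a
## factor here, which is an assumption and not what the script above
## does). anova() compares the intercept-only difference against the
## model where the slopes may differ by survey as well.

```r
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n),
                  x2 = rnorm(n),
                  whichSurvey = factor(rep(c(2, 3), each = n / 2)))
toy$y <- 1 + 0.5 * toy$x1 - 0.3 * toy$x2 +
  0.4 * (toy$whichSurvey == "3") + rnorm(n)

# Survey shifts only the intercept.
mod_int <- lm(y ~ x1 + x2 + whichSurvey, data = toy)
# Survey may also change the slopes of x1 and x2.
mod_full <- lm(y ~ (x1 + x2) * whichSurvey, data = toy)

# F test of the two extra interaction terms.
cmp <- anova(mod_int, mod_full)
```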


