[Rd] Option "installWithVers" seems to impact new.packages() badly?

2005-09-30 Thread A.J. Rossini
In R-devel, the SVN version built this morning around 10am Central European
Time, it looks like

   install.packages(new.packages(), installWithVers=TRUE)

seems to ignore the version information -- that is, it reinstalls
current versions of packages.

This did not happen before I used the "installWithVers=TRUE" option; that is,
I could use

   update.packages()
 and
   install.packages(new.packages())

to keep a current, platform-complete, installed base of CRAN.

best,
-tony

[EMAIL PROTECTED]
Muttenz, Switzerland.
"Commit early,commit often, and commit in a repository from which we can easily
roll-back your mistakes" (AJR, 4Jan05).

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] bug in gsub with perl=TRUE (PR#8164)

2005-09-30 Thread Richard . Mott
Full_Name: Richard Mott
Version: 2.0.1
OS: Linux toad 2.6.9 #4 SMP Mon Feb 21 16:20:16 GMT 2005 x86_64 AMD Opteron(tm) 
Processor 848 AuthenticAMD GNU/Linux
Submission from: (NULL) (129.67.46.247)


gsub with perl=TRUE does not work properly. It pads/truncates the resulting
string to the length of the input string:

my.formula <- "log10(Biochem.ALP)^2+1 ~ Family + GENDER"

> gsub("^.+~", "transformed.y ~", my.formula )
[1] "transformed.y ~ Family + GENDER"

> gsub("^.+~", "transformed.y ~", my.formula, perl=TRUE )
[1] "transformed.y ~ Family + GENDER\0\006\0\0\r\377\0\0\0"  # padded

 my.formula <- "Biochem.ALP ~ Family + GENDER"
> gsub("^.+~", "transformed.y ~", my.formula, perl=TRUE )
[1] "transformed.y ~ Family + GEND"  # truncated
> gsub("^.+~", "transformed.y ~", my.formula )
[1] "transformed.y ~ Family + GENDER"
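
A minimal sketch of how the report above could be turned into a check that
runs on any R version. The helper name check_gsub is hypothetical (it is not
part of the report); it only compares the default and perl=TRUE results for
the two formulas quoted above.

```r
# Hypothetical helper: run the same substitution with both regex engines
# and report whether the results agree.
check_gsub <- function(x, pattern = "^.+~", replacement = "transformed.y ~") {
  default <- gsub(pattern, replacement, x)
  perl    <- gsub(pattern, replacement, x, perl = TRUE)
  list(default = default, perl = perl, same = identical(default, perl))
}

long  <- check_gsub("log10(Biochem.ALP)^2+1 ~ Family + GENDER")
short <- check_gsub("Biochem.ALP ~ Family + GENDER")

# On a version without the bug, both engines should give the same string.
long$same
short$same
```

If either value is FALSE, the padded or truncated string is available in the
perl element for inspection.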



Re: [Rd] bug in gsub with perl=TRUE (PR#8164)

2005-09-30 Thread Uwe Ligges
[EMAIL PROTECTED] wrote:

> Full_Name: Richard Mott
> Version: 2.0.1

This version is completely outdated.
Please try with a *recent* version of R when reporting bugs, in this 
case R-2.2.0 beta (or in the worst case R-2.1.1, the current release).

The bug reported below was fixed some months ago ...

Uwe Ligges

> OS: Linux toad 2.6.9 #4 SMP Mon Feb 21 16:20:16 GMT 2005 x86_64 AMD 
> Opteron(tm) Processor 848 AuthenticAMD GNU/Linux
> Submission from: (NULL) (129.67.46.247)
> 
> 
> gsub with perl=TRUE does not work properly. It pads/truncates the resulting
> string to
> the length of the input string: 
> 
> my.formula <- "log10(Biochem.ALP)^2+1 ~ Family + GENDER"
> 
> 
>>gsub("^.+~", "transformed.y ~", my.formula )
> 
> [1] "transformed.y ~ Family + GENDER"
> 
> 
>>gsub("^.+~", "transformed.y ~", my.formula, perl=TRUE )
> 
> [1] "transformed.y ~ Family + GENDER\0\006\0\0\r\377\0\0\0"  # padded
> 
>  my.formula <- "Biochem.ALP ~ Family + GENDER"
> 
>>gsub("^.+~", "transformed.y ~", my.formula, perl=TRUE )
> 
> [1] "transformed.y ~ Family + GEND"  # truncated
> 
>>gsub("^.+~", "transformed.y ~", my.formula )
> 
> [1] "transformed.y ~ Family + GENDER"
> 


[Rd] Subscripting fails if name of element is "" (PR#8161)

2005-09-30 Thread Jens Oehlschlägel
Dear all,

I am resending this mail because it was blocked: I submitted a bug from the
r-bug webpage and hypatia seems to block mail that is sent from a different IP
than that usually associated with the email. It looks like it is currently
impossible to correctly submit bugs from the website. However, here is the
original bug report:

(PR#8161)

Dear all,

The following shows cases where accessing elements via their name fails (if
the name is a string of length zero).

Best regards


Jens Oehlschlägel


> p <- 1:3
> names(p) <- c("a","", as.character(NA))
> p
   a      <NA> 
   1    2    3 
> 
> for (i in names(p))
+ print(p[[i]])
[1] 1
[1] 2
[1] 3
> 
> # error 1: vector subscripting with "" fails in second element
> for (i in names(p))
+ print(p[i])
a 
1 
<NA> 
  NA 
<NA> 
   3 
> 
> # error 2: print method for list shows no name for second element
> p <- as.list(p)
> 
> 
> for (i in names(p))
+ print(p[[i]])
[1] 1
[1] 2
[1] 3
> 
> # error 3: list subscripting with "" fails in second element
> for (i in names(p))
+ print(p[i])
$a
[1] 1

$"NA"
NULL

$"NA"
[1] 3

> 
> version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    1.1
year     2005
month    06
day      20
language R




# -- replication code --

p <- 1:3
names(p) <- c("a","", as.character(NA))
p

for (i in names(p))
 print(p[[i]])
 
# error 1: vector subscripting with "" fails in second element
for (i in names(p))
 print(p[i])

# error 2: print method for list shows no name for second element
p <- as.list(p)


for (i in names(p))
 print(p[[i]])
 
# error 3: list subscripting with "" fails in second element
for (i in names(p))
 print(p[i])






Re: [Rd] Summary of translation status

2005-09-30 Thread Duncan Murdoch
On 9/28/2005 11:50 PM, Fernando Henrique Ferraz P. da Rosa wrote:
> Dear R-devel & Translation Teams,
> 
> In order to monitor the progress of the translation for the
> pt_BR team I wrote a script to summarize the status of the translations.
> It wasn't difficult to extend it to the other languages so I decided to
> set up a page with the summaries of the translation for all languages
> for which currently exist a translation. 
> 
> http://www.ime.usp.br/~feferraz/en/rtransstat.html
> 
> If any of you find it useful I can keep it updated on a regular basis
> (daily or weekly). 
> 
> 
> Thank you,
> 
> 
> (PS: I'm resending this message because it didn't get through the
> filter the first time. Sorry for the inconvenience for those that
> are receiving it more than one time).

Hi Fernando.  That's a nice page.  I'd add an explicit statement about 
which branch the statistics apply to.  You say "Statistics based on SVN: 
35706", presumably on the trunk, but soon interest will shift to the 
R-2-2-patches branch.  (If this is automated and you have the disk space 
for both, perhaps both trunk and the current patch branch could be 
listed, but I expect the statistics will be very similar.)

Duncan Murdoch



Re: [Rd] Subscripting fails if name of element is "" (PR#8161)

2005-09-30 Thread Thomas Lumley

On Fri, 30 Sep 2005, "Jens Oehlschlägel" wrote:

> Dear all,
>
> The following shows cases where accessing elements via their name fails (if
> the name is a string of length zero).



This looks deliberate (there is a function NonNullStringMatch that does 
the matching).  I assume this is because there is no other way to 
indicate that an element has no name.


If so, it is a documentation bug -- help(names) and FAQ 7.14 should 
specify this behaviour.  Too late for 2.2.0, unfortunately.


-thomas
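
A small sketch of the behaviour described above and one possible workaround,
assuming the goal is to reach the element whose name is "". The variable names
by_name and by_pos are illustrative only; the workaround looks the position up
explicitly instead of relying on name matching.

```r
p <- 1:3
names(p) <- c("a", "", NA)

# Name matching: the empty string does not match any element,
# so the result is a missing value rather than the second element.
by_name <- p[""]

# Workaround: find the position of the empty name and index by position.
# %in% is used rather than == so that the NA name yields FALSE, not NA.
by_pos <- p[which(names(p) %in% "")]
```

Indexing by position sidesteps the name-matching rule entirely, which may be
the simplest option when empty names cannot be avoided.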







Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle


[Rd] Compiling R on OpenSolaris

2005-09-30 Thread Vincent Yau
Hi:

I am trying to compile R (v2.1.1) on OpenSolaris (build 22) using gcc
version 3.4.3, but I am getting this error message:

gmake[5]: Entering directory `/usr/local/R-2.1.1/src/library/tools/src'
../../../../library/tools/libs/tools.so is unchanged
gmake[5]: Leaving directory `/usr/local/R-2.1.1/src/library/tools/src'
gmake[4]: Leaving directory `/usr/local/R-2.1.1/src/library/tools/src'
Error in dyn.load(x, as.logical(local), as.logical(now)) :
unable to load shared library '/usr/local/R-2.1.1/library/tools/libs/tools.so':
ld.so.1: R: fatal: relocation error: R_AMD64_PC32: file
/usr/local/R-2.1.1/library/tools/libs/tools.so:
symbol main: value 0x28001413f04 does not fit
Execution halted
gmake[3]: *** [all] Error 1
gmake[3]: Leaving directory `/usr/local/R-2.1.1/src/library/tools'

I am on an Athlon64 box, so I put in -m64 -mtune=athlon64 as the compiler
options.
Any help on how I can fix this problem would be much appreciated.

thanks

---Vincent



[Rd] by() processing on a dataframe

2005-09-30 Thread Duncan Murdoch
I want to calculate a statistic on a number of subgroups of a dataframe, 
then put the results into a dataframe.  (What SAS PROC MEANS does, I 
think, though it's been years since I used it.)

This is possible using by(), but it seems cumbersome and fragile.  Is 
there a more straightforward way than this?

Here's a simple example showing my current strategy:

 > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
c(2,2,2,2)), value = rnorm(8))
 > dataset
  gp1 gp2      value
1   1   1  0.9493232
2   1   1 -0.0474712
3   1   2 -0.6808021
4   1   2  1.9894999
5   2   3  2.0154786
6   2   3  0.4333056
7   2   4 -0.4746228
8   2   4  0.6017522
 >
 > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
 >
 > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
 >
 > result <- do.call('rbind', bylist)
 > result
   gp1 gp2  statistic
1    1   1 0.45092598
11   1   2 0.65434890
12   2   3 1.22439210
13   2   4 0.06356469

tapply() is inappropriate because I don't have all possible combinations 
of gp1 and gp2 values, only some of them:

 > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
         1         2        3          4
1 0.450926 0.6543489       NA         NA
2       NA        NA 1.224392 0.06356469



In the real case, I only have a very sparse subset of all the 
combinations, and tapply() and by() both die for lack of memory.
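
To make the size problem concrete, here is a small sketch on the toy data
(regenerated with an arbitrary seed, so the values differ from those shown
above). It compares the number of cells tapply() allocates with the number of
groups that actually occur; the variable names are illustrative only.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dataset <- data.frame(gp1 = rep(1:2, c(4, 4)),
                      gp2 = rep(1:4, c(2, 2, 2, 2)),
                      value = rnorm(8))

grid <- tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)

n_cells    <- length(grid)                            # full product of levels
n_observed <- nrow(unique(dataset[c("gp1", "gp2")]))  # groups that occur
n_empty    <- sum(is.na(grid))                        # cells with no data
```

The cell count grows with the product of the numbers of levels, while the
observed groups can never exceed the number of rows.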

Any suggestions on how to do what I want, without using SAS?

Duncan Murdoch



Re: [Rd] by() processing on a dataframe

2005-09-30 Thread hadley wickham
I'm not entirely sure what you want, but maybe this does the trick?

data.frame.by <- function(data, variables, fun, ...) {
  # No grouping variables: apply fun once and wrap the result in a list column
  if (length(variables) == 0) {
    df <- data.frame(results = 0)
    df$results <- list(fun(data$value, ...))
    return(df)
  }

  # Sort the data itself so the group index below lines up with its rows
  data <- sort.df(data, variables)
  sorted <- data[, variables, drop = FALSE]
  duplicates <- duplicated(sorted)
  index <- cumsum(!duplicates)

  results <- by(data, index, fun, ...)

  cols <- sorted[!duplicates, , drop = FALSE]
  cols$results <- array(results)
  cols
}


# Order the rows of a data frame by the given columns
sort.df <- function(data, vars) {
  data[do.call("order", data[, vars, drop = FALSE]), , drop = FALSE]
}


dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))

data.frame.by(dataset, c("gp1", "gp2"), function(data) mean(data$value))
data.frame.by(dataset, "gp1", function(data) tapply(data$value, data$gp2, mean))
data.frame.by(dataset, "gp1", function(data) lm(gp2 ~ value, data)) # doesn't print, but everything is there ok

(note that the results column will be a list if necessary - this may
be a serious abuse of data frames, but I'm not sure and no one replied
when I queried the list)

Hadley



Re: [Rd] by() processing on a dataframe

2005-09-30 Thread Peter Dalgaard
Duncan Murdoch <[EMAIL PROTECTED]> writes:

> I want to calculate a statistic on a number of subgroups of a dataframe, 
> then put the results into a dataframe.  (What SAS PROC MEANS does, I 
> think, though it's been years since I used it.)
> 
> This is possible using by(), but it seems cumbersome and fragile.  Is 
> there a more straightforward way than this?
> 
> Here's a simple example showing my current strategy:
> 
>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
> c(2,2,2,2)), value = rnorm(8))
>  > dataset
>   gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
>  >
>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>  >
>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>  >
>  > result <- do.call('rbind', bylist)
>  > result
>    gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
> 
> tapply() is inappropriate because I don't have all possible combinations 
> of gp1 and gp2 values, only some of them:
> 
>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>          1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
> 
> 
> 
> In the real case, I only have a very sparse subset of all the 
> combinations, and tapply() and by() both die for lack of memory.
> 
> Any suggestions on how to do what I want, without using SAS?

Have you tried aggregate()?

Alternatively, you might split on interaction(, drop=TRUE)
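
A minimal sketch of the second suggestion, assuming the toy data from the
question (regenerated here with an arbitrary seed). The names groups, pieces
and result are illustrative; the per-group function mirrors handleonegroup
from the original post.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dataset <- data.frame(gp1 = rep(1:2, c(4, 4)),
                      gp2 = rep(1:4, c(2, 2, 2, 2)),
                      value = rnorm(8))

# One factor for the combined grouping; drop = TRUE discards unused levels
groups <- interaction(dataset$gp1, dataset$gp2, drop = TRUE)

# Split into one data frame per observed group and summarise each piece
pieces <- split(dataset, groups)
result <- do.call(rbind, lapply(pieces, function(d)
  data.frame(gp1 = d$gp1[1], gp2 = d$gp2[1], statistic = mean(d$value))))
```

Only the groups that occur end up in the result, one row per group.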

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907



Re: [Rd] by() processing on a dataframe

2005-09-30 Thread Gabor Grothendieck
Check out summaryBy in the doBy package at:

   http://genetics.agrsci.dk/~sorenh/misc

e.g.

   summaryBy(value ~ gp1 + gp2, data = dataset)



On 9/30/05, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> I want to calculate a statistic on a number of subgroups of a dataframe,
> then put the results into a dataframe.  (What SAS PROC MEANS does, I
> think, though it's been years since I used it.)
>
> This is possible using by(), but it seems cumbersome and fragile.  Is
> there a more straightforward way than this?
>
> Here's a simple example showing my current strategy:
>
>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
> c(2,2,2,2)), value = rnorm(8))
>  > dataset
>   gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
>  >
>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>  >
>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>  >
>  > result <- do.call('rbind', bylist)
>  > result
>    gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
>
> tapply() is inappropriate because I don't have all possible combinations
> of gp1 and gp2 values, only some of them:
>
>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>          1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
>
>
>
> In the real case, I only have a very sparse subset of all the
> combinations, and tapply() and by() both die for lack of memory.
>
> Any suggestions on how to do what I want, without using SAS?
>
> Duncan Murdoch
>


Re: [Rd] by() processing on a dataframe

2005-09-30 Thread Marc Schwartz (via MN)
On Fri, 2005-09-30 at 13:22 -0400, Duncan Murdoch wrote:
> I want to calculate a statistic on a number of subgroups of a dataframe, 
> then put the results into a dataframe.  (What SAS PROC MEANS does, I 
> think, though it's been years since I used it.)
> 
> This is possible using by(), but it seems cumbersome and fragile.  Is 
> there a more straightforward way than this?
> 
> Here's a simple example showing my current strategy:
> 
>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
> c(2,2,2,2)), value = rnorm(8))
>  > dataset
>   gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
>  >
>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>  >
>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>  >
>  > result <- do.call('rbind', bylist)
>  > result
>    gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
> 
> tapply() is inappropriate because I don't have all possible combinations 
> of gp1 and gp2 values, only some of them:
> 
>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>          1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
> 
> 
> 
> In the real case, I only have a very sparse subset of all the 
> combinations, and tapply() and by() both die for lack of memory.
> 
> Any suggestions on how to do what I want, without using SAS?
> 
> Duncan Murdoch

Duncan,

Does this do what you want?

> set.seed(1)
 
> df <- data.frame(gp1 = rep(1:2, c(4,4)), 
   gp2 = rep(1:4, c(2,2,2,2)), 
   value = rnorm(8))
 
> df
  gp1 gp2      value
1   1   1 -0.6264538
2   1   1  0.1836433
3   1   2 -0.8356286
4   1   2  1.5952808
5   2   3  0.3295078
6   2   3 -0.8204684
7   2   4  0.4874291
8   2   4  0.7383247

> means <- aggregate(df$value, list(gp1 = df$gp1, gp2 = df$gp2), mean)
 
> means
  gp1 gp2          x
1   1   1 -0.2214052
2   1   2  0.3798261
3   2   3 -0.2454803
4   2   4  0.6128769


> merge(df, means, by = c("gp1", "gp2"))
  gp1 gp2      value          x
1   1   1 -0.6264538 -0.2214052
2   1   1  0.1836433 -0.2214052
3   1   2 -0.8356286  0.3798261
4   1   2  1.5952808  0.3798261
5   2   3  0.3295078 -0.2454803
6   2   3 -0.8204684 -0.2454803
7   2   4  0.4874291  0.6128769
8   2   4  0.7383247  0.6128769
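
For comparison, a sketch of the same summary using the formula interface of
aggregate(). This assumes a newer R than the versions discussed in this thread
(the formula method was added to aggregate() in later releases), and it names
the result column after the response instead of x.

```r
set.seed(1)
df <- data.frame(gp1 = rep(1:2, c(4, 4)),
                 gp2 = rep(1:4, c(2, 2, 2, 2)),
                 value = rnorm(8))

# Mean of value within each observed combination of gp1 and gp2
means <- aggregate(value ~ gp1 + gp2, data = df, FUN = mean)
```

Only the value column is summarised here, so the other columns of a wider
data frame would be left alone.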


HTH,

Marc Schwartz



Re: [Rd] by() processing on a dataframe

2005-09-30 Thread Duncan Murdoch
On 9/30/2005 1:41 PM, Peter Dalgaard wrote:
> Duncan Murdoch <[EMAIL PROTECTED]> writes:
> 
>> I want to calculate a statistic on a number of subgroups of a dataframe, 
>> then put the results into a dataframe.  (What SAS PROC MEANS does, I 
>> think, though it's been years since I used it.)
>> 
>> This is possible using by(), but it seems cumbersome and fragile.  Is 
>> there a more straightforward way than this?
>> 
>> Here's a simple example showing my current strategy:
>> 
>>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
>> c(2,2,2,2)), value = rnorm(8))
>>  > dataset
>>   gp1 gp2      value
>> 1   1   1  0.9493232
>> 2   1   1 -0.0474712
>> 3   1   2 -0.6808021
>> 4   1   2  1.9894999
>> 5   2   3  2.0154786
>> 6   2   3  0.4333056
>> 7   2   4 -0.4746228
>> 8   2   4  0.6017522
>>  >
>>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
>> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>>  >
>>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>>  >
>>  > result <- do.call('rbind', bylist)
>>  > result
>>    gp1 gp2  statistic
>> 1    1   1 0.45092598
>> 11   1   2 0.65434890
>> 12   2   3 1.22439210
>> 13   2   4 0.06356469
>> 
>> tapply() is inappropriate because I don't have all possible combinations 
>> of gp1 and gp2 values, only some of them:
>> 
>>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>>          1         2        3          4
>> 1 0.450926 0.6543489       NA         NA
>> 2       NA        NA 1.224392 0.06356469
>> 
>> 
>> 
>> In the real case, I only have a very sparse subset of all the 
>> combinations, and tapply() and by() both die for lack of memory.
>> 
>> Any suggestions on how to do what I want, without using SAS?
> 
> Have you tried aggregate()?

aggregate() has a few problems:

  - it applies the function to every column in the dataframe.  In my 
case it only makes sense to apply it to some of them.  (This may not be 
a killer, but it certainly makes things inefficient and tricky.)
  - I'd like to look at the whole subset to figure out the function (but 
I can probably work around this)
  - It uses too much memory.  E.g. try

 > df <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000), 
w=rnorm(1000))
 > aggregate(df, list(df$x,df$y,df$z), mean)
Error: cannot allocate vector of size 3906250 Kb
In addition: Warning messages:
1: Reached total allocation of 1007Mb: see help(memory.size)
2: Reached total allocation of 1007Mb: see help(memory.size)

This should have returned the same dataframe (there are 1000 subsets), 
but it tried to construct a billion of them.

On 9/30/2005 1:48 PM, Don MacQueen wrote:
 > Look at the summarize() function in the Hmisc package.

It seems to want a matrix, not a data.frame.  The real situation has 
mixed types (character, factors, numeric) so it can't be a matrix.

 > (and this is an r-help question, not an r-devel question, I would 
think)

Yes, that's where I should have posted.  Sorry.  However, this is 
starting to look like a development problem...

Peter again:

> Alternatively, you might split on interaction(, drop=TRUE)

Looking at the code, it appears that will construct the full product 
interaction, then subset to the non-empty cases... Yes, it does that.

Looks like I'll have to write my own.
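
One way such a function could look, as a sketch only: it builds the group
index from the combinations that actually occur, so nothing proportional to
the full product of levels is allocated. The name sparse_by and its arguments
are hypothetical, and it assumes the grouping values survive conversion to
character by paste() without collisions.

```r
# Hypothetical helper: apply fun to each observed group of rows and return
# one row per group with the grouping columns and a "statistic" column.
sparse_by <- function(data, variables, fun) {
  # One string key per row, built from the grouping columns
  key   <- do.call(paste, c(unname(as.list(data[variables])), sep = "\r"))
  first <- !duplicated(key)          # first row of each observed group
  index <- match(key, key[first])    # group number for every row

  out <- data[first, variables, drop = FALSE]
  out$statistic <- unname(vapply(split(data, index), fun, numeric(1)))
  rownames(out) <- NULL
  out
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000),
                 w = rnorm(1000))
res <- sparse_by(df, c("x", "y", "z"), function(d) mean(d$w))
```

On this example every row is its own group, so res has 1000 rows and the
statistic column equals w.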

Duncan



Re: [Rd] by() processing on a dataframe

2005-09-30 Thread Duncan Murdoch
On 9/30/2005 1:41 PM, hadley wickham wrote:
> I'm not entirely sure what you want, but maybe this does the trick?
> 
> data.frame.by <- function(data, variables, fun, ...) {
>   if (length(variables) == 0 ) {
>   df <- data.frame(results = 0)
>   df$results <- list(fun(data$value, ...))
>   return(df)
>   }
> 
>   sorted <- sort.df(data, variables)[,c(variables), drop=FALSE]
>   duplicates <- duplicated(sorted[,variables, drop=FALSE])
>   index <- cumsum(!duplicates)
> 
>   results <- by(data, index, fun, ...)
> 
>   cols <- sorted[!duplicates,variables, drop=FALSE]
>   cols$results <- array(results)
>   cols
> }
> 
> 
> sort.df <- function(data, vars) {
>   data[do.call("order", data[,vars, drop=FALSE]), ,drop=FALSE]
> }
> 
> 
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
> c(2,2,2,2)), value = rnorm(8))
> 
> data.frame.by(dataset, c("gp1", "gp2"), function(data) mean(data$value))
> data.frame.by(dataset, "gp1", function(data) tapply(data$value, data$gp2, 
> mean))
> data.frame.by(dataset, "gp1", function(data) lm(gp2 ~ value, data)) #
> doesn't print, but everything is there ok
> 
> (note that the results column will be a list if necessary - this may
> be a serious abuse of data frames, but I'm not sure and no one replied
> when I queried the list)

I think this should work.  Thanks!

Duncan Murdoch



Re: [Rd] by() processing on a dataframe

2005-09-30 Thread Gabor Grothendieck
And here is one more approach using the reshape package:

library(reshape)

dataset.d <- melt(dataset, id = 1:2)
cast(dataset.d, gp1 + gp2 ~ variable, mean)


On 9/30/05, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> Check out summaryBy in the doBy package at:
>
>   http://genetics.agrsci.dk/~sorenh/misc
>
> e.g.
>
>   summaryBy(value ~ gp1 + gp2, data = dataset)
>
>
>
> On 9/30/05, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> > I want to calculate a statistic on a number of subgroups of a dataframe,
> > then put the results into a dataframe.  (What SAS PROC MEANS does, I
> > think, though it's been years since I used it.)
> >
> > This is possible using by(), but it seems cumbersome and fragile.  Is
> > there a more straightforward way than this?
> >
> > Here's a simple example showing my current strategy:
> >
> >  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
> > c(2,2,2,2)), value = rnorm(8))
> >  > dataset
> >   gp1 gp2      value
> > 1   1   1  0.9493232
> > 2   1   1 -0.0474712
> > 3   1   2 -0.6808021
> > 4   1   2  1.9894999
> > 5   2   3  2.0154786
> > 6   2   3  0.4333056
> > 7   2   4 -0.4746228
> > 8   2   4  0.6017522
> >  >
> >  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> > + gp2 = subset$gp2[1], statistic = mean(subset$value))
> >  >
> >  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
> >  >
> >  > result <- do.call('rbind', bylist)
> >  > result
> >    gp1 gp2  statistic
> > 1    1   1 0.45092598
> > 11   1   2 0.65434890
> > 12   2   3 1.22439210
> > 13   2   4 0.06356469
> >
> > tapply() is inappropriate because I don't have all possible combinations
> > of gp1 and gp2 values, only some of them:
> >
> >  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
> >          1         2        3          4
> > 1 0.450926 0.6543489       NA         NA
> > 2       NA        NA 1.224392 0.06356469
> >
> >
> >
> > In the real case, I only have a very sparse subset of all the
> > combinations, and tapply() and by() both die for lack of memory.
> >
> > Any suggestions on how to do what I want, without using SAS?
> >
> > Duncan Murdoch
> >


Re: [Rd] Summary of translation status

2005-09-30 Thread Fernando Henrique Ferraz P. da Rosa
Duncan Murdoch writes:
> 
> Hi Fernando.  That's a nice page.  I'd add an explicit statement about 
> which branch the statistics apply to.  You say "Statistics based on SVN: 
> 35706", presumably on the trunk, but soon interest will shift to the 
> R-2-2-patches branch.  (If this is automated and you have the disk space 
> for both, perhaps both trunk and the current patch branch could be 
> listed, but I expect the statistics will be very similar.)
> 
> Duncan Murdoch

Hmm, that's true. I'm using the trunk branch. The process is
somewhat automated, but I'd have to keep a working copy of the patch
branch in order to run the status on it. Anyway, I also think the statistics
will probably be the same. Every time I submit a translation I see it appear
on trunk, so I think it won't matter much.


--
Fernando Henrique Ferraz P. da Rosa
http://www.feferraz.net
