[R] Transform a list of multiple to a data.frame which I want

2015-02-01 Thread Yao He
Dear all:

I have a list like the one below, which is standard output from the
str_locate_all() function (stringr package):
$K
   start end
$GSEGTCSCSSK
   start end
[1,] 6   6
[2,] 8   8
$GFSTTCPAHVDDLTPEQVLDGDVNELMDVVLHHVPEAK
   start end
[1,] 6   6
$LVECIGQELIFLLPNK
   start end
[1,] 4   4
$NFK
   start end
$HR
   start end
$AYASLFR
   start end

I want to transform this list into a data.frame like this:

ID                                      start.1  start.2
K                                       NA       NA
GSEGTCSCSSK                             6        8
GFSTTCPAHVDDLTPEQVLDGDVNELMDVVLHHVPEAK  6        NA
LVECIGQELIFLLPNK                        4        NA
NFK                                     NA       NA
HR                                      NA       NA
AYASLFR                                 NA       NA

I have already tried t() and lapply(), but I find it hard to handle the
NA values and the different number of rows in each matrix.

Thanks in advance
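
One possible approach, sketched in base R: pad every "start" column with NA up to the largest number of matches, then bind the padded vectors row-wise. The list `loc` below is a hand-built stand-in for part of the str_locate_all() output shown above, and the names `loc`, `max_hits` and `res` are made up for this illustration.

```r
# Hand-built stand-in for part of the str_locate_all() output (assumption)
cols <- list(NULL, c("start", "end"))
loc <- list(
  K                = matrix(integer(0), ncol = 2, dimnames = cols),
  GSEGTCSCSSK      = matrix(c(6L, 8L, 6L, 8L), ncol = 2, dimnames = cols),
  LVECIGQELIFLLPNK = matrix(c(4L, 4L), ncol = 2, dimnames = cols)
)

# Largest number of matches found in any element
max_hits <- max(vapply(loc, nrow, integer(1)))

# Indexing past the end of a vector yields NA, which pads short results
starts <- do.call(rbind, lapply(loc, function(m) {
  as.integer(m[, "start"])[seq_len(max_hits)]
}))
colnames(starts) <- paste0("start.", seq_len(max_hits))

res <- data.frame(ID = names(loc), starts, stringsAsFactors = FALSE)
rownames(res) <- NULL
res
```

The same padding trick should work for the "end" column if it is needed as well.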

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges

2016-04-05 Thread Yao He
I have a bunch of genes (nearly ~5) from the whole genome, which I
read in as genomic ranges.

A range (gene) can be seen as an observation with three columns:
chromosome, start and end, like this:

      seqnames start end width strand
gene1     chr1     1   5     5      +
gene2     chr1    10  15     6      +
gene3     chr1    12  17     6      +
gene4     chr1    20  25     6      +
gene5     chr1    30  40    11      +

I am wondering whether there is an efficient way to find the *overlapping,
upstream and downstream genes for each gene in the GRanges object*.

For example, assuming all_genes_gr is a genomic range of ~5 genes, the
result I want looks like this:

gene_name  upstream_gene  downstream_gene  overlapped_gene
gene1      NA             gene2            NA
gene2      gene1          gene4            gene3
gene3      gene1          gene4            gene2
gene4      gene3          gene5            NA

Currently, the strategy I use is this:

library(GenomicRanges)

find_overlapped_gene <- function(idx, all_genes_gr) {
  #cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  n <- countOverlaps(curr_gene, other_genes)
  gene <- subsetByOverlaps(curr_gene, other_genes)
  return(list(n, gene))
}

system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))

However, for 100 genes this takes nearly 8 s according to system.time().
That means that with 5 genes it would take nearly one hour just to find
the overlapping genes.

I am wondering whether there is an algorithm or strategy that does this
efficiently, perhaps 5 genes in ~10 min or even less.

Yao He
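
A possible vectorised alternative, as a sketch only: GenomicRanges can compare the whole object against itself in one call instead of looping gene by gene. The toy `gr` below rebuilds the five example genes from this post; `findOverlaps()`, `precede()` and `follow()` are GenomicRanges functions, while the object names and the comma-separated output format are assumptions made for the illustration.

```r
library(GenomicRanges)

# Toy version of the five example genes above
gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 10, 12, 20, 30),
                     end   = c(5, 15, 17, 25, 40)),
  strand   = "+"
)
names(gr) <- paste0("gene", 1:5)

# All overlapping pairs in a single call; drop each gene's hit on itself
hits <- findOverlaps(gr, gr)
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collapse the overlapping partners of each gene into one string
overlapped <- rep(NA_character_, length(gr))
partners <- split(names(gr)[subjectHits(hits)], queryHits(hits))
overlapped[as.integer(names(partners))] <-
  vapply(partners, paste, character(1), collapse = ",")

# follow(): nearest non-overlapping gene upstream of each gene
# precede(): nearest non-overlapping gene downstream of each gene
res <- data.frame(
  gene_name       = names(gr),
  upstream_gene   = names(gr)[follow(gr, gr)],
  downstream_gene = names(gr)[precede(gr, gr)],
  overlapped_gene = overlapped,
  stringsAsFactors = FALSE
)
res
```

Because each of the three queries runs once over all ranges, the per-gene subsetting that dominates the loop above is avoided; how much faster this is on a whole genome has not been measured here.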


[R] how to read different files into different objects in one time?

2012-12-19 Thread Yao He
Dear All

I have a lot of files in a directory as follows:
"02-03.txt"   "03-04.txt"   "04-05.txt"   "05-06.txt"   "06-07.txt"
"07-08.txt"   "08-09.txt"
 "09-10.txt"   "G0.txt"  "G1.txt"  "raw_ped.txt"
..

I want to read them into different objects named after their filenames, such as:
`02-03` <- read.table("02-03.txt", header = TRUE)
`03-04` <- read.table("03-04.txt", header = TRUE)
I don't want to type hundreds of read.table() calls, so how can I read
them all at once?
I think the core problem is that I can't create differently named objects
inside a loop or sapply(), but there may be a better way to do what I want.

Thanks a lot

Yao He
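
A common pattern for this, sketched here under assumptions: read every file into one named list instead of creating hundreds of separate objects. The demo writes two small files into a temporary directory so that it runs anywhere; the directory, file contents and the name `tables` are invented for the illustration.

```r
# Create two small demo files in a temporary directory (assumption: the real
# files are whitespace-delimited tables with a header row)
demo_dir <- file.path(tempdir(), "read_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(x = 1:3), file.path(demo_dir, "02-03.txt"), row.names = FALSE)
write.table(data.frame(x = 4:5), file.path(demo_dir, "G0.txt"), row.names = FALSE)

# List the files, read each one, and name the results after the files
files  <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
tables <- lapply(files, read.table, header = TRUE)
names(tables) <- tools::file_path_sans_ext(basename(files))

# Each table is then available by its file name
tables[["02-03"]]
```

A list keeps the tables together, so the same operation can later be applied to all of them with lapply(); names such as 02-03 would otherwise need backticks as standalone objects.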



-- 
—
Master candidate in 2nd year
Department of Animal genetics & breeding
Room 436,College of Animal Science&Technology,
China Agriculture University,Beijing,100193
E-mail: yao.h.1...@gmail.com
——



[R] how to aggregate T-test result in an elegant way?

2013-01-06 Thread Yao He
Dear all:

Plan 1:
I want to run several t-tests on the means of different variables in a
loop, so I want to collect all the results in one object and then dump()
them to a text file. But I don't know how to append a t-test result to
the object.

I have already drawn the barplot, and I would like to know an elegant way
to report the raw results.
Can anybody give me some advice?

Yao He


Re: [R] how to aggregate T-test result in an elegant way?

2013-01-06 Thread Yao He
Thank you, your help is always really useful.

I didn't provide any example data because I thought this was just a
question of how to report t.test() results in R.
However, as you say, it is better to show more details in order to find an
elegant way.

In fact, I generated a 3-dimensional array like this:
str(a)
 num [1:2, 1:245, 1:3] 47.5 NA 48.9 NA 47.5 ...
 - attr(*, "dimnames")=List of 3
  ..$ : chr [1:2] "13%" "21%"
  ..$ : chr [1:245] "TWF2H101" "TWF2H105" "TWF2H106" "TWF2H110" ...
  ..$ : chr [1:3] "EW.INCU" "EW.17.5" "EMW"

I want to do a two-sample t-test of the means between 13% and 21% for each
variable: "EW.INCU", "EW.17.5" and "EMW".

So I tried this code:
variable <- dimnames(a)[[3]]
O2 <- dimnames(a)[[1]]
for (i in variable) {
  print(i)
  print(O2[1])
  print(O2[2])
  # t.test() drops NA values itself; it has no na.rm argument
  print(t.test(a[O2[1], , i], a[O2[2], , i]))
}

I don't think this is an elegant way, and I am inexperienced at reporting
raw results.
Could you give me more help?

Yao He
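
One way to collect the results rather than print them, as a minimal sketch: build one data.frame row per variable and bind the rows together. The array `a` below is random toy data with the same layout as the real one; the column names of `res` and the output file name are choices made for this illustration.

```r
# Toy array with the same layout as the real one: O2 level x animal x variable
set.seed(1)
a <- array(rnorm(2 * 10 * 3, mean = 45, sd = 3), dim = c(2, 10, 3),
           dimnames = list(c("13%", "21%"),
                           paste0("ID", 1:10),
                           c("EW.INCU", "EW.17.5", "EMW")))
a["21%", 1, "EW.INCU"] <- NA  # t.test() drops missing values on its own

# One row of summary statistics per variable
res <- do.call(rbind, lapply(dimnames(a)[[3]], function(v) {
  tt <- t.test(a["13%", , v], a["21%", , v])
  data.frame(variable = v,
             mean13   = tt$estimate[[1]],
             mean21   = tt$estimate[[2]],
             t        = tt$statistic[[1]],
             df       = tt$parameter[[1]],
             p.value  = tt$p.value,
             stringsAsFactors = FALSE)
}))
res

# A plain table is easier to reuse than dump() output
write.csv(res, file.path(tempdir(), "ttest_results.csv"), row.names = FALSE)
```

Each element of the object returned by t.test() (estimate, statistic, parameter, p.value, conf.int) can be added as a further column in the same way.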

2013/1/7 arun :
> Hi,
> You didn't provide any example data.  So, I am not sure whether this helps.
>
> set.seed(15)
> dat1<-data.frame(A=sample(10:20,5,replace=TRUE),B=sample(18:28,5,replace=TRUE),C=sample(25:35,5,replace=TRUE),D=sample(20:30,5,replace=TRUE))
> dat2<-dat1[,-1] # not in the archived post; assumed from the AB/AC/AD labels below
> res<-lapply(lapply(seq_len(ncol(dat2)),function(i) t.test(dat2[,i],dat1[,1],paired=TRUE)),function(x) data.frame(meanDiff=x$estimate,p.value=x$p.value))# paired
> names(res)<-paste("A",LETTERS[2:4],sep="")
> res<- do.call(rbind,res)
> res
>   # meanDiff p.value
> #AB  9.4 0.021389577
> #AC 15.0 0.002570261
> #AD 10.6 0.003971604
>
>
> #or
> res1<-lapply(lapply(seq_len(ncol(dat2)),function(i) t.test(dat2[,i],dat1[,1],paired=FALSE)),function(x) data.frame(mean=x$estimate,p.value=x$p.value))
> names(res1)<-paste("A",LETTERS[2:4],sep="")
> res1<-do.call(rbind,res1)
> row.names(res1)[grep("mean of y",row.names(res1))]<-gsub("(.*\\.).*","\\1A",row.names(res1)[grep("mean of y",row.names(res1))])
> row.names(res1)[grep("mean of x",row.names(res1))]<-gsub("(\\w)(\\w)(\\.).*","\\1\\2\\3\\2",row.names(res1)[grep("mean of x",row.names(res1))])
> res1
> # mean  p.value
> #AB.B 25.2 1.299192e-03
> #AB.A 15.8 1.299192e-03
> #AC.C 30.8 5.145519e-05
> #AC.A 15.8 5.145519e-05
> #AD.D 26.4 1.381339e-03
> #AD.A 15.8 1.381339e-03
>
>
> A.K.
>
>
>





Re: [R] how to aggregate T-test result in an elegant way?

2013-01-07 Thread Yao He
Hi, arun
I'm sorry that my message wasn't helpful.
One question is that I don't know how to subset a small part of a
3-dimensional array, so I only showed its structure.
I tried dput() to a file; what should I do to subset it?

Another question:
My raw data is a "melted" data frame like this:
    IID      O2  variable value
1   TWF2H5   13% EW.INCU  49.38
2   TWF2H6   13% EW.INCU  48.02
3   TWF2H19  13% EW.INCU  51.44
280 TWF2H101 13% EW.17.5  42.26
281 TWF2H105 13% EW.17.5  43.52
282 TWF2H106 13% EW.17.5  42.83
472 TWF2N102 21% EW.17.5  45.97
473 TWF2N104 21% EW.17.5  43.32
474 TWF2N106 21% EW.17.5  48.63
689 TWF2N2   21% EMW      19.57
690 TWF2N6   21% EMW      18.07
691 TWF2N10  21% EMW      15.4
491 TWF2H5   13% EMW      15.61
492 TWF2H6   13% EMW      13.41
493 TWF2H19  13% EMW      14.03
199 TWF2N2   21% EW.INCU  48.69
200 TWF2N6   21% EW.INCU  50.52
201 TWF2N10  21% EW.INCU  42.04

If you had a t-test task like the one I described, would generating a
high-dimensional array be a good approach?
Thank you!

Yao He
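
On the subsetting question, a small sketch: a 3-dimensional array can be indexed on its second dimension before calling dput(), which gives a piece small enough to paste into an email. The toy array below only mimics the dimnames shown earlier in the thread; its values are made up.

```r
# Toy array with the same kind of dimnames as the real one (values invented)
a <- array(seq_len(2 * 6 * 3), dim = c(2, 6, 3),
           dimnames = list(c("13%", "21%"),
                           paste0("TWF2H", 101:106),
                           c("EW.INCU", "EW.17.5", "EMW")))

# Keep the first three animals only; drop = FALSE keeps all three dimensions
small <- a[, 1:3, , drop = FALSE]

# dput() prints R code that recreates 'small' exactly
dput(small)
```

The printed structure(...) call can be pasted into a message so that others can rebuild the same object.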
2013/1/7 arun :
> HI,
> I tried to create an example dataset (as you didn't provide the data).
> set.seed(25)
> a<-array(sample(1:50,60,replace=TRUE),dim=c(2,10,3))
> dimnames(a)[[1]]<-c("13%","21%")
> dimnames(a)[[2]]<-paste("TWF2H",101:110,sep="")
> dimnames(a)[[3]]<-c("EW.INCU","EW.17.5","EMW")
>
>
> str(a)
> # int [1:2, 1:10, 1:3] 21 35 8 45 7 50 32 17 4 15 ...
>  #- attr(*, "dimnames")=List of 3
>   #..$ : chr [1:2] "13%" "21%"
>   #..$ : chr [1:10] "TWF2H101" "TWF2H102" "TWF2H103" "TWF2H104" ...
>   #..$ : chr [1:3] "EW.INCU" "EW.17.5" "EMW"
>
> res<-lapply(lapply(seq_len(dim(a)[3]),function(i) t.test(a[dimnames(a)[[1]][1],,i],a[dimnames(a)[[1]][2],,i])),function(x) data.frame(mean=x$estimate,p.value=x$p.value))
> res1<-do.call(rbind,res)
> row.names(res1)[grep("mean of x",row.names(res1))]<-gsub("(.*\\.).*$","\\113%",row.names(res1)[grep("mean of x",row.names(res1))])
> row.names(res1)[grep("mean of y",row.names(res1))]<-gsub("(.*\\.).*$","\\121%",row.names(res1)[grep("mean of y",row.names(res1))])
> res1
> #            mean   p.value
> #EW.INCU.13% 22.3 0.2754842
> #EW.INCU.21% 29.3 0.2754842
> #EW.17.5.13% 20.5 0.4705772
> #EW.17.5.21% 16.0 0.4705772
> #EMW.13% 23.9 0.9638679
> #EMW.21% 24.2 0.9638679
> A.K.
>
>
>
>

Re: [R] how to aggregate T-test result in an elegant way?

2013-01-07 Thread Yao He
Hi,arun

Yes, I just want to do the t-test.
I think it may not be necessary to first generate a 3D array from the raw
data.frame with acast().

Thanks a lot

2013/1/7 arun :
> Hi Yao,
>
> It's okay.
>
> How did you generate the 3 D array?
> Using ?acast()
>
> I am not sure I understand your question "
>
> if you meet a t-test task as I described  , is that generate a
> high-dimension array  a good way ?"
>
> Do you want to do the t-test in the melt dataset?
>
> b<- read.table(text="
> ID O2 variable value
> 1 TWF2H5 13% EW.INCU 49.38
> 2 TWF2H6 13% EW.INCU 48.02
> 3 TWF2H19 13% EW.INCU 51.44
> 280 TWF2H101 13% EW.17.5 42.26
> 281 TWF2H105 13% EW.17.5 43.52
> 282 TWF2H106 13% EW.17.5 42.83
> 472 TWF2N102 21% EW.17.5 45.97
> 473 TWF2N104 21% EW.17.5 43.32
> 474 TWF2N106 21% EW.17.5 48.63
> 689 TWF2N2 21% EMW 19.57
> 690 TWF2N6 21% EMW 18.07
> 691 TWF2N10 21% EMW 15.4
> 491 TWF2H5 13% EMW 15.61
> 492 TWF2H6 13% EMW 13.41
> 493 TWF2H19 13% EMW 14.03
> 199 TWF2N2 21% EW.INCU 48.69
> 200 TWF2N6 21% EW.INCU 50.52
> 201 TWF2N10 21% EW.INCU 42.04
> ",sep="",header=TRUE,stringsAsFactors=FALSE)
> res<-lapply(lapply(split(b,b$variable),function(x) t.test(x$value[x$O2=="13%"],x$value[x$O2=="21%"])),function(x) data.frame(mean=x$estimate,p.value=x$p.value))
> res1<-do.call(rbind,res)
> row.names(res1)[grep("mean of x",row.names(res1))]<-gsub("(.*\\.).*$","\\113%",row.names(res1)[grep("mean of x",row.names(res1))])
> row.names(res1)[grep("mean of y",row.names(res1))]<-gsub("(.*\\.).*$","\\121%",row.names(res1)[grep("mean of y",row.names(res1))])
> res1
> #                mean    p.value
> #EMW.13% 14.35000 0.09355374
> #EMW.21% 17.68000 0.09355374
> #EW.17.5.13% 42.87000 0.17464018
> #EW.17.5.21% 45.97333 0.17464018
> #EW.INCU.13% 49.61333 0.43689727
> #EW.INCU.21% 47.08333 0.43689727
>
> A.K.
>
>
>

Re: [R] how to aggregate T-test result in an elegant way?

2013-01-07 Thread Yao He
Yes, thanks a lot for your help!

Regards

2013/1/8 arun :
> Hi Yao,
>
> You could also have the results in a wide format:
> res<-do.call(rbind,lapply(lapply(split(b,b$variable),function(x) t.test(x$value[x$O2=="13%"],x$value[x$O2=="21%"])),function(x) data.frame(mean13=x$estimate[1],mean21=x$estimate[2],p.value=x$p.value,CILow=x$conf.int[1],CIHigh=x$conf.int[2])))
>  res
> #          mean13   mean21    p.value     CILow    CIHigh
> #EMW 14.35000 17.68000 0.09355374 -7.682686  1.022686
> #EW.17.5 42.87000 45.97333 0.17464018 -9.265622  3.058955
> #EW.INCU 49.61333 47.08333 0.43689727 -7.119234 12.179234
> A.K.
>
>
>
>

Re: [R] ggplot not showing all the years on the x-axis

2013-01-08 Thread Yao He
Hi, this is a question of how to set the scale. Try adding
scale_x_continuous() with explicit breaks, like this:

plot <- tmpplot + geom_line()+scale_x_continuous(breaks=ii)


Yao He
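
For completeness, a self-contained version of the same idea, sketched with the data from the question (set.seed() is added here only so the random values are reproducible):

```r
library(ggplot2)

set.seed(1)
pm <- data.frame(ii = 2000:2011, ss = rnorm(12, 0, 1))

# breaks = pm$ii puts one tick on every year instead of ggplot's default choice
p <- ggplot(pm, aes(x = ii, y = ss)) +
  geom_line() +
  scale_x_continuous(breaks = pm$ii)
p
```

If the year labels start to overlap on a narrow plot, a sparser vector such as seq(2000, 2011, by = 2) can be passed to breaks instead.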


2013/1/8 Francesco Sarracino :
> Dear R helpers,
>
> I am currently having hard time fixing the values on the x-axis of a plot
> with ggplot: even though I have 12 years, ggplot plots only 3 of them.
> Here is my example:
>
> library(ggplot2)
> ii <- 2000:2011
> ss <- rnorm(12,0,1)
> pm <- data.frame(ii,ss)
> tmpplot <- ggplot(pm, aes(x = ii, y = ss))
> plot <- tmpplot + geom_line()
> plot
>
> In my case, ggplot reports on the year 2000, 2004 and 2008 on the x-axis,
> but I'd like to have all the years from 2000 to 2011. I know how to fix
> this with the standard plot in R, but for consistency I'd like to use
> ggplot.
> Can anyone help?
> thanks in advance,
> f.
>





Re: [R] how to count "A", "C", "T", "G" in each row in a big data.frame?

2013-01-09 Thread Yao He
> "GG",
> "GA", "GG", "TT", "CC", "GA", "CT", "AA", "AA", "AG"), X2570 = c("AA",
> "CT", "TT", "CC", "CT", "CC", "CC", "TT", "CC", "GG", "GG",
> "GG", "GG", "TT", "TC", "GG", "CC", "AA", "AA", "GG"), X2476 = c("AA",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "GG",
> "GG", "GG", "GT", "TC", "AG", "CC", "AA", "AA", "AG"), X2534 = c("GA",
> "TC", "TT", "CC", "TC", "CC", "CC", "TT", "CC", "GG", "GA",
> "AG", "GG", "TG", "CC", "AG", "TC", "AA", "AA", "AA"), X2280 = c("AA",
> "TC", "TT", "CC", "TC", "CC", "CC", "TT", "CC", "GG", "AG",
> "AG", "GG", "TT", "CC", "GG", "CC", "AA", "AA", "AG"), X2316 = c("AA",
> "CC", "TT", "CC", "CC", "CC", "CC", "TT", "CC", "AG", "AA",
> "AA", "AG", "TT", "TC", "GG", "CT", "AA", "GG", "GG"), X2339 = c("AA",
> "CC", "TT", "CC", "CC", "CC", "CC", "TT", "CC", "GA", "AA",
> "GG", "GG", "GT", "CT", "GG", "TT", "AA", "AA", "AG"), X2331 = c("AA",
> "TC", "TT", "CC", "TC", "CC", "CC", "TT", "CC", "GG", "GG",
> "GG", "GG", "TT", "CC", "GG", "CC", "AA", "AA", "AG"), X2343 = c("AA",
> "TC", "TT", "CC", "TC", "CC", "CC", "TT", "CC", "GG", "GG",
> "GG", "GG", "TT", "CT", "GG", "CC", "AA", "AA", "GA"), X2352 = c("AA",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "AA",
> "GG", "GG", "TT", "CC", "GG", "CC", "AA", "GA", "AG"), X2293 = c("GA",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "GA",
> "AA", "GG", "TT", "TC", "AA", "CT", "AA", "AA", "AA"), X2338 = c("GA",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "GG",
> "AG", "GG", "TT", "TC", "AG", "TC", "AA", "AA", "GA"), X2449 = c("AA",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "AG",
> "AA", "GG", "TT", "CC", "AA", "TC", "AA", "AA", "GA"), X2296 = c("GA",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GA", "GG",
> "AG", "GG", "TG", "TC", "AG", "CC", "AA", "AA", "AA"), X2453 = c("AG",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "AG", "GG",
> "GA", "GG", "GT", "CT", "GA",

Re: [R] how to count "A", "C", "T", "G" in each row in a big data.frame?

2013-01-09 Thread Yao He
Thanks a lot.

The problem is that I don't know how to handle the output list, because I
want to calculate the frequency of A, G, T or C by row.


Yao He
2013/1/10 Jessica Streicher :
> Sorry, you wanted rows, i wrote for columns
>
> #rows would be:
> test2<-apply(test[,-c(1:4)],1,function(x){table(t(x))})
>
> #find single values in a row
> sapply(test2,function(row){
> allVars<-paste(names(row),collapse="")
> u <- unique(strsplit(allVars,"")[[1]])
> parts<-sapply(names(row),function(x){u%in%strsplit(x,"")[[1]]})
> mat<-parts%*%row
> rownames(mat)<-u
> mat
> })
>
> though i guess lists aren't ideal, but theres another answer as well i see.
>
> On 09.01.2013, at 15:23, Yao He wrote:
>
>> Dear All
>>
>> I have a data.frame like that:
>> structure(list(name = c("Gga_rs10722041", "Gga_rs10722249", "Gga_rs10722565",
>> "Gga_rs10723082", "Gga_rs10723993", "Gga_rs10724555", "Gga_rs10726238",
>> "Gga_rs10726461", "Gga_rs10726774", "Gga_rs10726967", "Gga_rs10727581",
>> "Gga_rs10728004", "Gga_rs10728156", "Gga_rs10728177", "Gga_rs10728373",
>> "Gga_rs10728585", "Gga_rs10729598", "Gga_rs10729643", "Gga_rs10729685",
>> "Gga_rs10729827"), chr = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
>> 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), pos = c(11248993L,
>> 20038370L, 16164457L, 38050527L, 20307106L, 13707090L, 12230458L,
>> 36732967L, 2790856L, 1305785L, 29631963L, 13606593L, 13656397L,
>> 2261611L, 32096703L, 13733153L, 16524147L, 558735L, 12514023L,
>> 3619538L), strand = c("+", "+", "+", "+", "+", "+", "+", "+",
>> "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"),
>>X2353 = c("AA", "TT", "TT", "CC", "TT", "CC", "CC", "TT",
>>"CC", "GG", "AG", "AG", "AG", "TT", "CC", "AG", "CC", "AA",
>>"GG", "GG"), X2409 = c("AA", "CT", "TT", "CC", "CT", "CC",
>>"CC", "TT", "CC", "GG", "GG", "AG", "AG", "TT", "CC", "AG",
>>"CC", "AA", "AG", "GA"), X2500 = c("GA", "TT", "TT", "CC",
>>"TT", "CC", "CC", "TT", "CC", "GG", "GG", "GG", "GG", "GT",
>>"CT", "GG", "CC", "AA", "AA", "AA"), X2598 = c("AA", "TT",
>>"TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "AA", "AG",
>>"GG", "TT", "CC", "AG", "TC", "AA", "AA", "AG"), X2610 = c("AA",
>>"TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "GA",
>>"GA", "GG", "TT", "CC", "GA", "CC", "AA", "AA", "GA"), X2300 = c("GA",
>>"TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "GA",
>>"AA", "AG", "TT", "TC", "AA", "TC", "AA", "AG", "AA"), X2507 = c("AG",
>>"TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "GG",
>>"GA", "GG", "TT", "TC", "GG", "CC", "AA", "GA", "AG"), X2530 = c("AG",
>>"TC", "TT", "CC", "TC", "CC", "CC", "TT", "CC", "GG", "AA",
>>"GG", "GG", "TT", "CC

Re: [R] how to count "A", "C", "T", "G" in each row in a big data.frame?

2013-01-09 Thread Yao He
This output is really good; I think I can build on it.
Every time, your help deepens my understanding of R.
The first four columns are indeed irrelevant; including them was an
oversight on my part.

2013/1/10 William Dunlap :
> Can you get what you need from the following, where 'd' is your data.frame,
> the first four columns of which are irrelevant to this problem?
>   > dd <- d[,-(1:4)] ; table(rownames(dd)[row(dd)], unlist(dd))
>
>   AA AG CC CT GA GG GT TC TG TT
> 27412 29 10  0  0 13  1  0  0  0  0
> 27413  0  0  4  9  0  0  0 12  0 28
> 27414  0  0  0  0  0  0  0  0  0 53
> 27415  0  0 53  0  0  0  0  0  0  0
> ...
> 27430 46  3  0  0  2  2  0  0  0  0
> 27431 19 15  0  0 15  4  0  0  0  0
> table() is pretty quick.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
>> -Original Message-
>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
>> Behalf
>> Of Yao He
>> Sent: Wednesday, January 09, 2013 4:04 PM
>> To: jim holtman
>> Cc: R help
>> Subject: Re: [R] how to count "A", "C", "T", "G" in each row in a big 
>> data.frame?
>>
>> In fact I want to calculate the gene frequency of each SNP.
>>
>> The key problems are that:
>> 1. my data.frame is large ,about 50,000 rows. So it is so slow to
>> split() it by row
>>
>> 2 .The allele in each SNP (each row) are different.Some are A/G, some
>> are G/C. It is a little bit embarrassed for me to handle it.
>>
>> Thank you for your help
>>
>> 2013/1/9 jim holtman :
>> > forgot the data.  this will count the characters; you can add logic
>> > with 'table' to count groups
>> >
>> > 
>> > x <-
>> > structure(list(name = c("Gga_rs10722041", "Gga_rs10722249", 
>> > "Gga_rs10722565",
>> > "Gga_rs10723082", "Gga_rs10723993", "Gga_rs10724555", "Gga_rs10726238",
>> > "Gga_rs10726461", "Gga_rs10726774", "Gga_rs10726967", "Gga_rs10727581",
>> > "Gga_rs10728004", "Gga_rs10728156", "Gga_rs10728177", "Gga_rs10728373",
>> > "Gga_rs10728585", "Gga_rs10729598", "Gga_rs10729643", "Gga_rs10729685",
>> > "Gga_rs10729827"), chr = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
>> > 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), pos = c(11248993L,
>> > 20038370L, 16164457L, 38050527L, 20307106L, 13707090L, 12230458L,
>> > 36732967L, 2790856L, 1305785L, 29631963L, 13606593L, 13656397L,
>> > 2261611L, 32096703L, 13733153L, 16524147L, 558735L, 12514023L,
>> > 3619538L), strand = c("+", "+", "+", "+", "+", "+", "+", "+",
>> > "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"),
>> > X2353 = c("AA", "TT", "TT", "CC", "TT", "CC", "CC", "TT",
>> > "CC", "GG", "AG", "AG", "AG", "TT", "CC", "AG", "CC", "AA",
>> > "GG", "GG"), X2409 = c("AA", "CT", "TT", "CC", "CT", "CC",
>> > "CC", "TT", "CC", "GG", "GG", "AG", "AG", "TT", "CC", "AG",
>> > "CC", "AA", "AG", "GA"), X2500 = c("GA", "TT", "TT", "CC",
>> > "TT", "CC", "CC", "TT", "CC", "GG", "GG", "GG", "GG", "GT",
>> > "CT", "GG", "CC", "AA", "AA", "AA"), X2598 = c("AA", "TT",
>> > "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "AA", "AG",
>> > "GG", "TT", "CC", "AG", "TC", "AA", "AA", "AG"), X2610 = c("AA",
>> > "TT", "TT", "CC", "TT", "CC", "CC", "TT", "C

Re: [R] how to count "A", "C", "T", "G" in each row in a big data.frame?

2013-01-09 Thread Yao He
Hi arun
Then how could I split the genotypes and get a table of single-letter
counts, such as:
     id  A  T  C  G
1 27412 81  0  0 25
2 27413  0 77 29  0

 Thanks
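
One way to get per-letter counts directly, as a base-R sketch: take the first and second character of every genotype and count each base across the row. The three-SNP data frame `geno` is a made-up miniature of the real data (genotype columns only), and the names `geno`, `counts` and `freq` are assumptions for the illustration.

```r
# Made-up miniature of the genotype columns (one row per SNP)
geno <- data.frame(X2353 = c("AA", "TT", "CC"),
                   X2409 = c("AG", "CT", "CC"),
                   X2500 = c("GA", "TT", "CC"),
                   row.names = c("snp1", "snp2", "snp3"),
                   stringsAsFactors = FALSE)

m      <- as.matrix(geno)
first  <- substr(m, 1, 1)   # first allele of every genotype
second <- substr(m, 2, 2)   # second allele of every genotype

# For each base, count how often it occurs in either position, row by row
bases  <- c("A", "C", "G", "T")
counts <- sapply(bases, function(b) rowSums(first == b) + rowSums(second == b))
counts

# Allele frequencies per SNP
freq <- counts / rowSums(counts)
freq
```

Because substr() and rowSums() work on the whole matrix at once, no per-row split() is needed, which should help with a large number of rows.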

2013/1/10 arun :
> Hi Yao,
> You could also use:
> library(reshape2)
> dd<-dat1[,-(1:4)]
> res<-dcast(melt(within(dd,{id=row.names(dd)}),id.var="id"),id~value,length)
> head(res)
> # id AA AG CC CT GA GG GT TC TG TT
> #1 27412 29 10  0  0 13  1  0  0  0  0
> #2 27413  0  0  4  9  0  0  0 12  0 28
> #3 27414  0  0  0  0  0  0  0  0  0 53
> #4 27415  0  0 53  0  0  0  0  0  0  0
> #5 27416  0  0  3  9  0  0  0 12  0 29
> #6 27417  0  0 53  0  0  0  0  0  0  0
>
> #Just for comparison:
> dat2<- dat1[rep(row.names(dat1),2000),]
>  nrow(dat2)
> #[1] 4
>  row.names(dat2)<-1:4
>  dd <- dat2[,-(1:4)]
>   system.time(res1<- table(rownames(dd)[row(dd)], unlist(dd)))
> #   user  system elapsed
> #  5.840   0.104   5.954
>  system.time(res2 <- 
> dcast(melt(within(dd,{id=row.names(dd)}),id.var="id"),id~value,length))
> #   user  system elapsed
> #  3.100   0.064   3.167
>  head(res1,3)
>
>  # AA AG CC CT GA GG GT TC TG TT
>  # 1   29 10  0  0 13  1  0  0  0  0
>  # 10   0  4  0  0  6 43  0  0  0  0
>  # 100 19 15  0  0 15  4  0  0  0  0
>  head(res2,3)
> #   id AA AG CC CT GA GG GT TC TG TT
> #1   1 29 10  0  0 13  1  0  0  0  0
> #2  10  0  4  0  0  6 43  0  0  0  0
> #3 100 19 15  0  0 15  4  0  0  0  0
>
> A.K.
>
>
>
>
>
>
>
> - Original Message -
> From: Yao He 
> To: R help 
> Cc:
> Sent: Wednesday, January 9, 2013 9:23 AM
> Subject: [R] how to count "A","C","T","G" in each row in a big data.frame?
>
> Dear All
>
> I have a data.frame like that:
> structure(list(name = c("Gga_rs10722041", "Gga_rs10722249", "Gga_rs10722565",
> "Gga_rs10723082", "Gga_rs10723993", "Gga_rs10724555", "Gga_rs10726238",
> "Gga_rs10726461", "Gga_rs10726774", "Gga_rs10726967", "Gga_rs10727581",
> "Gga_rs10728004", "Gga_rs10728156", "Gga_rs10728177", "Gga_rs10728373",
> "Gga_rs10728585", "Gga_rs10729598", "Gga_rs10729643", "Gga_rs10729685",
> "Gga_rs10729827"), chr = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
> 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), pos = c(11248993L,
> 20038370L, 16164457L, 38050527L, 20307106L, 13707090L, 12230458L,
> 36732967L, 2790856L, 1305785L, 29631963L, 13606593L, 13656397L,
> 2261611L, 32096703L, 13733153L, 16524147L, 558735L, 12514023L,
> 3619538L), strand = c("+", "+", "+", "+", "+", "+", "+", "+",
> "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"),
> X2353 = c("AA", "TT", "TT", "CC", "TT", "CC", "CC", "TT",
> "CC", "GG", "AG", "AG", "AG", "TT", "CC", "AG", "CC", "AA",
> "GG", "GG"), X2409 = c("AA", "CT", "TT", "CC", "CT", "CC",
> "CC", "TT", "CC", "GG", "GG", "AG", "AG", "TT", "CC", "AG",
> "CC", "AA", "AG", "GA"), X2500 = c("GA", "TT", "TT", "CC",
> "TT", "CC", "CC", "TT", "CC", "GG", "GG", "GG", "GG", "GT",
> "CT", "GG", "CC", "AA", "AA", "AA"), X2598 = c("AA", "TT",
> "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "AA", "AG",
> "GG", "TT", "CC", "AG", "TC", "AA", "AA", "AG"), X2610 = c("AA",
> "TT", "TT", "CC", "TT", "CC", "CC", "TT", "CC", "GG", "GA",
> "GA", "GG", "TT", "CC", "GA", "CC", "AA", "AA", "GA"), X2300 = c("GA",
> &qu

Re: [R] how to generate a matrix by an my data.frame

2013-01-10 Thread Yao He
Thanks a lot
it works!

2013/1/11 Rui Barradas :
> Hello,
>
> Here are two ways.
>
> dat <- read.table(text = "
>
> id1   id2   value
> 2353  2353  0.096313
> 2353  2409  0.301773
> [...etc...]
>
> 2356  2356  0
> 2356  2611  0
> 2611  2611  0
> ", header = TRUE)
>
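> # The pairs are listed row by row, but R fills a triangle column by
> # column, so fill the lower triangle and transpose.
> mat1 <- matrix(nrow = 53, ncol = 53)  # initialize with NA's
> mat1[lower.tri(mat1, diag = TRUE)] <- dat$value
> mat1 <- t(mat1)
>
> mat2 <- matrix(0, nrow = 53, ncol = 53)  # initialize with zeros
> mat2[lower.tri(mat2, diag = TRUE)] <- dat$value
> mat2 <- t(mat2)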
>
>
> Hope this helps,
>
> Rui Barradas
> Em 10-01-2013 15:21, Yao He escreveu:
>
> Dear All
>
> It is a little hard to give a good small example of my question, so I
> will show the full data at the bottom and in the attachment. Maybe
> someone could tell me an appropriate way to show it. I'm sorry for the
> inconvenience.
>
>
> Q: How can I generate a 53*53 triangular matrix from my data?
> The problems that confuse me are:
> 1. Since it is a triangular matrix, I have tried to transform col1 and
> col2 into row and column indices, but I don't know how to fill a matrix
> by the indices of its values.
> 2. As you can see, 2353 is paired with 53 ids in col2, but 2409 is
> paired with only 52, 2500 with 51, and so on, so it is hard to use
> matrix() to generate it.
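
One way to address point 1 is to index the matrix with a two-column matrix of (row, column) positions looked up from the ids. The following is a minimal sketch on a three-id miniature of the pair list; the names ids, idx, mat and sym are illustrative, and the mirroring step assumes a symmetric matrix is wanted.

```r
# Hypothetical miniature of the pair list: 3 ids instead of 53
dat <- data.frame(
  id1   = c(2353, 2353, 2353, 2409, 2409, 2500),
  id2   = c(2353, 2409, 2500, 2409, 2500, 2500),
  value = c(0.096313, 0.301773, 0.169518, 0.096313, 0.169518, 0.048615)
)

ids <- unique(c(dat$id1, dat$id2))
mat <- matrix(NA_real_, length(ids), length(ids),
              dimnames = list(as.character(ids), as.character(ids)))

# A two-column matrix of (row, column) positions addresses one cell per pair
idx <- cbind(match(dat$id1, ids), match(dat$id2, ids))
mat[idx] <- dat$value

# Optional: mirror the upper triangle into the lower one
sym <- mat
sym[lower.tri(sym)] <- t(sym)[lower.tri(sym)]
print(mat)
```

Because every value is placed by its own pair of ids, the order of the rows in the input does not matter, and the differing number of pairs per id (53, 52, 51, ...) causes no trouble.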
>
> id1   id2   value
> 2353  2353  0.096313
> 2353  2409  0.301773
> 2353  2500  0.169518
> 2353  2598  0.11274
> 2353  2610  0.107414
> 2353  2300  0.034492
> 2353  2507  0.037521
> 2353  2530  0.064125
> 2353  2327  0.029259
> 2353  2389  0.036423
> 2353  2408  0.029259
> 2353  2463  0.036423
> 2353  2420  0.04409
> 2353  2563  0.055038
> 2353  2462  0.046478
> 2353  2292  0.036369
> 2353  2405  0.036369
> 2353  2543  0.053413
> 2353  2557  0.058151
> 2353  2583  0.081512
> 2353  2322  0.044373
> 2353  2535  0.04847
> 2353  2536  0.035538
> 2353  2581  0.035538
> 2353  2570  0.07711
> 2353  2476  0.047081
> 2353  2534  0.047081
> 2353  2280  0.088264
> 2353  2316  0.073608
> 2353  2339  0.067307
> 2353  2331  0.061172
> 2353  2343  0.060425
> 2353  2352  0.041153
> 2353  2293  0.040764
> 2353  2338  0.045128
> 2353  2449  0.040764
> 2353  2296  0.061333
> 2353  2453  0.046074
> 2353  2460  0.060387
> 2353  2474  0.060387
> 2353  2603  0.060387
> 2353  2282  0.048065
> 2353  2313  0.05584
> 2353  2538  0.050873
> 2353  2522  0.065727
> 2353  2489  0.041023
> 2353  2564  0.039696
> 2353  2594  0.056946
> 2353  2274  0.060875
> 2353  2451  0.037468
> 2353  2321  0
> 2353  2356  0
> 2353  2611  0
> 2409  2409  0.096313
> 2409  2500  0.169518
> 2409  2598  0.11274
> 2409  2610  0.107414
> 2409  2300  0.034492
> 2409  2507  0.037521
> 2409  2530  0.064125
> 2409  2327  0.029259
> 2409  2389  0.036423
> 2409  2408  0.029259
> 2409  2463  0.036423
> 2409  2420  0.04409
> 2409  2563  0.055038
> 2409  2462  0.046478
> 2409  2292  0.036369
> 2409  2405  0.036369
> 2409  2543  0.053413
> 2409  2557  0.058151
> 2409  2583  0.081512
> 2409  2322  0.044373
> 2409  2535  0.04847
> 2409  2536  0.035538
> 2409  2581  0.035538
> 2409  2570  0.07711
> 2409  2476  0.047081
> 2409  2534  0.047081
> 2409  2280  0.088264
> 2409  2316  0.073608
> 2409  2339  0.067307
> 2409  2331  0.061172
> 2409  2343  0.060425
> 2409  2352  0.041153
> 2409  2293  0.040764
> 2409  2338  0.045128
> 2409  2449  0.040764
> 2409  2296  0.061333
> 2409  2453  0.046074
> 2409  2460  0.060387
> 2409  2474  0.060387
> 2409  2603  0.060387
> 2409  2282  0.048065
> 2409  2313  0.05584
> 2409  2538  0.050873
> 2409  2522  0.065727
> 2409  2489  0.041023
> 2409  2564  0.039696
> 2409  2594  0.056946
> 2409  2274  0.060875
> 2409  2451  0.037468
> 2409  2321  0
> 2409  2356  0
> 2409  2611  0
> 2500  2500  0.048615
> 2500  2598  0.051979
> 2500  2610  0.041031
> 2500  2300  0.032974
> 2500  2507  0.052788
> 2500  2530  0.041435
> 2500  2327  0.038071
> 2500  2389  0.051659
> 2500  2408  0.038071
> 2500  2463  0.051659
> 2500  2420  0.052635
> 2500  2563  0.07872
> 2500  2462  0.048615
> 2500  2292  0.044365
> 2500  2405  0.044365
> 2500  2543  0.04277
> 2500  2557  0.051109
> 2500  2583  0.047409
> 2500  2322  0.054512
>

[R] how to read a df like that and transform it?

2013-01-23 Thread Yao He
Dear all

I have a data.frame like this:

father  mother  num_daughter  daughter
291     3906    0             NULL
275     4219    0             NULL
273     4236    1             49410
281     4163    1             49408
274     4226    1             49406
295     3869    2             49403
                              49404
287     4113    0             NULL
295     3871    1             49401
292     3895    4             49396
                              49397
                              49398
                              49399
291     3900    3             49392

How can I read it into R and transform it into this:

father  mother  num_daughter  daughter1  daughter2  daughter3  daughter4
291     3906    0             NULL
275     4219    0             NULL
273     4236    1             49410
281     4163    1             49408
274     4226    1             49406
295     3869    2             49403      49404
287     4113    0             NULL
295     3871    1             49401
292     3895    4             49396      49397      49398      49399
291     3900    3             49392

Solutions using plyr, reshape2 or other packages are fine with me.

Thanks a lot!
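
A possible base-R approach, given only as a sketch: read the file as lines, treat every four-field line as the start of a family, attach the one-field continuation lines to the family above them, and pad each family's daughters to a common width. The inline txt vector stands in for readLines() on the real file and is a shortened, assumed version of the layout in the question; all variable names are illustrative.

```r
# Stand-in for readLines("yourfile.txt"); continuation lines carry only a
# daughter id (layout assumed from the question)
txt <- c(
  "father mother num_daughter daughter",
  "291 3906 0 NULL",
  "273 4236 1 49410",
  "295 3869 2 49403",
  "                 49404",
  "292 3895 4 49396",
  "                 49397",
  "                 49398",
  "                 49399"
)

fields <- strsplit(trimws(txt[-1]), "[[:space:]]+")
is_new <- lengths(fields) == 4        # TRUE where a new family starts
family <- cumsum(is_new)              # family index for every line

parents  <- do.call(rbind, fields[is_new])
daughter <- vapply(fields, function(f) f[length(f)], character(1))
daughter[daughter == "NULL"] <- NA

# One row per family, padded with NA up to the largest number of daughters
by_family <- split(daughter, family)
max_d <- max(lengths(by_family))
wide  <- do.call(rbind, lapply(by_family, function(d) d[seq_len(max_d)]))

res <- data.frame(
  father       = as.integer(parents[, 1]),
  mother       = as.integer(parents[, 2]),
  num_daughter = as.integer(parents[, 3]),
  wide,
  stringsAsFactors = FALSE
)
names(res)[4:ncol(res)] <- paste0("daughter", seq_len(max_d))
print(res)
```

The daughter columns are left as character here; they can be converted with as.integer() if numeric ids are preferred.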

Yao He
—
Master's candidate in 2nd year
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology,
China Agricultural University, Beijing, 100193
E-mail: yao.h.1...@gmail.com
——



Re: [R] How to transpose it in a fast way?

2013-03-13 Thread Yao He
Thanks for everybody's help!

I learned a lot from this discussion!



2013/3/10 jim holtman :
> Did you check out the 'colbycol' package?
>
> On Fri, Mar 8, 2013 at 5:46 PM, Martin Morgan  wrote:
>
>> On 03/08/2013 06:01 AM, Jan van der Laan wrote:
>>
>>>
>>> You could use the fact that scan reads the data rowwise, and the fact that
>>> arrays are stored columnwise:
>>>
>>> # generate a small example dataset
>>> exampl <- array(letters[1:25], dim=c(5,5))
>>> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>>>  sep="\t", quote=FALSE)
>>>
>>> # and read...
>>> d <- scan("example.dat", what=character())
>>> d <- array(d, dim=c(5,5))
>>>
>>> t(exampl) == d
>>>
>>>
>>> Although this is probably faster, it doesn't help with the large size.
>>> You could use the n option of scan to read chunks/blocks and feed those
>>> to, for example, an ff array (which you ideally have preallocated).
>>>
>>
>> I think it's worth asking what the overall goal is; all we get from this
>> exercise is another large file that we can't easily manipulate in R!
>>
>> But nothing like a little challenge. The idea I think would be to
>> transpose in chunks of rows by scanning in some number of rows and writing
>> to a temporary file
>>
>> tpose1 <- function(fin, nrowPerChunk, ncol) {
>>     v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
>>     m <- matrix(v, ncol=ncol, byrow=TRUE)
>>     fout <- tempfile()
>>     write(m, fout, nrow(m), append=TRUE)
>>     fout
>> }
>>
>> Apparently the data is 60k x 60k, so we could maybe easily read 60k x 10k
>> at a time from some file fl <- "big.txt"
>>
>> ncol <- 60000L
>> nrowPerChunk <- 10000L
>> nChunks <- ncol / nrowPerChunk
>>
>> fin <- file(fl); open(fin)
>> fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
>> close(fin)
>>
>> 'fls' is now a vector of file paths, each containing a transposed slice of
>> the matrix. The next task is to splice these together. We could do this by
>> taking a slice of rows from each file, cbind'ing them together, and writing
>> to an output
>>
>> splice <- function(fout, cons, nrowPerChunk, ncol) {
>>     slices <- lapply(cons, function(con) {
>>         v <- scan(con, character(), nmax=nrowPerChunk * ncol)
>>         matrix(v, nrowPerChunk, byrow=TRUE)
>>     })
>>     m <- do.call(cbind, slices)
>>     write(t(m), fout, ncol(m), append=TRUE)
>> }
>>
>> We'd need to use open connections as inputs and output
>>
>> cons <- lapply(fls, file); for (con in cons) open(con)
>> fout <- file("big_transposed.txt"); open(fout, "w")
>> xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
>> for (con in cons) close(con)
>> close(fout)
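
As a sanity check, the two steps above can be run on a toy 6 x 6 matrix with chunks of 2 rows. This is only a sketch: the functions follow the message, with quiet=TRUE added to scan(), and the toy matrix, the tempfile() paths and the final read-back are assumptions made for illustration.

```r
# Toy 6 x 6 character matrix, written to disk one matrix row per line
m0 <- outer(1:6, 1:6, function(i, j) paste0("r", i, "c", j))
fl <- tempfile()
write(t(m0), fl, ncolumns = 6)    # write() emits column-major, hence t()

# Step 1: read a block of rows and write it out transposed
tpose1 <- function(fin, nrowPerChunk, ncol) {
    v <- scan(fin, character(), nmax = ncol * nrowPerChunk, quiet = TRUE)
    m <- matrix(v, ncol = ncol, byrow = TRUE)
    fout <- tempfile()
    write(m, fout, nrow(m), append = TRUE)   # one line per column of the block
    fout
}

# Step 2: take the next rows from every slice and glue them side by side
splice <- function(fout, cons, nrowPerChunk, ncol) {
    slices <- lapply(cons, function(con) {
        v <- scan(con, character(), nmax = nrowPerChunk * ncol, quiet = TRUE)
        matrix(v, nrowPerChunk, byrow = TRUE)
    })
    m <- do.call(cbind, slices)
    write(t(m), fout, ncol(m), append = TRUE)
}

ncol <- 6L
nrowPerChunk <- 2L
nChunks <- ncol / nrowPerChunk

fin <- file(fl); open(fin)
fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
close(fin)

cons <- lapply(fls, file); for (con in cons) open(con)
out <- tempfile()
fout <- file(out); open(fout, "w")
for (i in seq_len(nChunks)) splice(fout, cons, nrowPerChunk, nrowPerChunk)
for (con in cons) close(con)
close(fout)

# Read the result back and compare it with the in-memory transpose
res <- matrix(scan(out, character(), quiet = TRUE), nrow = 6, byrow = TRUE)
print(all(res == t(m0)))
```

Only one block of rows is held in memory at any time, which is the point of the approach for the 60k x 60k case.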
>>
>> As another approach, it looks like the data are from genotypes. If they
>> really only consist of pairs of A, C, G, T, then two pairs e.g., 'AA' 'CT'
>> could be encoded as a single byte
>>
>> alf <- c("A", "C", "G", "T")
>> nms <- outer(alf, alf, paste0)
>> map <- outer(setNames(as.raw(0:15), nms),
>>              setNames(as.raw(bitwShiftL(0:15, 4)), nms),
>>              "|")
>>
>> with e.g.,
>>
>> > map[matrix(c("AA", "CT"), ncol=2)]
>> [1] d0
>>
>> This translates the problem of representing the 60k x 60k array as a 3.6
>> billion element vector of 60k * 60k * 8 bytes (approx. 30 Gbytes) to one of
>> 60k x 30k = 1.8 billion elements (fits in R-2.15 vectors) of approx 1.8
>> Gbyte (probably usable in an 8 Gbyte laptop).
>>
>> Personally, I would probably put this data in a netCDF / HDF5 file.
>> Perhaps I'd use snpStats or GWASTools in Bioconductor
>> http://bioconductor.org.
>>
>> Martin
>>
>>
>>> HTH,
>>>
>>> Jan
>>>
>>>
>>>
>>>
>>> peter dalgaard  schreef:
>>>
>>>  On Mar 7, 2013, at 01:18 , Yao He wrote:
>>>>
>>>>  Dear all:
>>>>>
>>>>> I 

[R] Do association study based on mixed linear model

2013-03-19 Thread Yao He
Dear All

I want to do an association study based on a mixed linear model.

My model includes several fixed effects and random effects, and it
also incorporates covariates such as "birth weight".
The data set has about 180 individuals and 12 variables, with 6 fixed
effects to estimate.

As asreml-R is not free, are there any packages suitable for my study?
I have heard of nlme and lme4, but I'm not sure whether they can
incorporate covariates, and how computationally efficient are they?

Thanks for your recommendations
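
For what it is worth, in both nlme and lme4 a covariate is simply another term on the fixed-effects side of the model formula. The sketch below uses nlme on simulated data; the variable names (trait, genotype, sex, birth_weight, family) and the random-intercept structure are assumptions for illustration, not the poster's actual model.

```r
# Simulated stand-in for the real data: 180 individuals in 30 families
set.seed(1)
n <- 180
dat <- data.frame(
  family       = factor(rep(1:30, each = 6)),
  genotype     = factor(sample(c("AA", "AG", "GG"), n, replace = TRUE)),
  sex          = factor(sample(c("F", "M"), n, replace = TRUE)),
  birth_weight = rnorm(n, mean = 40, sd = 3)
)
fam_eff <- rnorm(30, sd = 2)
dat$trait <- 10 + 0.5 * dat$birth_weight + fam_eff[dat$family] + rnorm(n)

fit <- NULL
if (requireNamespace("nlme", quietly = TRUE)) {
  # genotype and sex are fixed factors, birth_weight is a continuous
  # covariate, and family enters as a random intercept
  fit <- nlme::lme(trait ~ genotype + sex + birth_weight,
                   random = ~ 1 | family, data = dat)
  print(nlme::fixef(fit))
}
```

A data set of this size is small for either package, so computational efficiency should not be a concern here.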



[R] How to select a subset data to do a barplot in ggplot2

2012-12-13 Thread Yao He
Hi,everybody

I have a dataframe like this

FID  IID   STATUS
1    4621  live
1    4628  dead
2    4631  live
2    4632  live
2    4633  live
2    4634  live
6    4675  live
6    4679  dead
10   4716  dead
10   4719  live
10   4721  dead
11   4726  live
11   4728  nosperm
11   4730  nosperm
12   4732  live
17   4783  live
17   4783  live
17   4784  live

I just want a barplot that counts "live" and "dead" in every "FID" and
fills the bars with different colours.

I tried this code:

p<-ggplot(data,aes(x=FID));
p+geom_bar(aes(x=factor(FID),y=..count..,fill=STATUS))

But how can I exclude "nosperm" or other levels using ggplot2 alone,
without generating another dataframe?

Thanks a lot
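
One option, shown here only as a sketch, is to subset inline in the ggplot() call so that no second named data frame is created. The small data frame is a made-up excerpt of the table above, keep is an illustrative name, and the base-R table is there just to cross-check the counts the bars would show.

```r
# Hypothetical excerpt of the data in the question
data <- data.frame(
  FID    = c(1, 1, 2, 2, 11, 11, 11),
  IID    = c(4621, 4628, 4631, 4632, 4726, 4728, 4730),
  STATUS = c("live", "dead", "live", "live", "live", "nosperm", "nosperm"),
  stringsAsFactors = FALSE
)
keep <- c("live", "dead")

# Cross-check in base R: the counts the bars are expected to show
sel <- data$STATUS %in% keep
counts <- table(FID = data$FID[sel], STATUS = data$STATUS[sel])
print(counts)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # subset() is evaluated inline, so only the kept rows reach geom_bar()
  p <- ggplot(subset(data, STATUS %in% keep),
              aes(x = factor(FID), fill = STATUS)) +
    geom_bar()
  # print(p) draws the plot
}
```

Levels that are filtered out this way do not appear in the legend either, because the rows never reach the plot.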



[R] how to handle NA values in aggregate()

2012-12-15 Thread Yao He
Dear All:

I am trying to calculate the means of four columns in a dataframe like this:

FID  MID   IID      EW_INCU  EW_17.5  EMW    EEratio
1    4621  TWF2H5   45.26    NA       15.61  NA
1    4621  TWF2H6   48.02    44.09    13.41  0.3041506
2    4630  TWF2H19  51.44    47.81    NA     NA
2    4631  TWF2H21  NA       52.72    16.70  0.3167678
2    4632  TWF2H22  55.70    50.45    16.48  0.3266601
2    4633  TWF2H23  44.42    40.89    12.96  0.3169479

I tried this code:

> aggregate(df[,4:7],df[,1],mean)

But I couldn't work out how to set the argument na.rm=T for mean(), so
the results are all NAs.

Please tell me how to handle NA values when using aggregate().

Thanks a lot
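
A minimal sketch of one way to do this: aggregate() passes any extra arguments on to FUN, so na.rm = TRUE can be given after the function, and the grouping variable should be supplied as a list. The data frame below is a made-up excerpt with two of the four measurement columns.

```r
# Hypothetical excerpt of the data in the question
df <- data.frame(
  FID     = c(1, 1, 2, 2, 2, 2),
  EW_INCU = c(45.26, 48.02, 51.44, NA, 55.70, 44.42),
  EW_17.5 = c(NA, 44.09, 47.81, 52.72, 50.45, 40.89)
)

# Extra arguments after FUN are handed to mean(), so NAs are dropped
# column by column within each FID group
res <- aggregate(df[, c("EW_INCU", "EW_17.5")],
                 by = list(FID = df$FID),
                 FUN = mean, na.rm = TRUE)
print(res)
```

Note that the formula interface of aggregate() behaves differently by default: its na.action = na.omit drops every row that contains any NA before the means are computed.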
