Did you check out the 'colbycol' package?

On Fri, Mar 8, 2013 at 5:46 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 03/08/2013 06:01 AM, Jan van der Laan wrote:
>
>>
>> You could use the fact that scan reads the data rowwise, and the fact that
>> arrays are stored columnwise:
>>
>> # generate a small example dataset
>> exampl <- array(letters[1:25], dim=c(5,5))
>> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>>      sep="\t", quote=FALSE)
>>
>> # and read...
>> d <- scan("example.dat", what=character())
>> d <- array(d, dim=c(5,5))
>>
>> t(exampl) == d
>>
>>
>> Although this is probably faster, it doesn't help with the large size.
>> You could use the n option of scan to read chunks/blocks and feed those
>> to, for example, an ff array (which you ideally have preallocated).
>>
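
A minimal sketch of the chunked reading suggested above, under some
assumptions: an ordinary preallocated matrix stands in for the ff array,
the chunk size (rowsPerChunk) is a made-up value, and nlines is used in
place of n so that each chunk ends on a row boundary. The file is a tiny
stand-in, not the real data.

```r
# Tiny stand-in for the big file: 6 rows x 4 columns, tab-separated.
nr <- 6L; nc <- 4L
x <- matrix(paste0("v", seq_len(nr * nc)), nr, nc)
infile <- tempfile()
write.table(x, infile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Preallocate the transposed result (plain matrix here; an ff array
# would take its place when the data do not fit in memory).
tx <- matrix(NA_character_, nrow = nc, ncol = nr)

rowsPerChunk <- 4L   # hypothetical chunk size
con <- file(infile, "r")
done <- 0L
while (done < nr) {
    v <- scan(con, what = character(), sep = "\t",
              nlines = rowsPerChunk, quiet = TRUE)
    if (length(v) == 0L) break
    k <- length(v) %/% nc   # number of rows actually read
    # scan() returns the values row by row, so filling an nc x k matrix
    # column by column gives the transposed chunk directly.
    tx[, done + seq_len(k)] <- matrix(v, nrow = nc, ncol = k)
    done <- done + k
}
close(con)
```

Because the connection stays open, each scan() call continues where the
previous one stopped, so only one chunk is held as raw text at a time.
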
>
> I think it's worth asking what the overall goal is; all we get from this
> exercise is another large file that we can't easily manipulate in R!
>
> But nothing like a little challenge. The idea I think would be to
> transpose in chunks of rows by scanning in some number of rows and writing
> to a temporary file
>
>     tpose1 <- function(fin, nrowPerChunk, ncol) {
>         v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
>         m <- matrix(v, ncol=ncol, byrow=TRUE)
>         fout <- tempfile()
>         write(m, fout, nrow(m), append=TRUE)
>         fout
>     }
>
> Apparently the data is 60k x 60k, so we could maybe easily read 60k x 10k
> at a time from some file fl <- "big.txt"
>
>     ncol <- 60000L
>     nrowPerChunk <- 10000L
>     nChunks <- ncol / nrowPerChunk
>
>     fin <- file(fl); open(fin)
>     fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
>     close(fin)
>
> 'fls' is now a vector of file paths, each containing a transposed slice of
> the matrix. The next task is to splice these together. We could do this by
> taking a slice of rows from each file, cbind'ing them together, and writing
> to an output
>
>     splice <- function(fout, cons, nrowPerChunk, ncol) {
>         slices <- lapply(cons, function(con) {
>             v <- scan(con, character(), nmax=nrowPerChunk * ncol)
>             matrix(v, nrowPerChunk, byrow=TRUE)
>         })
>         m <- do.call(cbind, slices)
>         write(t(m), fout, ncol(m), append=TRUE)
>     }
>
> We'd need to use open connections as inputs and output
>
>     cons <- lapply(fls, file); for (con in cons) open(con)
>     fout <- file("big_transposed.txt"); open(fout, "w")
>     xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
>     for (con in cons) close(con)
>     close(fout)
>
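
To convince oneself that the two passes above really produce the
transpose, the same steps can be run on a tiny file. This is only a
sketch: the 6 x 6 size, the chunk size of 2, and the temporary file names
are assumptions for illustration, and the two functions are restated
(with quiet=TRUE added) so the block runs on its own.

```r
tpose1 <- function(fin, nrowPerChunk, ncol) {
    # read one block of rows and write it out transposed
    v <- scan(fin, character(), nmax = ncol * nrowPerChunk, quiet = TRUE)
    m <- matrix(v, ncol = ncol, byrow = TRUE)
    fout <- tempfile()
    write(m, fout, nrow(m))
    fout
}

splice <- function(fout, cons, nrowPerChunk, ncol) {
    # take the next nrowPerChunk rows from every slice and glue them
    slices <- lapply(cons, function(con) {
        v <- scan(con, character(), nmax = nrowPerChunk * ncol, quiet = TRUE)
        matrix(v, nrowPerChunk, byrow = TRUE)
    })
    m <- do.call(cbind, slices)
    write(t(m), fout, ncol(m))
    invisible(NULL)
}

# toy input: 6 x 6, processed 2 rows at a time
n <- 6L; nrowPerChunk <- 2L; nChunks <- n %/% nrowPerChunk
x <- matrix(paste0("v", seq_len(n * n)), n, n)
fl <- tempfile()
write.table(x, fl, quote = FALSE, row.names = FALSE, col.names = FALSE)

# pass 1: transposed slices in temporary files
fin <- file(fl, "r")
fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, n))
close(fin)

# pass 2: splice the slices into the final file
cons <- lapply(fls, file, open = "r")
res <- tempfile()
fout <- file(res, "w")
for (i in seq_len(nChunks)) splice(fout, cons, nrowPerChunk, nrowPerChunk)
for (con in cons) close(con)
close(fout)
unlink(fls)

# read the result back and compare with t(x)
y <- as.matrix(read.table(res, colClasses = "character"))
dimnames(y) <- NULL
all(y == t(x))
```
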
> As another approach, it looks like the data are from genotypes. If they
> really only consist of pairs of A, C, G, T, then two pairs e.g., 'AA' 'CT'
> could be encoded as a single byte
>
>     alf <- c("A", "C", "G", "T")
>     nms <- outer(alf, alf, paste0)
>     map <- outer(setNames(as.raw(0:15), nms),
>                  setNames(as.raw(bitwShiftL(0:15, 4)), nms),
>                  "|")
>
> with e.g.,
>
> > map[matrix(c("AA", "CT"), ncol=2)]
> [1] d0
>
> This translates the problem of representing the 60k x 60k array as a 3.6
> billion element vector of 60k * 60k * 8 bytes (approx. 30 Gbytes) to one of
> 60k x 30k = 1.8 billion elements (fits in R-2.15 vectors) of approx 1.8
> Gbyte (probably usable in an 8 Gbyte laptop).
>
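
A small sketch of how the two-genotypes-per-byte encoding could be
applied to a whole vector and reversed. The helper names pack() and
unpack() are made up for illustration, and plain integer arithmetic is
used instead of the lookup table above; the byte layout (first genotype
in the low four bits, second in the high four bits) is meant to match it.

```r
alf <- c("A", "C", "G", "T")
nms <- as.vector(outer(alf, alf, paste0))   # 16 labels, coded 0..15

# pack(): hypothetical helper; g must have even length, and each
# consecutive pair of genotypes ends up in one byte
pack <- function(g) {
    code <- match(g, nms) - 1L
    lo <- code[c(TRUE, FALSE)]    # 1st, 3rd, ... genotype: low 4 bits
    hi <- code[c(FALSE, TRUE)]    # 2nd, 4th, ... genotype: high 4 bits
    as.raw(lo + 16L * hi)
}

# unpack(): hypothetical inverse of pack()
unpack <- function(b) {
    b <- as.integer(b)
    lo <- b %% 16L
    hi <- b %/% 16L
    nms[as.vector(rbind(lo, hi)) + 1L]   # interleave lo1, hi1, lo2, ...
}

g <- c("AA", "CT", "GG", "TC")
b <- pack(g)
b[1]                       # same byte as the map[] example: d0
identical(unpack(b), g)
```

A row of 60k genotypes then takes 30k bytes, which is where the
60k x 30k figure comes from.
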
> Personally, I would probably put this data in a netcdf / rdf5 file.
> Perhaps I'd use snpStats or GWASTools in Bioconductor
> http://bioconductor.org.
>
> Martin
>
>
>> HTH,
>>
>> Jan
>>
>>
>>
>>
>> peter dalgaard <pda...@gmail.com> schreef:
>>
>>  On Mar 7, 2013, at 01:18 , Yao He wrote:
>>>
>>>  Dear all:
>>>>
>>>> I have a big data file of 60000 columns and 60000 rows like that:
>>>>
>>>> AA AC AA AA .......AT
>>>> CC CC CT CT.......TC
>>>> ..........................
>>>> .........................
>>>>
>>>> I want to transpose it, and the output is a new file like this:
>>>> AA CC ............
>>>> AC CC............
>>>> AA CT.............
>>>> AA CT.........
>>>> ....................
>>>> ....................
>>>> AT TC.............
>>>>
>>>> The key point is that I can't read it into R with read.table() because
>>>> the data are too large, so I tried this:
>>>> c<-file("silygenotype.txt","r")
>>>> geno_t<-list()
>>>> repeat{
>>>>  line<-readLines(c,n=1)
>>>>  if (length(line)==0)break  #end of file
>>>>  line<-unlist(strsplit(line,"\t"))
>>>> geno_t<-cbind(geno_t,line)
>>>> }
>>>> write.table(geno_t,"xxx.txt")
>>>>
>>>> It works, but it is too slow. How can I optimize it?
>>>>
>>>
>>>
>>> As others have pointed out, that's a lot of data!
>>>
>>> You seem to have the right idea: if you read the columns line by line,
>>> there is nothing to transpose. A couple of points, though:
>>>
>>> - The cbind() is a potential performance hit since it copies the list
>>> every time around. Preallocate with geno_t <- vector("list", 60000) and
>>> then assign geno_t[[i]] <- <etc>
>>>
>>> - You might use scan() instead of readLines, strsplit
>>>
>>> - Perhaps consider the data type, as you seem to be reading strings with
>>> 16 possible values (I suspect that R already optimizes string storage to
>>> make this point moot, though.)
>>>
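
The first two points above can be sketched as follows. This is only an
illustration on a made-up 3 x 4 file; the real file would use 60000 in
place of nrows, and the final cbind of everything would still need the
whole result to fit in memory.

```r
# Tiny tab-separated stand-in for the genotype file (3 rows x 4 columns).
x <- matrix(c("AA", "AC", "AA", "AT",
              "CC", "CC", "CT", "TC",
              "GG", "GT", "TT", "AG"), nrow = 3, byrow = TRUE)
infile <- tempfile()
write.table(x, infile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

nrows <- nrow(x)                    # 60000 in the original question
geno_t <- vector("list", nrows)     # allocate the list once, up front

con <- file(infile, "r")
for (i in seq_len(nrows)) {
    # scan() reads and splits one line in a single step
    geno_t[[i]] <- scan(con, what = character(), sep = "\t",
                        nlines = 1L, quiet = TRUE)
}
close(con)

# One cbind at the end instead of one per line: each input row
# becomes a column of the result.
res <- do.call(cbind, geno_t)
all(res == t(x))
```
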
>>> --
>>> Peter Dalgaard, Professor
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


