On Mon, 21 Jan 2008, Boks, M.P.M. wrote:
>
> Dear R-experts,
>
> My problem is how to handle a 10GB data file containing genotype data. The
> file is in a particular format (Illumina final report) and needs to be
> altered and merged with phenotype data for further analysis.
>

If the file lists all the SNPs for one individual, then all the SNPs for the
next individual, and so on, you can read in 305000 lines of data at a time,
look up the phenotype, then write out one line of output, e.g. with cat().
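
A rough sketch of that block-at-a-time approach is below. It is only an
illustration under assumptions that are not in the original question: the
file is taken to be tab-separated with sample ID, SNP name and genotype on
each line, with no header block, and with exactly n_snps consecutive lines
per individual. The function name reshape_by_person and the named phenotype
vector pheno are made up for the example, and the demonstration data are
invented.

```r
# Sketch: stream a long-format genotype file one individual at a time.
# ASSUMPTIONS (hypothetical layout): tab-separated lines of
# sample ID, SNP name, genotype; no header block; every individual has
# exactly n_snps consecutive lines; pheno is a named vector keyed by ID.
reshape_by_person <- function(infile, outfile, pheno, n_snps) {
  con_in  <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  n_people <- 0
  repeat {
    lines <- readLines(con_in, n = n_snps)   # one individual's block
    if (length(lines) == 0) break            # end of file
    fields <- strsplit(lines, "\t", fixed = TRUE)
    ids  <- vapply(fields, `[`, character(1), 1)
    geno <- vapply(fields, `[`, character(1), 3)
    id <- ids[1]
    # Guard against a short or misaligned block.
    stopifnot(length(lines) == n_snps, all(ids == id))
    # One output line per person: ID, phenotype, then all genotypes.
    cat(id, pheno[[id]], geno, file = con_out, sep = "\t")
    cat("\n", file = con_out)
    n_people <- n_people + 1
  }
  invisible(n_people)
}

# Tiny demonstration: 2 individuals, 3 SNPs each (made-up data).
infile  <- tempfile()
outfile <- tempfile()
writeLines(c("p1\trs1\tAA", "p1\trs2\tAB", "p1\trs3\tBB",
             "p2\trs1\tAB", "p2\trs2\tAB", "p2\trs3\tAA"), infile)
pheno  <- c(p1 = 1.5, p2 = 2.0)
n_done <- reshape_by_person(infile, outfile, pheno, n_snps = 3)
result <- readLines(outfile)
```

Only one individual's block is held in memory at a time, so the size of the
whole file should not matter much; for the real data n_snps would be 305000.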

As another approach, I've been using the ncdf package for handling Illumina
genotype data (slightly larger datasets, and multiple phenotypes). This has
been faster and more compact than SQLite, because it doesn't need indexes to
do random access by person and by SNP. It is then easy to write analyses by
SNP (association tests), analyses by person (allele sharing, population
structure), and even analyses by genomic region (e.g. all SNPs in chr9q21.3).
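
To give a flavour of the netCDF layout, here is a minimal sketch. Note that
it uses the ncdf4 package rather than ncdf (a substitution made for this
example, on the assumption that ncdf4 is what is installed), and that the
variable name "genotype", the 0/1/2 allele-count coding, the -1 missing
value code and the tiny matrix are all made up for illustration. It is not
the actual code used for the analyses described above.

```r
# Sketch: store a SNP x person genotype matrix in netCDF and read slices.
# ASSUMPTIONS: ncdf4 is installed; genotypes coded 0/1/2; -1 = missing.
library(ncdf4)

# Made-up data: 4 SNPs (rows) by 3 people (columns).
geno <- matrix(c(0, 1, 2, 1,
                 1, 1, 0, 2,
                 2, 0, 0, 1), nrow = 4, ncol = 3)

snp_dim    <- ncdim_def("snp", "count", 1:nrow(geno))
person_dim <- ncdim_def("person", "count", 1:ncol(geno))
geno_var   <- ncvar_def("genotype", "allele_count",
                        dim = list(snp_dim, person_dim),
                        missval = -1, prec = "byte")

path <- tempfile(fileext = ".nc")
nc <- nc_create(path, geno_var)
ncvar_put(nc, geno_var, geno)
nc_close(nc)

# Random access without indexes: start/count pick out a slice,
# and a count of -1 means "everything along this dimension".
nc <- nc_open(path)
by_snp    <- ncvar_get(nc, "genotype", start = c(3, 1), count = c(1, -1))
by_person <- ncvar_get(nc, "genotype", start = c(1, 2), count = c(-1, 1))
nc_close(nc)
```

Here by_snp holds SNP 3 for every person and by_person holds every SNP for
person 2; a genomic region would be a contiguous range along the SNP
dimension, provided the SNPs were stored in map order.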

	-thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]               University of Washington, Seattle