Thanks a million, Chuck! I think I got it. Have a good weekend!
Andy Charles C. Berry wrote: > > On Thu, 10 Jan 2008, AndyZon wrote: > >> >> Thank you so much, Chuck! >> >> This is brilliant, I just tried some dichotomous variables, it was really >> fast. > > > Yes, and if you are on a multicore system with multithreaded linear > algebra, crossprod() will distribute the job across the cores making the > elapsed time shorter (by almost half on my Core 2 Duo MacBook as long as I > have nothing else gobbling up CPU cycles)! > >> >> Most categorical variables I am interested in are 3 levels, they are >> actually SNPs, I want to look at their interactions. My question is: >> after >> generating 0-1 codings, like 00, 01, 10, how should I use "crossprod()"? >> Should I just apply this function on these 2*n columns (originally I have >> n >> variables), and then operate on the generated cell counts? > > > If I followed you here, and you have ONLY those three categories, then > yes. > > Try a test case with perhaps 3 SNPs and a few subjects. Table the results > the old fashioned way via table() or xtabs() or even by hand. Then look at > what crossprod( test.case } gives you. > > --- > > If '11' shows up you'll have to use a 'contr.treatment' style > approach. ( run 'example( contrasts )' and look at what is going on). > > Guess what these give: > > contrasts( factor( c( "00","01","10" ) ) ) > contrasts( factor( c( "00","01","10","11" ) ) ) > > then run them if you have trouble seeing why '11' changes the picture. > > --- > > BTW, what I said (below) suggests that crossprod() returns integer > values, but its storage.mode is actually "double". > > HTH, > > Chuck > > I am confused >> about this. >> >> Your input will be greatly appreciated. >> >> Andy > > > > >> >> >> >> Charles C. Berry wrote: >>> >>> On Wed, 9 Jan 2008, AndyZon wrote: >>> >>>> >>>> Hi, >>>> >>>> I have a huge number of categorical variables, say at least 10000, and >>>> I >>>> put >>>> them into a matrix, each column is one variable. The question is: how >>>> can >>>> I >>>> make all of the pairwise cross tabulation tables efficiently? The >>>> straightforward solution is to use for-loops by looping two indexes on >>>> the >>>> table() function, but it was just too slow. Is there a more efficient >>>> way >>>> to >>>> do that? Any guidance will be greatly appreciated. >>> >>> The totals are merely the crossproducts of a suitably constructed binary >>> (zero-one) matrix is used to encode the categories. See >>> '?contr.treatment' >>> if you cannot grok 'suitably constructed'. >>> >>> If the categories are all dichotomies coded as 0:1, you can use >>> >>> res <- crossprod( dat ) >>> >>> to find the totals for the (1,1) cells >>> >>> If you need the full tables, you can get them from the marginal totals >>> using >>> >>> diag( res ) >>> >>> to get the number in each '1' margin and >>> >>> nrow(dat) >>> >>> to get the table total from which the numbers in each '0' margin by >>> subtracting the corresponding '1' margin. >>> >>> With dichotomous variables, dat has 10000 columns and you will only need >>> 10000^2 integers or about 0.75 Gigabytes to store the 'res'. And it >>> takes >>> about 20 seconds to run 1000 rows on my MacBook. Of course, 'res' has a >>> redundant triangle >>> >>> This approach generalizes to any number of categories: >>> >>> To extend this to more than two categories, you will need to do for each >>> such column what model.matrix(~factor( dat[,i] ) ) does by default >>> ( using 'contr.treatment' ) - construct zero-one codes for all but one >>> (reference) category. >>> >>> Note that with 10000 trichotomies, you will have a result with >>> >>> 10000^2 * ( 3-1 )^2 >>> >>> integers needing about 3 Gigabytes, and so on. >>> >>> HTH, >>> >>> Chuck >>> >>> p.s. Why on Earth are you doing this???? >>> >>> >>>> >>>> Andy >>>> -- >>>> View this message in context: >>>> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14723520.html >>>> Sent from the R help mailing list archive at Nabble.com. >>>> >>>> ______________________________________________ >>>> R-help@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >>> >>> Charles C. Berry (858) 534-2098 >>> Dept of Family/Preventive >>> Medicine >>> E mailto:[EMAIL PROTECTED] UC San Diego >>> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego >>> 92093-0901 >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >> >> -- >> View this message in context: >> http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14744086.html >> Sent from the R help mailing list archive at Nabble.com. >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > Charles C. Berry (858) 534-2098 > Dept of Family/Preventive > Medicine > E mailto:[EMAIL PROTECTED] UC San Diego > http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901 > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > -- View this message in context: http://www.nabble.com/pairwise-cross-tabulation-tables-tp14723520p14766784.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.