[Bioc-devel] Making hypothesis testing easier with design matrices?

Ryan C. Thompson Mon, 10 Dec 2012 02:13:24 -0800

Hi Gordon and list,

I've been thinking about how to make it easier to specify what hypotheses one wants to test in microarray or RNA-seq differential expression data sets, and I think one of the major stumbling blocks that confuses people is the way in which design matrices must have one coefficient "missing" for each term. So if you have several experimental factors and blocking factors, you can't have a column of the design matrix corresponding to every level of every factor in the formula. But I believe every level of every factor could be represented as a contrast of one or more coefficients in the design matrix. For example, if you had a variable "cond" with 3 levels "A", "B", and "C", and you did "model.matrix(~1+cond)", you get a design matrix with an intercept, a B-A term, and a C-A term, with names "(Intercept)", "condB", and "condC". From this, you could solve for the contrast representing each of A, B and C. For example, with this coding A is just "(Intercept)", B is "(Intercept) + condB", and C is "(Intercept) + condC". Expressed as contrast vectors in R, these would be "c(1, 0, 0)", "c(1, 1, 0)", and "c(1, 0, 1)". (Of course, for this trivial example one can just do "model.matrix(~0+cond)", but that doesn't work for all the factors in a multi-factor design.)
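
To make the single-factor example concrete, here is a minimal sketch in base R. The toy data (two samples per level) and the name "level_contrasts" are made up for illustration, and the sketch assumes R's default treatment coding, in which the distinct rows of the design matrix are themselves the contrast vectors that recover each level from the coefficients.

```r
# Toy data (an assumption for illustration): factor "cond" with levels
# A, B, C and two samples per level.
cond <- factor(rep(c("A", "B", "C"), each = 2))

# Default treatment coding: the intercept is level A, condB is B - A,
# and condC is C - A.
design <- model.matrix(~ 1 + cond)
print(colnames(design))

# Each distinct row of the design matrix is the contrast vector that
# expresses one level in terms of the fitted coefficients.
level_contrasts <- unique(design)
rownames(level_contrasts) <- levels(cond)
print(level_contrasts)

# Sanity check on made-up responses: applying the contrast vectors to the
# coefficients of a linear fit recovers the per-level means.
y <- c(1, 1, 2, 2, 4, 4)
level_means <- drop(level_contrasts %*% coef(lm(y ~ cond)))
print(level_means)
```

Subtracting two rows of "level_contrasts" gives the contrast for comparing those two levels; for example, the row for C minus the row for B tests C-B without the user having to know which level was absorbed into the intercept.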

So, in the same step as the design matrix is created, the function could also return, regardless of how the model formula was parametrized, a matrix where each column is the contrast corresponding to one level of one of the factors in the model formula. (This could be added as an attribute on the design matrix, for example.) The user could then add and subtract these columns (perhaps with a helper function similar to makeContrasts that allows it to be done symbolically) to get the contrasts that they want without having to worry about exactly how the contrasts are coded into the design matrix. Obviously, for multi-factor designs, this matrix of factor levels coded as contrasts would have more columns than the design matrix itself. For example, if an experiment has a 2-level factor and a 3-level factor, then the design matrix would have 4 columns, but the "available factor level matrix" would have 5 columns.
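
A rough sketch of what such a function might compute for the 2-level plus 3-level case, in base R. Everything named here ("trt", "batch", "level_matrix", "levels_mat") is hypothetical, and the sketch makes one specific assumption that the proposal itself leaves open: a factor level is defined as the average of the design rows in which it occurs, taken over a balanced grid of the other factor's levels.

```r
# Hypothetical balanced layout: 'trt' has 2 levels, 'batch' has 3 levels.
grid <- expand.grid(trt = factor(c("ctl", "drug")),
                    batch = factor(c("b1", "b2", "b3")))
design <- model.matrix(~ trt + batch, data = grid)

# Hypothetical helper (not part of limma or edgeR): build one contrast
# column per level of each factor by averaging the design rows in which
# that level occurs.
level_matrix <- function(design, data) {
  cols <- list()
  for (f in names(data)) {
    for (lv in levels(data[[f]])) {
      rows <- data[[f]] == lv
      cols[[paste0(f, lv)]] <- colMeans(design[rows, , drop = FALSE])
    }
  }
  do.call(cbind, cols)
}

levels_mat <- level_matrix(design, grid)
print(ncol(design))      # number of coefficients in the design matrix
print(ncol(levels_mat))  # number of available factor levels
print(levels_mat)

# Adding and subtracting columns gives the contrast for a comparison,
# here drug versus control.
drug_vs_ctl <- levels_mat[, "trtdrug"] - levels_mat[, "trtctl"]
print(drug_vs_ctl)
```

In this layout the drug-versus-control difference reduces to the single "trtdrug" coefficient, which is what one would write by hand, but the user gets there without needing to know the parametrization.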

The advantage of such a scheme would be that the computer can tell the user, in addition to the coefficients in the design matrix, "here are the available factor levels that you can perform comparisons on", and the user could pick the ones they are interested in and add/subtract them to get the test they want.

What do you think of this idea? Could it work in practice for limma and edgeR? I would be interested in writing code to make it a reality if you thought it was worthwhile.


Sincerely,
-Ryan Thompson

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
