Hi Gordon and list,

I've been thinking about how to make it easier to specify which hypotheses one wants to test in microarray or RNA-seq differential expression data sets. I think one of the major stumbling blocks is that a design matrix must have one coefficient "missing" for each term. So if you have several experimental factors and blocking factors, you can't have a column of the design matrix corresponding to every level of every factor in the formula. But I believe every level of every factor could be represented as a contrast of one or more coefficients in the design matrix. For example, if you had a variable "cond" with 3 levels "A", "B", and "C", and you did "model.matrix(~1+cond)", you would get a design matrix with an intercept (the mean of A), a B-A term, and a C-A term, with names "(Intercept)", "condB", and "condC". From this, you could solve for the contrast representing each of A, B, and C. With the default treatment coding, A is just "(Intercept)", B is "(Intercept) + condB", and C is "(Intercept) + condC". Expressed as contrast vectors in R, these would be "c(1, 0, 0)", "c(1, 1, 0)", and "c(1, 0, 1)". (Of course, for this trivial example one can just do "model.matrix(~0+cond)", but that doesn't work for all the factors in a multi-factor design.)
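As a quick check of the single-factor example, here is a minimal base R sketch. It assumes R's default treatment contrasts and a made-up balanced layout of six samples; the object names (level_contrasts, B_vs_C) are only for illustration. Each distinct row of the design matrix is the coefficient weighting that gives the fitted mean of one level, so the per-level contrast vectors can be read straight off the design.

```r
# Toy single-factor layout (assumed for illustration): two samples per level.
cond <- factor(c("A", "A", "B", "B", "C", "C"))
design <- model.matrix(~1 + cond)

# Keep one design row per level. That row holds the weights on
# (Intercept), condB, condC that reproduce the level's fitted mean.
keep <- !duplicated(design)
level_contrasts <- t(design[keep, , drop = FALSE])
colnames(level_contrasts) <- as.character(cond[keep])

# Rows are coefficients, columns are levels A, B, C.
print(level_contrasts)

# Comparisons are then plain column arithmetic, e.g. B versus C,
# which puts weights 0, 1, -1 on (Intercept), condB, condC.
B_vs_C <- level_contrasts[, "B"] - level_contrasts[, "C"]
print(B_vs_C)
```

The same column arithmetic works unchanged if the design is built with "~0+cond" instead, since only the rows of the design matrix are used.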

So, in the same step as the design matrix is created, the function could also return, regardless of how the model formula was parametrized, a matrix where each column is the contrast corresponding to one level of one of the factors in the model formula. (This could be added as an attribute on the design matrix, for example.) The user could then add and subtract these columns (perhaps with a helper function similar to makeContrasts that allows it to be done symbolically) to get the contrasts that they want without having to worry about exactly how the contrasts are coded into the design matrix. Obviously, for multi-factor designs, this matrix of factor levels coded as contrasts would have more columns than the design matrix itself. For example, if an experiment has a 2-level factor and a 3-level factor (and the model has no interaction term), then the design matrix would have 4 columns, but the "available factor level matrix" would have 5 columns.
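To make the idea concrete, here is a rough base R sketch of what such a function might look like. The name factor_level_contrasts is hypothetical and is not part of limma or edgeR. It assumes a one-sided formula whose variables are all factors, and it makes one particular choice that is only an assumption: each level is defined as its fitted mean averaged with equal weight over the levels of the other factors. Other definitions are possible.

```r
# Hypothetical helper (not in limma or edgeR): returns a matrix with one
# column per level of every factor in `data`. Each column holds the weights
# on the design-matrix coefficients that give that level's fitted mean,
# averaged equally over the levels of the other factors (an assumption).
factor_level_contrasts <- function(formula, data) {
  factors <- names(data)[vapply(data, is.factor, logical(1))]
  # Every combination of factor levels, one row each.
  grid <- expand.grid(lapply(data[factors], levels))
  grid_design <- model.matrix(formula, grid)
  cols <- list()
  for (f in factors) {
    for (lev in levels(data[[f]])) {
      rows <- grid[[f]] == lev
      cols[[paste0(f, lev)]] <- colMeans(grid_design[rows, , drop = FALSE])
    }
  }
  do.call(cbind, cols)
}

# Made-up experiment: a 2-level factor and a 3-level factor.
data <- expand.grid(treat = c("ctrl", "drug"), batch = c("b1", "b2", "b3"))
design <- model.matrix(~treat + batch, data)
levels_mat <- factor_level_contrasts(~treat + batch, data)

ncol(design)      # number of coefficients in the design matrix
ncol(levels_mat)  # number of available factor levels

# The user picks levels and subtracts, without looking at the coding.
drug_vs_ctrl <- levels_mat[, "treatdrug"] - levels_mat[, "treatctrl"]
print(drug_vs_ctrl)
```

Because the helper only looks at rows of the design matrix built from the formula, the same call should give equivalent level columns for "~0 + treat + batch", which is the point of hiding the parametrization from the user.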

The advantage of such a scheme would be that the computer can tell the user, in addition to the coefficients in the design matrix, "here are the available factor levels that you can perform comparisons on", and the user could pick the ones they are interested in and add/subtract them to get the test they want.

What do you think of this idea? Could it work in practice for limma and edgeR? I would be interested in writing code to make it a reality if you thought it was worthwhile.

Sincerely,
-Ryan Thompson

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
