[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq

Ryan C. Thompson Mon, 19 Nov 2012 18:11:25 -0800

In my work, I've developed a few useful functions relating to edgeR andDESeq. First, I wrote a version of glmQLFTest that does not throw errorson zero-count genes. See the comment for details on what exactly it does:


## This version of glmQLFTest excludes genes with zero counts in all
## samples (logCPM == -Inf), because these cause squeezeVar to throw
## errors. Instead, it just arbitrarily sets their F statistics to
## zero and P-Values to 1, which is appropriate because obviously we
## cannot call differential expression if all counts are zero.

glmQLFTest.safe <- function (glmfit, coef = ncol(glmfit$design),contrast = NULL,

                             abundance.trend = TRUE)
{
  present <- is.finite(glmfit$abundance) & is.finite(glmfit$dispersion)
  out <- glmLRT(glmfit, coef = coef, contrast = contrast)
  df.residual <- glmfit$df.residual
  s2 <- glmfit$deviance/df.residual
  df.residual[s2 < 1e-14] <- 0
  s2 <- pmax(s2, 0)
  if (abundance.trend)  {
    s2.fit <- squeezeVar(s2[present],
                         df = df.residual[present],
                         covariate = glmfit$abundance[present])
  }
  else {
    s2.fit <- squeezeVar(s2, df = df.residual)
  }
  F <- rep(0, length(s2))
  F[present] <- (out$table$LR/out$df.test)[present]/s2.fit$var.post
  df.total <- s2.fit$df.prior + df.residual
  max.df.residual <- ncol(glmfit$counts) - ncol(glmfit$design)
  df.total <- min(df.total, length(s2) * max.df.residual)
  F.pvalue <- pf(F, df1 = out$df.test, df2 = df.total, lower.tail = FALSE,
                 log.p = FALSE)
  i <- s2.fit$var.post < 1
  if (any(i)) {
    chisq.pvalue <- pchisq(out$table$LR[i], df = out$df.test[i],
                           lower.tail = FALSE, log.p = FALSE)
    F.pvalue[i] <- pmax(F.pvalue[i], chisq.pvalue)
  }
  out$table$LR <- out$table$PValue <- NULL
  out$table$F <- F
  out$table$PValue <- F.pvalue
  out$s2.fit <- s2.fit
  out$df.prior <- s2.fit$df.prior
  out$df.total <- df.total
  out
}

Secondly, I wanted to use edgeR's glmLRT and topTags function togenerate nice tables for my DESeq results, so I wrote a few functionsthat allow me to use edgeR's glmFit function on a DESeq CountDataSet toyield a DGEGLM object just as if glmFit was called on a DGEList object.I have verified that these functions produce nearly identical fitcoefficients to DESeq's own fitNbinomGLMs, and the resulting p-valuesfrom glmLRT are nearly identical to DESeq's nbinomGLMTest, so thisaccurately reproduces the DESeq method using edgeR functions. Thisallows you to use edgeR's topTags to get a nice table with PValues, FDR,logFC, logCPM, and annotations. I also tried glmQLFTest on the results,but it didn't change any of the p-values because the F-test derivedp-values were all smaller so it just kept the chi-squared-derivedp-values, so I'm not sure if glmQLFTest is valid with DESeq. Anyway,here are the functions:


## The following two functions allow the use to edgeR's infrastructure
## to execute the DESeq method. Instead of glmFit(dge, design), use
## glmFit.CountDataSet(cds, design), then continue as with a normal
## edgeR analysis.
getOffset.CountDataSet <- function(y) {
  if (any(is.na(sizeFactors(y))))
    stop("Call estimateDispersions first")
  log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
}

glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL,offset = NULL,weights = NULL, lib.size = NULL, start= NULL, method = "auto",

                                 ...)
{
  stopifnot(is(y, "CountDataSet"))
  if (is.null(dispersion)) {
    if ("disp_pooled" %in% colnames(fData(y)))
      dispersion <- fData(y)$disp_pooled
    else if ("disp_blind" %in% colnames(fData(y))) {
      if (fitInfo(y, "blind")$sharingMode != "fit-only")

warning("You have used 'method=\"blind\"' in estimateDispersionwithout also setting 'sharingMode=\"fit-only\"'. This will not yielduseful results.")

      dispersion <- fData(y)$disp_blind
    }

else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or'blind') first.")

  }
  if (is.null(offset) && is.null(lib.size))
    offset <- getOffset.CountDataSet(y)
  ## UGLY HACK: DESeq can produce infinite dispersion estimates, which
  ## cause errors in glmFit. To fix this, we replace infinite
  ## dispersion estimates with the maximum representable floating
  ## point value, which should always result in a PValue of 1.
  infdisp <- !is.finite(dispersion)
  dispersion[infdisp] <- .Machine$double.xmax

fit <- glmFit.default(y = counts(y), design = design, dispersion =dispersion,offset = offset, weights = weights, lib.size =lib.size,

                        start = start, method = method, ...)
  ## Now set the dispersions back to infinite in the resulting fit object.
  fit$dispersion[infdisp] <- +Inf
  dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
  fit$counts <- counts(y)
  fit$samples <- pData(y)[-1]

fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind","disp_pooled"))]

  new("DGEGLM", fit)
}

Note that while I have named the functions like S3 method, I haven'tactually tested whether method dispatch works properly with them or not,since DESeq uses S4 classes. In my code, I call them by their full namesinstead of relying on method dispatch.

Please let me know if these seem useful to anyone. I've found themuseful, because now I can produce edgeR and DESeq results in exactly thesame format (via topTags).


Regards,

-Ryan Thompson

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq

Reply via email to