In my work, I've developed a few useful functions relating to edgeR and DESeq. First, I wrote a version of glmQLFTest that does not throw errors on zero-count genes. See the comment for details on what exactly it does:

## This version of glmQLFTest excludes genes with zero counts in all
## samples (logCPM == -Inf), because these cause squeezeVar to throw
## errors. Instead, it just arbitrarily sets their F statistics to
## zero and P-Values to 1, which is appropriate because obviously we
## cannot call differential expression if all counts are zero.
glmQLFTest.safe <- function (glmfit, coef = ncol(glmfit$design), contrast = NULL,
                             abundance.trend = TRUE)
{
  present <- is.finite(glmfit$abundance) & is.finite(glmfit$dispersion)
  out <- glmLRT(glmfit, coef = coef, contrast = contrast)
  df.residual <- glmfit$df.residual
  s2 <- glmfit$deviance/df.residual
  df.residual[s2 < 1e-14] <- 0
  s2 <- pmax(s2, 0)
  if (abundance.trend)  {
    s2.fit <- squeezeVar(s2[present],
                         df = df.residual[present],
                         covariate = glmfit$abundance[present])
  }
  else {
    s2.fit <- squeezeVar(s2, df = df.residual)
  }
  F <- rep(0, length(s2))
  F[present] <- (out$table$LR/out$df.test)[present]/s2.fit$var.post
  df.total <- s2.fit$df.prior + df.residual
  max.df.residual <- ncol(glmfit$counts) - ncol(glmfit$design)
  df.total <- min(df.total, length(s2) * max.df.residual)
  F.pvalue <- pf(F, df1 = out$df.test, df2 = df.total, lower.tail = FALSE,
                 log.p = FALSE)
  i <- s2.fit$var.post < 1
  if (any(i)) {
    chisq.pvalue <- pchisq(out$table$LR[i], df = out$df.test[i],
                           lower.tail = FALSE, log.p = FALSE)
    F.pvalue[i] <- pmax(F.pvalue[i], chisq.pvalue)
  }
  out$table$LR <- out$table$PValue <- NULL
  out$table$F <- F
  out$table$PValue <- F.pvalue
  out$s2.fit <- s2.fit
  out$df.prior <- s2.fit$df.prior
  out$df.total <- df.total
  out
}




Secondly, I wanted to use edgeR's glmLRT and topTags function to generate nice tables for my DESeq results, so I wrote a few functions that allow me to use edgeR's glmFit function on a DESeq CountDataSet to yield a DGEGLM object just as if glmFit was called on a DGEList object. I have verified that these functions produce nearly identical fit coefficients to DESeq's own fitNbinomGLMs, and the resulting p-values from glmLRT are nearly identical to DESeq's nbinomGLMTest, so this accurately reproduces the DESeq method using edgeR functions. This allows you to use edgeR's topTags to get a nice table with PValues, FDR, logFC, logCPM, and annotations. I also tried glmQLFTest on the results, but it didn't change any of the p-values because the F-test derived p-values were all smaller so it just kept the chi-squared-derived p-values, so I'm not sure if glmQLFTest is valid with DESeq. Anyway, here are the functions:

## The following two functions allow the use to edgeR's infrastructure
## to execute the DESeq method. Instead of glmFit(dge, design), use
## glmFit.CountDataSet(cds, design), then continue as with a normal
## edgeR analysis.
getOffset.CountDataSet <- function(y) {
  if (any(is.na(sizeFactors(y))))
    stop("Call estimateDispersions first")
  log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
}

glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL, offset = NULL, weights = NULL, lib.size = NULL, start = NULL, method = "auto",
                                 ...)
{
  stopifnot(is(y, "CountDataSet"))
  if (is.null(dispersion)) {
    if ("disp_pooled" %in% colnames(fData(y)))
      dispersion <- fData(y)$disp_pooled
    else if ("disp_blind" %in% colnames(fData(y))) {
      if (fitInfo(y, "blind")$sharingMode != "fit-only")
warning("You have used 'method=\"blind\"' in estimateDispersion without also setting 'sharingMode=\"fit-only\"'. This will not yield useful results.")
      dispersion <- fData(y)$disp_blind
    }
else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or 'blind') first.")
  }
  if (is.null(offset) && is.null(lib.size))
    offset <- getOffset.CountDataSet(y)
  ## UGLY HACK: DESeq can produce infinite dispersion estimates, which
  ## cause errors in glmFit. To fix this, we replace infinite
  ## dispersion estimates with the maximum representable floating
  ## point value, which should always result in a PValue of 1.
  infdisp <- !is.finite(dispersion)
  dispersion[infdisp] <- .Machine$double.xmax
fit <- glmFit.default(y = counts(y), design = design, dispersion = dispersion, offset = offset, weights = weights, lib.size = lib.size,
                        start = start, method = method, ...)
  ## Now set the dispersions back to infinite in the resulting fit object.
  fit$dispersion[infdisp] <- +Inf
  dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
  fit$counts <- counts(y)
  fit$samples <- pData(y)[-1]
fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind", "disp_pooled"))]
  new("DGEGLM", fit)
}

Note that while I have named the functions like S3 method, I haven't actually tested whether method dispatch works properly with them or not, since DESeq uses S4 classes. In my code, I call them by their full names instead of relying on method dispatch.

Please let me know if these seem useful to anyone. I've found them useful, because now I can produce edgeR and DESeq results in exactly the same format (via topTags).

Regards,

-Ryan Thompson

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to