[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq

Gordon K Smyth Tue, 20 Nov 2012 16:26:42 -0800

Dear Ryan,

Thank you for raising the issue with all zero counts and glmQLFTest(). Iam not sure why you are running the code with all zero cases, since weadvise against it, but I agree that it would be better for the code tohandle such cases more gracefully.

Rather than posting your own fixes to bioc-devel, the best way to dealwith a bug in a Bioconductor package is to write to the package authorsand get it fixed in the package itself. Most authors are grateful to benotified of problems and will fix them asap. I have certainly respondedto you when you have written to me. Indeed I was the one who suggestedthat you try glmQLFTest in the first place. We invite users to send bugreports to us in Section 1.3 "how to get help" of the edgeR User's Guide.

On reason why creating a parallel universe of hacked functions is not agood idea is that packages change. As I told you a month ago, theglmQLFTest is under rapid development, and our non-public code has movedon in many ways since the version that you have hacked. In particular, wefixed the issue with all zero counts a while back. I admit that wehaven't yet committed our new versions to the Bioconductor repository.There are so many changes in the package that we wanted to completely testand debug everything before making them public.

Anyway, I will now give priority to commiting a correction to the officialBioconductor release.


Best wishes
Gordon

Date: Mon, 19 Nov 2012 18:08:04 -0800
From: "Ryan C. Thompson" <r...@thompsonclan.org>
To: bioc-devel@r-project.org
Subject: [Bioc-devel] Utility functions and bug fixes for edgeR/DESeq

In my work, I've developed a few useful functions relating to edgeR and
DESeq. First, I wrote a version of glmQLFTest that does not throw errors
on zero-count genes. See the comment for details on what exactly it does:

## This version of glmQLFTest excludes genes with zero counts in all
## samples (logCPM == -Inf), because these cause squeezeVar to throw
## errors. Instead, it just arbitrarily sets their F statistics to
## zero and P-Values to 1, which is appropriate because obviously we
## cannot call differential expression if all counts are zero.
glmQLFTest.safe <- function (glmfit, coef = ncol(glmfit$design),
contrast = NULL,
                             abundance.trend = TRUE)
{
  present <- is.finite(glmfit$abundance) & is.finite(glmfit$dispersion)
  out <- glmLRT(glmfit, coef = coef, contrast = contrast)
  df.residual <- glmfit$df.residual
  s2 <- glmfit$deviance/df.residual
  df.residual[s2 < 1e-14] <- 0
  s2 <- pmax(s2, 0)
  if (abundance.trend)  {
    s2.fit <- squeezeVar(s2[present],
                         df = df.residual[present],
                         covariate = glmfit$abundance[present])
  }
  else {
    s2.fit <- squeezeVar(s2, df = df.residual)
  }
  F <- rep(0, length(s2))
  F[present] <- (out$table$LR/out$df.test)[present]/s2.fit$var.post
  df.total <- s2.fit$df.prior + df.residual
  max.df.residual <- ncol(glmfit$counts) - ncol(glmfit$design)
  df.total <- min(df.total, length(s2) * max.df.residual)
  F.pvalue <- pf(F, df1 = out$df.test, df2 = df.total, lower.tail = FALSE,
                 log.p = FALSE)
  i <- s2.fit$var.post < 1
  if (any(i)) {
    chisq.pvalue <- pchisq(out$table$LR[i], df = out$df.test[i],
                           lower.tail = FALSE, log.p = FALSE)
    F.pvalue[i] <- pmax(F.pvalue[i], chisq.pvalue)
  }
  out$table$LR <- out$table$PValue <- NULL
  out$table$F <- F
  out$table$PValue <- F.pvalue
  out$s2.fit <- s2.fit
  out$df.prior <- s2.fit$df.prior
  out$df.total <- df.total
  out
}

Secondly, I wanted to use edgeR's glmLRT and topTags function to
generate nice tables for my DESeq results, so I wrote a few functions
that allow me to use edgeR's glmFit function on a DESeq CountDataSet to
yield a DGEGLM object just as if glmFit was called on a DGEList object.
I have verified that these functions produce nearly identical fit
coefficients to DESeq's own fitNbinomGLMs, and the resulting p-values
from glmLRT are nearly identical to DESeq's nbinomGLMTest, so this
accurately reproduces the DESeq method using edgeR functions. This
allows you to use edgeR's topTags to get a nice table with PValues, FDR,
logFC, logCPM, and annotations. I also tried glmQLFTest on the results,
but it didn't change any of the p-values because the F-test derived
p-values were all smaller so it just kept the chi-squared-derived
p-values, so I'm not sure if glmQLFTest is valid with DESeq. Anyway,
here are the functions:

## The following two functions allow the use to edgeR's infrastructure
## to execute the DESeq method. Instead of glmFit(dge, design), use
## glmFit.CountDataSet(cds, design), then continue as with a normal
## edgeR analysis.
getOffset.CountDataSet <- function(y) {
  if (any(is.na(sizeFactors(y))))
    stop("Call estimateDispersions first")
  log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
}

glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL,
offset = NULL,
                                 weights = NULL, lib.size = NULL, start
= NULL, method = "auto",
                                 ...)
{
  stopifnot(is(y, "CountDataSet"))
  if (is.null(dispersion)) {
    if ("disp_pooled" %in% colnames(fData(y)))
      dispersion <- fData(y)$disp_pooled
    else if ("disp_blind" %in% colnames(fData(y))) {
      if (fitInfo(y, "blind")$sharingMode != "fit-only")
        warning("You have used 'method=\"blind\"' in estimateDispersion
without also setting 'sharingMode=\"fit-only\"'. This will not yield
useful results.")
      dispersion <- fData(y)$disp_blind
    }
    else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or
'blind') first.")
  }
  if (is.null(offset) && is.null(lib.size))
    offset <- getOffset.CountDataSet(y)
  ## UGLY HACK: DESeq can produce infinite dispersion estimates, which
  ## cause errors in glmFit. To fix this, we replace infinite
  ## dispersion estimates with the maximum representable floating
  ## point value, which should always result in a PValue of 1.
  infdisp <- !is.finite(dispersion)
  dispersion[infdisp] <- .Machine$double.xmax
  fit <- glmFit.default(y = counts(y), design = design, dispersion =
dispersion,
                        offset = offset, weights = weights, lib.size =
lib.size,
                        start = start, method = method, ...)
  ## Now set the dispersions back to infinite in the resulting fit object.
  fit$dispersion[infdisp] <- +Inf
  dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
  fit$counts <- counts(y)
  fit$samples <- pData(y)[-1]
  fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind",
"disp_pooled"))]
  new("DGEGLM", fit)
}

Note that while I have named the functions like S3 method, I haven't
actually tested whether method dispatch works properly with them or not,
since DESeq uses S4 classes. In my code, I call them by their full names
instead of relying on method dispatch.

Please let me know if these seem useful to anyone. I've found them
useful, because now I can produce edgeR and DESeq results in exactly the
same format (via topTags).

Regards,

-Ryan Thompson


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq

Reply via email to