While I'm on this point, there's another, more subtle issue with using sparseMatrix(). Specifically, there's a distinction between zeros and missing values when considering a ContactMatrix. For example, in Hi-C data, a zero in the matrix means there aren't any read pairs mapping between the corresponding bins. A missing value means that the count for the bin pair is unknown, e.g., because that particular pairwise interaction was missing from the InteractionSet during conversion.

This difference may be important in calculating correct statistics; one can imagine situations where assuming all missing values are zero would not be appropriate. In general, I would expect missing values to make up most of the matrix entries after conversion from an InteractionSet. sparseMatrix() doesn't seem to support using NA as the implicit value for entries that aren't explicitly stored; it's fixed at zero, which makes mathematical sense but isn't quite right for our purposes.
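
To make the distinction concrete, here is a minimal sketch with toy data. The indices and counts are made up for illustration, and a plain base-R matrix stands in for whatever the ContactMatrix ends up storing; this is not how the package does it.

```r
library(Matrix)

# Hypothetical toy data: three observed bin pairs in a 4 x 4 contact matrix.
i <- c(1, 2, 4)
j <- c(2, 3, 4)
x <- c(5, 0, 2)   # pair (2, 3) was observed, with a genuine count of zero

sm <- sparseMatrix(i = i, j = j, x = x, dims = c(4, 4))

# Both of these read back as 0, so the two cases cannot be told apart
# from the values alone.
sm[1, 1]   # never observed
sm[2, 3]   # observed, count of zero

# A dense matrix initialised with NA keeps the two cases separate.
dm <- matrix(NA_real_, 4, 4)
dm[cbind(i, j)] <- x
is.na(dm[1, 1])   # TRUE: missing
dm[2, 3]          # 0: observed zero
```

The dense version obviously gives up the memory savings, which is the trade-off being discussed here.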

Now, this might not be so bad for count data, depending on how you counted the reads into bin pairs; converting all NAs to zeros might be okay in such circumstances, if those NAs arose in the first place from a lack of reads. However, if you fill the contact matrix with other metrics (e.g., log-FCs, average log-CPMs), assuming that all missing values are zero would probably be incorrect.
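
As a rough illustration of why this matters for something like log-FCs, consider one row of a contact matrix with made-up values, where NA marks bin pairs that were absent from the InteractionSet:

```r
# Hypothetical log-fold changes for one row; NA = pair not in the InteractionSet.
logfc <- c(2, NA, NA, 4)

# Treating missing values as zero drags the average towards zero.
as_zero <- logfc
as_zero[is.na(as_zero)] <- 0
mean(as_zero)               # 1.5

# Ignoring missing values averages only over the observed pairs.
mean(logfc, na.rm = TRUE)   # 3
```

Which of the two answers is appropriate depends on why the values were missing, but they are clearly not interchangeable.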

Anyway, food for thought.

- Aaron

On 16/11/15 10:31, Aaron Lun wrote:
Thanks for the comment Nadhir.

I had considered the use of a sparse matrix class. The reason I didn't
implement it originally is because truly sparse interaction data would
be better represented by just working with the pairwise format in the
InteractionSet. You need the row/column indices to pass to the
sparseMatrix constructor anyway; a memory-efficient algorithm to do, for
example, compartment identification could just use that directly.
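
For instance, a per-bin coverage vector can be computed straight from the pairwise format without ever building a matrix. The sketch below uses made-up anchor indices and counts; the variable names are hypothetical and not part of the InteractionSet API.

```r
# Hypothetical pairwise data: anchor indices and one count per interaction.
anchor1 <- c(1L, 2L, 4L, 4L)
anchor2 <- c(2L, 3L, 1L, 4L)
counts  <- c(5, 1, 3, 2)
nbins   <- 4L

# Sum counts over the first anchor of each pair.
coverage <- numeric(nbins)
cov1 <- tapply(counts, anchor1, sum)
coverage[as.integer(names(cov1))] <- cov1

# Add counts over the second anchor, skipping self-interactions
# (anchor1 == anchor2) so that they are only counted once.
off  <- anchor1 != anchor2
cov2 <- tapply(counts[off], anchor2[off], sum)
idx  <- as.integer(names(cov2))
coverage[idx] <- coverage[idx] + cov2

coverage   # 8 6 1 5
```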

Most existing algorithms for doing this (e.g., k-means/hierarchical
clustering) won't operate natively from a sparseMatrix, and I suspect
they'll just run as.matrix() and convert it to a full matrix. Obviously,
this would defeat the purpose of using a sparse matrix. So, if you have
to rewrite the algorithms anyway, you might as well rewrite them in a
manner that avoids needing the sparseMatrix() as a middleman.
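
A quick sketch of what that conversion costs, using an arbitrary 2000 x 2000 matrix with 1000 non-zero entries (the sizes are chosen only for illustration):

```r
library(Matrix)

set.seed(1)
n <- 2000
sm <- sparseMatrix(i = sample(n, 1000), j = sample(n, 1000),
                   x = rep(1, 1000), dims = c(n, n))

# This is what an algorithm that doesn't understand sparse matrices
# would end up doing internally.
dense <- as.matrix(sm)

object.size(sm)      # a few tens of kilobytes
object.size(dense)   # about 32 MB (n * n doubles)
```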

Nonetheless, it's a good point about memory usage. I'll have a think
about it; sparseMatrix() would help a bit, but as coverage increases for
these experiments, the matrix will probably become fairly dense (even if
it's just counts of 1 for some bin pairs). Even now, the bins used for
compartment detection are large enough that sparseness usually isn't
observed. Perhaps big.matrix() might be a better choice.

Cheers,

Aaron


On 16/11/15 09:58, DJEKIDEL MOHAMED NADHIR wrote:
Hi Aaron,

Sounds like a great initiative.
I just have some comments about the ContactMatrix-Class.

I think that with increasing Hi-C resolution, using the matrix class
will consume a lot of memory.
Maybe using sparseMatrix from the Matrix package would give a smaller
footprint.

It can also be manipulated in C++ using RcppEigen, for example if you
plan to add functionality such as AB domains or insulation scores,
etc.

Regards,

- Nadhir

On Mon, Nov 16, 2015 at 5:33 PM, Aaron Lun <a...@wehi.edu.au
<mailto:a...@wehi.edu.au>> wrote:

    Hello all,

    I thought I might give an update on the state of affairs for the
    InteractionSet package. Currently, there are three classes:

    - the GInteractions class, inheriting from Vector and intended to
    represent pairwise interactions between genomic regions (based on
    suggestions from Malcolm Perry and Liz Ing-Simmons).

    - the InteractionSet class, inheriting from SummarizedExperiment0
    and containing a GInteractions object; intended to store
    experimental data about pairwise interactions (one interaction per
    row).

    - the ContactMatrix class, inheriting from Annotated and storing
    data in matrix form (where rows/columns represent genomic regions).

    Getters, setters, conversion methods between classes, distance
    calculation methods and overlap methods have been implemented. Man
    pages and "testthat" scripts have also been written. Still missing a
    vignette, though it should be easy enough to write one.

    All in all, I think it's a solid first draft. Any comments would be
    appreciated.

    Cheers,

    Aaron

    On 08/11/15 19:31, Aaron Lun wrote:

        Okay, some meat and bones are on GitHub now:

        https://github.com/LTLA/InteractionSet

        The idea is to represent genomic interactions as pairs of genomic
        regions, using indices to point to a common GRanges object (a la
        Hits, though I haven't used that explicitly due to the presence of
        additional constraints on the indices). Data for each interaction
        is stored using a SummarizedExperiment framework (one row per
        interaction).

        With regards to the methods, most of the low-hanging fruit has
        been implemented, courtesy of inheriting from
        SummarizedExperiment0. I'll add proper unit tests over the coming
        week. It currently passes R CMD check okay, except for a warning
        about ":::" in the cbind/rbind definitions (callNextMethod()
        didn't seem to work inside those methods, and I didn't want to
        rewrite the SE0 binding methods).

        Any thoughts appreciated.

        - Aaron

        On 07/11/15 19:33, Morgan, Martin wrote:

            Just to say that this is a great idea. If this starts as a
            GitHub package (or in svn, we can create a location for you
            if you'd like), I and others would, I am sure, be happy to try
            to provide any guidance / insight. The main design principles
            are probably to reuse as much as possible from existing
            classes, especially the S4Vectors / GRanges world, and to
            integrate metadata as appropriate (like SummarizedExperiment,
            for instance).

            Martin
            ________________________________________
            From: Bioc-devel [bioc-devel-boun...@r-project.org
            <mailto:bioc-devel-boun...@r-project.org>] on behalf of Aaron
            Lun [a...@wehi.edu.au <mailto:a...@wehi.edu.au>]
            Sent: Thursday, November 05, 2015 12:27 PM
            To: bioc-devel@r-project.org <mailto:bioc-devel@r-project.org>
            Subject: Re: [Bioc-devel] Base class for interaction data -
            expressions of interest

            There's a growing number of Bioconductor packages dealing
            with interaction data; diffHic, GenomicInteractions, HiTC, to
            name a few (and probably more in the future). Each of these
            packages defines its own class to store interaction data -
            DIList for diffHic, GenomicInteractions for
            GenomicInteractions, and HTClist for HiTC.

            These classes seem to share a lot of features, which suggests
            that they can be (easily?) replaced with a common class. This
            would have two advantages - one, developers of new and
            existing packages don't have to continually write and
            maintain new classes; and two, it provides users with a
            consistent user experience across the relevant packages.

            My question is, does anybody have anything in the pipeline
            with respect to a base package for an interaction class? If
            not, I'm planning to put something together for the next BioC
            release. To this end, I'd welcome any ideas/input/code; the
            aim is to make a drop-in replacement (insofar as that's
            possible) for the existing classes in each package.

            Cheers,

            Aaron

            _______________________________________________
            Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
            mailing list
            https://stat.ethz.ch/mailman/listinfo/bioc-devel







