Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Michael Lawrence Tue, 17 Sep 2013 10:16:43 -0700

When I hit this I usually just do:

seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))


It would be nice if there were a function for sorting chromosome names
according to a specified naming convention. Like sortSeqlevels(gr,
UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.

Michael






On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
<st...@channing.harvard.edu>wrote:

> I am replying to this as Michael mentions that this is a powerful
> operation.  Here's my
> problem that I believe is related, but I do not see a straightforward
> solution
>
> > bhmm19uni
> GRanges with 559456 ranges and 4 metadata columns:
>          seqnames                 ranges strand   |              name
> score
>             <Rle>              <IRanges>  <Rle>   |       <character>
> <numeric>
>        1     chr1 [ 67229025,  67341224]      *   | 13_Heterochrom/lo
>     0
>        2     chr1 [203057955, 203060154]      *   |      12_Repressed
>     0
>        3     chr1 [  8458427,   8486226]      *   | 13_Heterochrom/lo
>     0
>        4     chr1 [ 16904427,  16904826]      *   | 5_Strong_Enhancer
>     0
>        5     chr1 [ 25289427,  25293426]      *   |       11_Weak_Txn
>     0
>      ...      ...                    ...    ... ...               ...
>   ...
>   570766    chr22   [51100931, 51101330]      *   |   2_Weak_Promoter
>     0
>   570767    chr22   [51101331, 51101530]      *   |   6_Weak_Enhancer
>     0
>   570768    chr22   [51101531, 51101730]      *   |   2_Weak_Promoter
>     0
>   570769    chr22   [51101731, 51178130]      *   | 13_Heterochrom/lo
>     0
>   570770    chr22   [51178131, 51178330]      *   | 15_Repetitive/CNV
>     0
>                           thick     numcode
>                       <IRanges> <character>
>        1 [ 67001613,  67113812]          13
>        2 [201324578, 201326777]          12
>        3 [  8381014,   8408813]          13
>        4 [ 16777014,  16777413]           5
>        5 [ 25162014,  25166013]          11
>      ...                    ...         ...
>   570766   [49447797, 49448196]           2
>   570767   [49448197, 49448396]           6
>   570768   [49448397, 49448596]           2
>   570769   [49448597, 49524996]          13
>   570770   [49524997, 49525196]          15
>   ---
>   seqlengths:
>         chr1     chr10     chr11      chr5 ...      chr2     chr21
>  chr4
>    249250621 135534747 135006516 180915260 ... 243199373  48129895
> 191154276
>
> If I try to make a karyogram with ggbio, the chromosomes are in an
> unnatural order.  What is the right approach to getting the seqinfo in a
> natural order with respect to chromosome names and lengths?  Reassigning
> seqlevels seems dangerous but I haven't experimented fully.  It must be a
> common use case but I do not see it in doc.
>
>
>
> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
> lawrence.mich...@gene.com> wrote:
>
>> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
>>
>> > Michael,
>> >
>> >
>> > On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>> >
>> >>
>> >>
>> >>
>> >> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org
>> >> <mailto:hpa...@fhcrc.org>> wrote:
>> >>
>> >>     Hi,
>> >>
>> >>
>> >>     On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>> >>
>> >>         On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>> >>         <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote:
>> >>
>> >>             Hi Michael --
>> >>
>> >>
>> >>             On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>> >>
>> >>                 While it's cool that seqlevels<- has become so
>> flexible,
>> >>                 I still claim
>> >>                 that
>> >>                 rename/keep/drop would be a lot easier to read, because
>> >>                 they describe the
>> >>                 high-level operation, and there's no reason to
>> deprecate
>> >>                 them. They're
>> >>                 also
>> >>                 easier to remember. For example, for dropping with
>> >>                 seqlevels<-, the user
>> >>                 needs
>> >>                 to remember that "force" is necessary to drop. Much
>> >>                 easier to just say
>> >>                 "dropSeqlevels(), please". And reimplementing
>> >>                 keepSeqlevels is still too
>> >>                 complicated. Is it such a maintenance burden to have
>> >>                 these simple
>> >>                 wrappers sit
>> >>                 on top of the low-level manipulators?
>> >>
>> >>                 Another suggestion: renameSeqlevels should not require
>> a
>> >>                 named vector (in
>> >>                 cases
>> >>                 where it is unnamed, require that it is parallel to the
>> >>                 existing
>> >>                 seqlevels, as
>> >>                 with seqlevels<-).
>> >>
>> >>
>> >>             I didn't really indicate what drove my desire to see
>> >>             keepSeqlevels
>> >>             deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
>> were
>> >>             developed more
>> >>             or less independently, and have different contracts. The
>> >>             different
>> >>             contracts are confusing to the user, as is creating a
>> usable
>> >>             help system
>> >>             when there are multiple end points for what is a common
>> >>             operation. The help
>> >>             pages of each were inadequate in different ways. And
>> because
>> >>             the code was
>> >>             developed independently, support for different objects was
>> >>             inconsistent. So
>> >>             actually, yes, the maintenance (and use) burden for the
>> >>             previous state of
>> >>             affairs was substantial.
>> >>
>> >>             On the other hand, I agree that keepSeqlevels is convenient
>> >>             as a simple
>> >>             wrapper around seqlevels<-, in the same way that setNames
>> >>             and names<- are
>> >>             both useful.
>> >>
>> >>             So we could iterate to
>> >>
>> >>                 keepSeqlevels <-
>> >>                     function(x, value, ...)
>> >>                 {
>> >>                     seqlevels(x, force=TRUE) <- value
>> >>                     x
>> >>                 }
>> >>
>> >>             but I'd be less enthusiastic about maintaining the original
>> >>             contract of
>> >>             keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
>> >>             worked for two
>> >>             GRanges objects.
>> >>
>> >>
>> >>     Why would this be called keepSeqlevels() if, depending on how it's
>> >> used,
>> >>     it will either add, drop, rename, or permute the seqlevels?
>> >>     Couldn't this be called setSeqlevels?
>> >>
>> >>
>> >> I thought that new2old was necessary for permuting.
>> >>
>> >
>> > The seqlevels setter has no 'new2old' arg.
>> >
>> >
>> >  You're right that
>> >> adding should be disallowed for keepSeqlevels(). Adding seqlevels is
>> not
>> >> a common operation. The two common operations are:
>> >> * Permuting, usually because the data were imported in non-canonical
>> >> order (seqnameStyle was designed to address this, no?).
>> >>
>> >
>> > Permuting is straightforward with seqlevels<-:
>> >
>> >   > seqlevels(gr)
>> >   [1] "chr1" "chr2" "chr3"
>> >
>> >   > seqlevels(gr) <- rev(seqlevels(gr))
>> >
>> >   > seqlevels(gr)
>> >   [1] "chr3" "chr2" "chr1"
>> >
>> >
>> >  * Subsetting, either by keeping or dropping.
>> >>
>> >
>> > Also straightforward:
>> >
>> >   > seqlevels(gr, force=TRUE) <- "chr2"
>> >
>> >   > seqlevels(gr)
>> >   [1] "chr2"
>> >
>> >
>> >  Two main reasons: taking a
>> >> small slice of the data for prototyping, or removing problematic
>> >> chromosomes (sex, circular). This goes back to bsapply and the exclude
>> >> argument.
>> >>
>> >
>> > There are other important use cases:
>> >
>> >   - Dropping seqlevels that are *not* in use. This happens for example
>> >     when subsetting a BAM file with Rsamtools::filterBam to keep only
>> >     the alignments located on a given chromosome. The entire sequence
>> >     dictionary (in the header) is propagated to the resulting BAM file
>> >     so when you read the file with readGAlignments() you end up with a
>> >     bunch of useless seqlevels that you generally start by dropping.
>> >     You don't want to drop any alignment so force=FALSE is your friend.
>> >
>> >
>> This is sort of tangential to the discussion, but do you really want to do
>> this? I would preserve the universe as given by the BAM.
>>
>>
>> >   - An even more common operation is renaming: 90% of the times that's
>> >     what I use seqlevels<- for, and that's what I tell people to use
>> >     when they need to rename their chromosomes. Also straightforward:
>> >
>> >       > seqlevels(gr)
>> >       [1] "chr1" "chr2" "chr3" "chrM"
>> >
>> >       > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
>> >       > seqlevels(gr)
>> >       [1] "1" "2" "3" "M"
>> >
>> >       > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
>> >       > seqlevels(gr)
>> >       [1] "1"  "2"  "3"  "MT"
>> >
>> >     Note that this is simpler to use than renameSeqlevels() which
>> >     always required you to build the entire right value as a named
>> >     vector.
>> >
>> >
>> Sure, renaming is a common use case. Not sure how I forgot that.
>>
>> As I wrote earlier in the thread, renameSeqlevels should be changed so as
>> not to require naming of the vector.
>>
>>
>> > So the added-value of keepSeqlevels() seems to boil down to:
>> >
>> >   (a) it always uses force=TRUE
>> >
>> >   (b) it preserves the original object and returns the modified object
>> >
>> > If you want to restrict the use of keepSeqlevels() to permuting and
>> > dropping, you'll also have to disallow renaming (in addition to adding).
>> > Then its name will more or less reflect what it does (if "keep" means
>> > "permute" or "drop"). The final result will be something that does less
>> > than setSeqlevels() and that is not necessarily easier to read: both
>> > will set on the object whatever vector of seqlevels they are supplied,
>> > except that one will fail when doing so would actually result in adding
>> > or renaming seqlevels.
>> >
>> >
>> So it looks like seqlevels<- is pretty powerful. Now I see that if I
>> scroll
>> down to the examples in the man page, I find the actual documentation.
>> It's
>> cool to learn that we can replace seqlevels on TxDb objects. My argument
>> has always been that since these low-level operations are so powerful,
>> it's
>> nice to have high-level operations to clarify the code. We have to think
>> hard to know what the RHS will do in that replacement. It's probably one
>> of
>> the most complex replacement operations; far more complex than the typical
>> one, including levels<- on factor. There's nothing wrong with that; it's
>> just complexity that we should be able to hide.
>>
>> Michael
>>
>> H.
>> >
>> >
>> >> Michael
>> >>
>> >>
>> >>     H.
>> >>
>> >>
>> >>
>> >>         Well, I wasn't even aware of that feature, so you've made your
>> >>         point about
>> >>         the documentation ;) Sounds like a good compromise to me.
>> >>
>> >>         Thanks for understanding,
>> >>         Michael
>> >>
>> >>
>> >>
>> >>             Martin
>> >>
>> >>
>> >>                 Michael
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>                 On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>> >>                 <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
>> >>                 <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org
>> >>>
>> >>
>> >>                 wrote:
>> >>
>> >>                       On 05/18/2013 05:42 PM, Martin Morgan wrote:
>> >>
>> >>                           Some of the most common operations are
>> >>                 straight-forward, e.g.,
>> >>
>> >>                                 > gr = GRanges(paste0("chr", c(1:22,
>> >>                 "X", "Y")), IRanges(1,
>> >>                 100))
>> >>                                 > seqlevels(gr) = sub("chr", "ch",
>> >>                 seqlevels(gr))
>> >>
>> >>
>> >>                       To get a more comprehensive example I should have
>> >>                 followed my own
>> >>                 advice and
>> >>                       grabbed from the help page!
>> >>
>> >>                             ## Rename:
>> >>                             seqlevels(txdb) <- sub("chr", "",
>> >>                 seqlevels(txdb))
>> >>                             seqlevels(txdb)
>> >>
>> >>                             seqlevels(txdb) <- paste0("CH",
>> >> seqlevels(txdb))
>> >>                             seqlevels(txdb)
>> >>
>> >>                             seqlevels(txdb)[seqlevels(__****__txdb) ==
>>
>> >>
>> >>                 "CHM"] <- "M"
>> >>
>> >>
>> >>                             seqlevels(txdb)
>> >>
>> >>
>> >>                       --
>> >>                       Computational Biology / Fred Hutchinson Cancer
>> >>                 Research Center
>> >>                       1100 Fairview Ave. N.
>> >>                       PO Box 19024 Seattle, WA 98109
>> >>
>> >>                       Location: Arnold Building M1 B861
>> >>                       Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>> >>                 <tel:%28206%29%20667-2793>
>> >>
>> >>
>> >>
>> >>
>> >>             --
>> >>             Dr. Martin Morgan, PhD
>> >>
>> >>             Fred Hutchinson Cancer Research Center
>> >>             1100 Fairview Ave. N.
>> >>             PO Box 19024 Seattle, WA 98109
>> >>
>> >>
>> >>                  [[alternative HTML version deleted]]
>> >>
>> >>         ______________________________**___________________
>> >>         Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.**org<
>> Bioc-devel@r-project.org>
>> >> >
>> >>         mailing list
>> >>         https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<
>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>> >>
>> >>         <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>> >> >
>> >>
>> >>
>> >>     --
>> >>     Hervé Pagès
>> >>
>> >>     Program in Computational Biology
>> >>     Division of Public Health Sciences
>> >>
>> >>     Fred Hutchinson Cancer Research Center
>> >>     1100 Fairview Ave. N, M1-B514
>> >>     P.O. Box 19024
>> >>     Seattle, WA 98109-1024
>> >>
>> >>     E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
>> >>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> >>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>> >>
>> >>
>> >>
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpa...@fhcrc.org
>> > Phone:  (206) 667-5791
>> > Fax:    (206) 667-1319
>> >
>>
>>         [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Reply via email to