I have been doing this in a kludgey fashion for a long long time, and would favor restoreNaturalOrder() as a clear name for what it does. It would be great to have this in BioC-2.13 release.
Thanks Michael and Herve and Kasper for bringing up and solving this cleanly. --t On Sep 17, 2013, at 11:30 AM, Michael Lawrence <lawrence.mich...@gene.com> wrote: > Sounds great. Could we have a more intuitive name? Like naturalOrder(). And > then the method to fix the GRanges could be restoreNaturalOrder(). Just > some suggestions... > > > > > On Tue, Sep 17, 2013 at 10:58 AM, Hervé Pagès <hpa...@fhcrc.org> wrote: > >> For sorting the chromosome names in "natural" order, you can use >> the makeSeqnameIds() helper function in GenomicRanges: >> >> seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(**seqlevels(gr))] >> >> It recognizes several naming conventions (chr-prefixed or not, >> roman, etc...) e.g.: >> >>> makeSeqnameIds(c("Y", "1", "9", "M", "2")) >> [1] 4 1 3 5 2 >> >>> makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII")) >> [1] 4 1 3 5 2 >> >> I still need to improve the documentation of this function, put >> more examples, and add unit tests. It's used internally by the >> makeTranscriptDb* functions in GenomicFeatures to assign internal >> ids to the chromosomes so that all the TranscriptDb extractors >> can then return GRanges and GRangesList objects with the seqlevels >> in the "natural" order. >> >> Cheers, >> H. >> >> >> >> On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote: >> >>> Actually, what I (and I think several others) just want is >>> fixSeqnames() >>> which sorts and fixes them to some convention. I don't really care which >>> one, but I want to do some easy harmonization, preferably in line with >>> other bioc tools. I know this is hard to do for all organisms, but having >>> something that deals with the major ones (human, mouse, fruit fly, yeast, >>> some monkeys) would be extremely valuable. If this function existed, I >>> could just throw all my GRanges at it. >>> >>> Kasper >>> >>> >>> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence < >>> lawrence.mich...@gene.com >>> >>>> wrote: >>> >>> When I hit this I usually just do: >>>> >>>> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y")) >>>> >>>> It would be nice if there were a function for sorting chromosome names >>>> according to a specified naming convention. Like sortSeqlevels(gr, >>>> UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <- >>>> sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch. >>>> >>>> Michael >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey >>>> <st...@channing.harvard.edu>**wrote: >>>> >>>> I am replying to this as Michael mentions that this is a powerful >>>>> operation. Here's my >>>>> problem that I believe is related, but I do not see a straightforward >>>>> solution >>>>> >>>>> bhmm19uni >>>>> GRanges with 559456 ranges and 4 metadata columns: >>>>> seqnames ranges strand | name >>>>> score >>>>> <Rle> <IRanges> <Rle> | <character> >>>>> <numeric> >>>>> 1 chr1 [ 67229025, 67341224] * | 13_Heterochrom/lo >>>>> 0 >>>>> 2 chr1 [203057955, 203060154] * | 12_Repressed >>>>> 0 >>>>> 3 chr1 [ 8458427, 8486226] * | 13_Heterochrom/lo >>>>> 0 >>>>> 4 chr1 [ 16904427, 16904826] * | 5_Strong_Enhancer >>>>> 0 >>>>> 5 chr1 [ 25289427, 25293426] * | 11_Weak_Txn >>>>> 0 >>>>> ... ... ... ... ... ... >>>>> ... >>>>> 570766 chr22 [51100931, 51101330] * | 2_Weak_Promoter >>>>> 0 >>>>> 570767 chr22 [51101331, 51101530] * | 6_Weak_Enhancer >>>>> 0 >>>>> 570768 chr22 [51101531, 51101730] * | 2_Weak_Promoter >>>>> 0 >>>>> 570769 chr22 [51101731, 51178130] * | 13_Heterochrom/lo >>>>> 0 >>>>> 570770 chr22 [51178131, 51178330] * | 15_Repetitive/CNV >>>>> 0 >>>>> thick numcode >>>>> <IRanges> <character> >>>>> 1 [ 67001613, 67113812] 13 >>>>> 2 [201324578, 201326777] 12 >>>>> 3 [ 8381014, 8408813] 13 >>>>> 4 [ 16777014, 16777413] 5 >>>>> 5 [ 25162014, 25166013] 11 >>>>> ... ... ... >>>>> 570766 [49447797, 49448196] 2 >>>>> 570767 [49448197, 49448396] 6 >>>>> 570768 [49448397, 49448596] 2 >>>>> 570769 [49448597, 49524996] 13 >>>>> 570770 [49524997, 49525196] 15 >>>>> --- >>>>> seqlengths: >>>>> chr1 chr10 chr11 chr5 ... chr2 chr21 >>>>> chr4 >>>>> 249250621 135534747 135006516 180915260 ... 243199373 48129895 >>>>> 191154276 >>>>> >>>>> If I try to make a karyogram with ggbio, the chromosomes are in an >>>>> unnatural order. What is the right approach to getting the seqinfo in a >>>>> natural order with respect to chromosome names and lengths? Reassigning >>>>> seqlevels seems dangerous but I haven't experimented fully. It must be >>>>> a >>>>> common use case but I do not see it in doc. >>>>> >>>>> >>>>> >>>>> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence < >>>>> lawrence.mich...@gene.com> wrote: >>>>> >>>>> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org> >>>>>> wrote: >>>>>> >>>>>> Michael, >>>>>>> >>>>>>> >>>>>>> On 05/20/2013 08:44 PM, Michael Lawrence wrote: >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org >>>>>>>> <mailto:hpa...@fhcrc.org>> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> >>>>>>>> On 05/20/2013 03:15 PM, Michael Lawrence wrote: >>>>>>>> >>>>>>>> On Mon, May 20, 2013 at 3:11 PM, Martin Morgan >>>>>>>> <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote: >>>>>>>> >>>>>>>> Hi Michael -- >>>>>>>> >>>>>>>> >>>>>>>> On 5/20/2013 5:56 AM, Michael Lawrence wrote: >>>>>>>> >>>>>>>> While it's cool that seqlevels<- has become so >>>>>>> flexible, >>>>>> >>>>>>> I still claim >>>>>>>> that >>>>>>>> rename/keep/drop would be a lot easier to read, >>>>>>> because >>>> >>>>> they describe the >>>>>>>> high-level operation, and there's no reason to >>>>>>> deprecate >>>>>> >>>>>>> them. They're >>>>>>>> also >>>>>>>> easier to remember. For example, for dropping with >>>>>>>> seqlevels<-, the user >>>>>>>> needs >>>>>>>> to remember that "force" is necessary to drop. Much >>>>>>>> easier to just say >>>>>>>> "dropSeqlevels(), please". And reimplementing >>>>>>>> keepSeqlevels is still too >>>>>>>> complicated. Is it such a maintenance burden to have >>>>>>>> these simple >>>>>>>> wrappers sit >>>>>>>> on top of the low-level manipulators? >>>>>>>> >>>>>>>> Another suggestion: renameSeqlevels should not >>>>>>> require >>>> >>>>> a >>>>>> >>>>>>> named vector (in >>>>>>>> cases >>>>>>>> where it is unnamed, require that it is parallel to >>>>>>> the >>>> >>>>> existing >>>>>>>> seqlevels, as >>>>>>>> with seqlevels<-). >>>>>>>> >>>>>>>> >>>>>>>> I didn't really indicate what drove my desire to see >>>>>>>> keepSeqlevels >>>>>>>> deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq >>>>>>> were >>>>>> >>>>>>> developed more >>>>>>>> or less independently, and have different contracts. The >>>>>>>> different >>>>>>>> contracts are confusing to the user, as is creating a >>>>>>> usable >>>>>> >>>>>>> help system >>>>>>>> when there are multiple end points for what is a common >>>>>>>> operation. The help >>>>>>>> pages of each were inadequate in different ways. And >>>>>>> because >>>>>> >>>>>>> the code was >>>>>>>> developed independently, support for different objects >>>>>>> was >>>> >>>>> inconsistent. So >>>>>>>> actually, yes, the maintenance (and use) burden for the >>>>>>>> previous state of >>>>>>>> affairs was substantial. >>>>>>>> >>>>>>>> On the other hand, I agree that keepSeqlevels is >>>>>>> convenient >>>> >>>>> as a simple >>>>>>>> wrapper around seqlevels<-, in the same way that >>>>>>>> setNames >>>>>>>> and names<- are >>>>>>>> both useful. >>>>>>>> >>>>>>>> So we could iterate to >>>>>>>> >>>>>>>> keepSeqlevels <- >>>>>>>> function(x, value, ...) >>>>>>>> { >>>>>>>> seqlevels(x, force=TRUE) <- value >>>>>>>> x >>>>>>>> } >>>>>>>> >>>>>>>> but I'd be less enthusiastic about maintaining the >>>>>>> original >>>> >>>>> contract of >>>>>>>> keepSeqlevels, where keepSeqlevels(gr1, gr2), would have >>>>>>>> worked for two >>>>>>>> GRanges objects. >>>>>>>> >>>>>>>> >>>>>>>> Why would this be called keepSeqlevels() if, depending on how >>>>>>> it's >>>> >>>>> used, >>>>>>>> it will either add, drop, rename, or permute the seqlevels? >>>>>>>> Couldn't this be called setSeqlevels? >>>>>>>> >>>>>>>> >>>>>>>> I thought that new2old was necessary for permuting. >>>>>>> The seqlevels setter has no 'new2old' arg. >>>>>>> >>>>>>> >>>>>>> You're right that >>>>>>> >>>>>>>> adding should be disallowed for keepSeqlevels(). Adding seqlevels is >>>>>>> not >>>>>> >>>>>>> a common operation. The two common operations are: >>>>>>>> * Permuting, usually because the data were imported in non-canonical >>>>>>>> order (seqnameStyle was designed to address this, no?). >>>>>>> Permuting is straightforward with seqlevels<-: >>>>>>> >>>>>>>> seqlevels(gr) >>>>>>> [1] "chr1" "chr2" "chr3" >>>>>>> >>>>>>>> seqlevels(gr) <- rev(seqlevels(gr)) >>>>>>> >>>>>>>> seqlevels(gr) >>>>>>> [1] "chr3" "chr2" "chr1" >>>>>>> >>>>>>> >>>>>>> * Subsetting, either by keeping or dropping. >>>>>>> >>>>>>> Also straightforward: >>>>>>> >>>>>>>> seqlevels(gr, force=TRUE) <- "chr2" >>>>>>> >>>>>>>> seqlevels(gr) >>>>>>> [1] "chr2" >>>>>>> >>>>>>> >>>>>>> Two main reasons: taking a >>>>>>> >>>>>>>> small slice of the data for prototyping, or removing problematic >>>>>>>> chromosomes (sex, circular). This goes back to bsapply and the >>>>>>> exclude >>>> >>>>> argument. >>>>>>> There are other important use cases: >>>>>>> >>>>>>> - Dropping seqlevels that are *not* in use. This happens for >>>>>>> example >>>>>>> when subsetting a BAM file with Rsamtools::filterBam to keep only >>>>>>> the alignments located on a given chromosome. The entire sequence >>>>>>> dictionary (in the header) is propagated to the resulting BAM >>>>>>> file >>>>>>> so when you read the file with readGAlignments() you end up with >>>>>>> a >>>>>>> bunch of useless seqlevels that you generally start by dropping. >>>>>>> You don't want to drop any alignment so force=FALSE is your >>>>>> friend. >>>> >>>>> >>>>>>> >>>>>>> This is sort of tangential to the discussion, but do you really want >>>>>> to >>>>> do >>>> >>>>> this? I would preserve the universe as given by the BAM. >>>>>> >>>>>> >>>>>> - An even more common operation is renaming: 90% of the times >>>>>>> that's >>>>>>> what I use seqlevels<- for, and that's what I tell people to use >>>>>>> when they need to rename their chromosomes. Also straightforward: >>>>>>> >>>>>>>> seqlevels(gr) >>>>>>> [1] "chr1" "chr2" "chr3" "chrM" >>>>>>> >>>>>>>> seqlevels(gr) <- sub("^chr", "", seqlevels(gr)) >>>>>>>> seqlevels(gr) >>>>>>> [1] "1" "2" "3" "M" >>>>>>> >>>>>>>> seqlevels(gr)[seqlevels(gr) == "M"] <- "MT" >>>>>>>> seqlevels(gr) >>>>>>> [1] "1" "2" "3" "MT" >>>>>>> >>>>>>> Note that this is simpler to use than renameSeqlevels() which >>>>>>> always required you to build the entire right value as a named >>>>>>> vector. >>>>>>> >>>>>>> >>>>>>> Sure, renaming is a common use case. Not sure how I forgot that. >>>>>> >>>>>> As I wrote earlier in the thread, renameSeqlevels should be changed so >>>>> as >>>> >>>>> not to require naming of the vector. >>>>>> >>>>>> >>>>>> So the added-value of keepSeqlevels() seems to boil down to: >>>>>>> >>>>>>> (a) it always uses force=TRUE >>>>>>> >>>>>>> (b) it preserves the original object and returns the modified >>>>>>> object >>>>>>> >>>>>>> If you want to restrict the use of keepSeqlevels() to permuting and >>>>>>> dropping, you'll also have to disallow renaming (in addition to >>>>>> adding). >>>> >>>>> Then its name will more or less reflect what it does (if "keep" means >>>>>>> "permute" or "drop"). The final result will be something that does >>>>>> less >>>> >>>>> than setSeqlevels() and that is not necessarily easier to read: both >>>>>>> will set on the object whatever vector of seqlevels they are supplied, >>>>>>> except that one will fail when doing so would actually result in >>>>>> adding >>>> >>>>> or renaming seqlevels. >>>>>>> >>>>>>> >>>>>>> So it looks like seqlevels<- is pretty powerful. Now I see that if I >>>>>> scroll >>>>>> down to the examples in the man page, I find the actual documentation. >>>>>> It's >>>>>> cool to learn that we can replace seqlevels on TxDb objects. My >>>>>> argument >>>>>> has always been that since these low-level operations are so powerful, >>>>>> it's >>>>>> nice to have high-level operations to clarify the code. We have to >>>>>> think >>>>>> hard to know what the RHS will do in that replacement. It's probably >>>>>> one >>>>>> of >>>>>> the most complex replacement operations; far more complex than the >>>>> typical >>>> >>>>> one, including levels<- on factor. There's nothing wrong with that; it's >>>>>> just complexity that we should be able to hide. >>>>>> >>>>>> Michael >>>>>> >>>>>> H. >>>>>> >>>>>>> >>>>>>> >>>>>>> Michael >>>>>>>> >>>>>>>> >>>>>>>> H. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Well, I wasn't even aware of that feature, so you've made >>>>>>> your >>>> >>>>> point about >>>>>>>> the documentation ;) Sounds like a good compromise to me. >>>>>>>> >>>>>>>> Thanks for understanding, >>>>>>>> Michael >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Martin >>>>>>>> >>>>>>>> >>>>>>>> Michael >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, May 18, 2013 at 6:00 PM, Martin Morgan >>>>>>>> <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org> >>>>>>>> <mailto:mtmor...@fhcrc.org <mailto: >>>>>>> mtmor...@fhcrc.org >>>> >>>>> >>>>>>>> wrote: >>>>>>>> >>>>>>>> On 05/18/2013 05:42 PM, Martin Morgan wrote: >>>>>>>> >>>>>>>> Some of the most common operations are >>>>>>>> straight-forward, e.g., >>>>>>>> >>>>>>>>> gr = GRanges(paste0("chr", c(1:22, >>>>>>>> "X", "Y")), IRanges(1, >>>>>>>> 100)) >>>>>>>>> seqlevels(gr) = sub("chr", "ch", >>>>>>>> seqlevels(gr)) >>>>>>>> >>>>>>>> >>>>>>>> To get a more comprehensive example I should >>>>>>> have >>>> >>>>> followed my own >>>>>>>> advice and >>>>>>>> grabbed from the help page! >>>>>>>> >>>>>>>> ## Rename: >>>>>>>> seqlevels(txdb) <- sub("chr", "", >>>>>>>> seqlevels(txdb)) >>>>>>>> seqlevels(txdb) >>>>>>>> >>>>>>>> seqlevels(txdb) <- paste0("CH", >>>>>>>> seqlevels(txdb)) >>>>>>>> seqlevels(txdb) >>>>>>>> >>>>>>>> seqlevels(txdb)[seqlevels(__**** >>>>>>>> **__txdb) >>>>>>> == >>>> >>>>> >>>>>> >>>>>>>> "CHM"] <- "M" >>>>>>>> >>>>>>>> >>>>>>>> seqlevels(txdb) >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Computational Biology / Fred Hutchinson Cancer >>>>>>>> Research Center >>>>>>>> 1100 Fairview Ave. N. >>>>>>>> PO Box 19024 Seattle, WA 98109 >>>>>>>> >>>>>>>> Location: Arnold Building M1 B861 >>>>>>>> Phone: (206) 667-2793 >>>>>>> <tel:%28206%29%20667-2793> >>>> >>>>> <tel:%28206%29%20667-2793> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Dr. Martin Morgan, PhD >>>>>>>> >>>>>>>> Fred Hutchinson Cancer Research Center >>>>>>>> 1100 Fairview Ave. N. >>>>>>>> PO Box 19024 Seattle, WA 98109 >>>>>>>> >>>>>>>> >>>>>>>> [[alternative HTML version deleted]] >>>>>>>> >>>>>>>> ______________________________****___________________ >>>>>>>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.*** >>>>>>>> *org< >>>>>>> Bioc-devel@r-project.org> >>>>>> >>>>>>> >>>>>>>>> mailing list >>>>>>>> >>>>>>>> https://stat.ethz.ch/mailman/_****_listinfo/bioc-devel<https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel> >>>>>>>> < >>>>>>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel> >>>>>> >>>>>>> >>>>>>>> >>>>>>>> <https://stat.ethz.ch/mailman/****listinfo/bioc-devel<https://stat.ethz.ch/mailman/**listinfo/bioc-devel> >>>>>>>> < >>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>>>>> >>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Hervé Pagès >>>>>>>> >>>>>>>> Program in Computational Biology >>>>>>>> Division of Public Health Sciences >>>>>>>> >>>>>>>> Fred Hutchinson Cancer Research Center >>>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>>> P.O. Box 19024 >>>>>>>> Seattle, WA 98109-1024 >>>>>>>> >>>>>>>> E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org> >>>>>>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >>>>>>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>> Hervé Pagès >>>>>>> >>>>>>> Program in Computational Biology >>>>>>> Division of Public Health Sciences >>>>>>> Fred Hutchinson Cancer Research Center >>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>> P.O. Box 19024 >>>>>>> Seattle, WA 98109-1024 >>>>>>> >>>>>>> E-mail: hpa...@fhcrc.org >>>>>>> Phone: (206) 667-5791 >>>>>>> Fax: (206) 667-1319 >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> >>>>>> ______________________________**_________________ >>>>>> Bioc-devel@r-project.org mailing list >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>>> [[alternative HTML version deleted]] >>>> >>>> >>>> ______________________________**_________________ >>>> Bioc-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> ______________________________**_________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpa...@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel