On 10/10/2010 11:00 AM, David Winsemius wrote:
>
> On Oct 10, 2010, at 11:35 AM, Martin Morgan wrote:
>
>> On 10/10/2010 07:11 AM, David Winsemius wrote:
>>>
>>> On Oct 10, 2010, at 9:27 AM, Lorenzo Isella wrote:
>>>
>>>>
>>>>> I already offered the Biostrings package. It provides more robust
>>>>> methods for string matching than does grepl. Is there a reason that
>>>>> you choose not to?
>>>>>
>>>>
>>>> Indeed that is the way I should go, and I have installed the
>>>> package after some struggling.
>>>
>>> For me it was a matter of waiting. The only struggle was coming from my
>>> inner timer saying it was taking too long.
>>>
>>>> Since Biostrings is a fairly complex package and I need only a way to
>>>> check if a certain string A is a subset of string B, do you know the
>>>> Biostrings functions to achieve this?
>>>> I see a lot of methods for biological (DNA, RNA) sequences, and they
>>>> may not apply to my series (which are definitely not from biology).
>>>> Cheers
>>>
>>> It appeared to me that the function matchPattern should replace your
>>> grepl invocation that was failing. It returns a more complex structure,
>>> so you would need to determine what would be an exact replacement for
>>> grepl(...) != 1. Looks like a no-match event results in the start and
>>> width items being of length 0.
>>>
>>> > str( matchPattern("A", BString("BBB")) )
>>
>> A couple of things from this thread.
>>
>> To install a Bioconductor package follow directions here
>>
>> http://bioconductor.org/install/index.html#install-bioconductor-packages
>>
>> which leads to
>>
>> source("http://bioconductor.org/biocLite.R")
>> biocLite("Biostrings")
>>
>> biocLite is just a wrapper around install.packages with appropriate
>> repositories defined.
>>
>> Some Bioconductor packages are relatively mature and make relatively
>> advanced use of S4 classes, so looking at str() is not that helpful --
>> the way the user is meant to interact with the object is different from
>> the way the object is implemented. So the best bet is to look at the
>> relevant help pages
>>
>> result = matchPattern("A", BString("BBB"))
>> class(result)
>> class?XStringViews
>
> The above was the most surprising example for me (not being particularly
> S4-savvy). Looks like it parses as:
> `?`(class, XStringViews)
Similarly, ?"XStringViews-class"

> Is that an S4 sort of extension for accessing documentation or have I
> just missed a more general method? I tried looking at the help Index for
> the "methods" package.

?"?" documents type?topic. It is more general, in that package?stats
takes one to the 'stats' topic amongst the 'package' doc-type help
pages. It relies on package authors choosing appropriate docTypes for
their man pages.

One S4 paradigm that can be useful is the analog of methods(class="lm"),
which is showMethods(classes="XStringViews", where="package:Biostrings").

Martin

>
>>
>> and the help pages referenced there, or from which XStringViews inherits
>>
>> getClass("XStringViews")
>>
>> and in particular
>>
>> class?Ranges
>>
>> Rather than accessing the 'start' slot, use start(result). Vignettes are
>> used heavily in Bioconductor packages, and in particular
>>
>> browseVignettes("Biostrings")
>>
>> pops up a page with several relevant vignettes, e.g., 'A short
>> presentation of the basic classes...' and perhaps 'Pairwise Sequence
>> Alignment'. These are also accessible on the Bioconductor web site,
>> e.g., on the pages linked from
>>
>> http://bioconductor.org/help/bioc-views/release/bioc/
>>
>> The rule of thumb hinted at below -- that an operation seems to be
>> taking longer than it should -- probably indicates that the function is
>> being invoked in an inefficient way. If the documentation is opaque then
>> definitely the place to seek additional help is on the Bioconductor
>> mailing list
>>
>> http://bioconductor.org/help/mailing-list/
>>
>> Hope this helps.
>>
>> Martin
>>
>>
>>> Formal class 'XStringViews' [package "Biostrings"] with 7 slots
>>>   ..@ subject        :Formal class 'BString' [package "Biostrings"] with 6 slots
>>>   .. .. ..@ shared         :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
>>>   .. .. .. .. ..@ xp                    :<externalptr>
>>>   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x11e0e59f8>
>>>   .. .. ..@ offset         : int 0
>>>   .. .. ..@ length         : int 3
>>>   .. .. ..@ elementMetadata: NULL
>>>   .. .. ..@ elementType    : chr "ANY"
>>>   .. .. ..@ metadata       : list()
>>>   ..@ start          : int(0)
>>>   ..@ width          : int(0)
>>>   ..@ NAMES          : NULL
>>>   ..@ elementMetadata: NULL
>>>   ..@ elementType    : chr "integer"
>>>   ..@ metadata       : list()
>>>
>>> Perhaps:
>>>
>>> length(matchPattern(fut_string, past_string)@start ) == 0
>>>
>>> You do need to use BString() on at least the past_string argument and
>>> maybe the fut_string as well. The Bioconductor Mailing List would have a
>>> larger audience with experience using this package, so they should
>>> probably be your next avenue for advice. I am just reading the help
>>> pages as you should be able to do. The help page
>>> help("lowlevel-matching") should probably be reviewed since there may be
>>> efficiency issues to consider as mentioned below.
>>>
>>> When dropped into your function with the BString coercion, it replicated
>>> your small example results and did not crash after a long period with
>>> your larger example, so I then terminated it and inserted a "reporter"
>>> line to monitor progress. With that reporter I got up into the 200's for
>>> count_len without error. My laptop CPU was warming up the case and I was
>>> getting sleepy so I terminated the process. (I had no way of checking
>>> for accuracy, even if I had let it proceed, since you did not offer a
>>> "correct" answer.)
>>>
>>> By the way, the construct ... grepl(. , .) != 1 ... is perhaps
>>> inefficient. It could more compactly be expressed as
>>> ... !grepl(. , .) which would not be doing coercion of logicals to
>>> integers.
>>>
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
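
For anyone who only needs the yes/no question from this thread (does string A occur anywhere in string B?), here is a minimal sketch, not a definitive implementation. The helper name no_match is hypothetical and does not appear in the thread. It assumes countPattern() from Biostrings when that package is installed, and otherwise falls back to base R's grepl() with fixed = TRUE, which treats the pattern as a literal string rather than a regular expression.

```r
# Hypothetical helper (not from the thread): TRUE when `pattern` does not
# occur anywhere in `subject`, i.e. the "no-match event" discussed above.
no_match <- function(pattern, subject) {
  if (requireNamespace("Biostrings", quietly = TRUE)) {
    # countPattern() returns the number of matches of pattern in subject;
    # the subject is wrapped in BString() as suggested in the thread.
    Biostrings::countPattern(pattern, Biostrings::BString(subject)) == 0
  } else {
    # Base-R fallback: fixed = TRUE matches the pattern literally.
    # !grepl(...) is used instead of grepl(...) != 1, as suggested above.
    !grepl(pattern, subject, fixed = TRUE)
  }
}

no_match("A", "BBB")   # the no-match case from the thread: TRUE
no_match("BB", "BBB")  # "BB" occurs in "BBB": FALSE
```

The result is already a logical value, so it can go straight into if() without a comparison against 1.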