On Oct 10, 2010, at 11:35 AM, Martin Morgan wrote:

On 10/10/2010 07:11 AM, David Winsemius wrote:

On Oct 10, 2010, at 9:27 AM, Lorenzo Isella wrote:


I already offered the Biostrings package. It provides more robust
methods for string matching than does grepl. Is there a reason that you
chose not to use it?


Indeed that is the way I should go, and I have installed the
package after some struggling.

For me it was a matter of waiting. The only struggle was coming from my
inner timer saying it was taking too long.

Since Biostrings is a fairly complex package and I need only a way to
check whether a certain string A is a substring of string B, do you know
which Biostrings functions achieve this?
I see a lot of methods for biological (DNA, RNA) sequences, and they
may not apply to my series (which are definitely not from biology).
Cheers

It appeared to me that the function matchPattern should replace your
grepl invocation that was failing. It returns a more complex structure,
so you would need to determine what would be an exact replacement for
grepl(...) != 1. Looks like a no-match event results in the start and
end items being of length 0.

str(  matchPattern("A", BString("BBB")) )
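
As a rough sketch of how the result could be reduced to a single TRUE/FALSE
comparable to what grepl gives, assuming Biostrings is installed and loaded.
The helper name has_match is made up for illustration and is not part of
the package:

```r
library(Biostrings)

# Hypothetical helper: TRUE when `pattern` occurs somewhere in `subject`.
has_match <- function(pattern, subject) {
    hits <- matchPattern(pattern, BString(subject))
    length(hits) > 0   # zero views in the result means no match was found
}

has_match("A", "BBB")     # no "A" anywhere in "BBB"
has_match("BB", "ABBA")   # "BB" occurs inside "ABBA"
```

Negating the result would give the counterpart of the grepl(...) != 1 test.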

A couple of things from this thread.

To install a Bioconductor package follow directions here

 http://bioconductor.org/install/index.html#install-bioconductor-packages

which leads to

  source("http://bioconductor.org/biocLite.R")
  biocLite("Biostrings")

biocLite is just a wrapper around install.packages with appropriate
repositories defined.
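
On current Bioconductor releases biocLite has been retired in favour of the
BiocManager package from CRAN. A rough present-day equivalent of the two
lines above, assuming a reasonably recent R installation and network access,
would be:

```r
# Install the BiocManager helper from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Then install the Bioconductor package itself
BiocManager::install("Biostrings")
```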

Some Bioconductor packages are relatively mature and make relatively
advanced use of S4 classes, so looking at str() is not that helpful --
the way the user is meant to interact with the object is different from
the way the object is implemented. So the best bet is to look at the
relevant help pages

 result = matchPattern("A", BString("BBB"))
 class(result)
 class?XStringViews

The above was the most surprising example for me (not being particularly S4-savvy). Looks like it parses as:
`?`(class, XStringViews)

Is that an S4 sort of extension for accessing documentation or have I just missed a more general method? I tried looking at the help Index for the "methods" package.
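
One way to check that parse from the console, using only base R. Nothing is
evaluated here, so Biostrings does not need to be loaded:

```r
# Capture the expression without evaluating it
expr <- quote(class?XStringViews)

as.list(expr)    # the function `?` followed by its two arguments
length(expr)     # the function plus the type and the topic

# Compare with the explicit prefix form of the same call
identical(expr, quote(`?`(class, XStringViews)))
```

The type?topic form is described on the help page for `?` itself, see
help("?"). As far as I can tell, for type "class" it ends up looking for a
topic named "XStringViews-class", so help("XStringViews-class") should be
another route to the same page.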


and the help pages referenced there, or from which XStringViews inherits

  getClass("XStringViews")

and in particular

  class?Ranges

Rather than accessing the 'start' slot, use start(result). Vignettes are
used heavily in Bioconductor packages, and in particular

  browseVignettes("Biostrings")

pops up a page with several relevant vignettes, e.g., 'A short
presentation of the basic classes...' and perhaps 'Pairwise Sequence
Alignment'. These are also accessible on the Bioconductor web site,
e.g., on the pages linked from

 http://bioconductor.org/help/bioc-views/release/bioc/

The rule of thumb hinted at below is that an operation that seems to be
taking longer than it should probably indicates that the function is
being invoked in an inefficient way. If the documentation is opaque, then
the place to seek additional help is definitely the Bioconductor
mailing list

 http://bioconductor.org/help/mailing-list/

Hope this helps.

Martin


Formal class 'XStringViews' [package "Biostrings"] with 7 slots
 ..@ subject :Formal class 'BString' [package "Biostrings"] with 6 slots
 .. .. ..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
 .. .. .. .. ..@ xp                    :<externalptr>
 .. .. .. .. ..@ .link_to_cached_object:<environment: 0x11e0e59f8>
 .. .. ..@ offset         : int 0
 .. .. ..@ length         : int 3
 .. .. ..@ elementMetadata: NULL
 .. .. ..@ elementType    : chr "ANY"
 .. .. ..@ metadata       : list()
 ..@ start          : int(0)
 ..@ width          : int(0)
 ..@ NAMES          : NULL
 ..@ elementMetadata: NULL
 ..@ elementType    : chr "integer"
 ..@ metadata       : list()

Perhaps:

length(matchPattern(fut_string, past_string)@start ) == 0

You do need to use BString() on at least the past_string argument and
maybe the fut_string as well. The Bioconductor Mailing List would have a
larger audience with experience using this package, so they should
probably be your next avenue for advice. I am just reading the help
pages as you should be able to do. The help page
help("lowlevel-matching") should probably be reviewed since there may be
efficiency issues to consider as mentioned below.
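
The same check can be written with the accessor functions rather than by
reaching into the slot with @. This is only a sketch: the two strings are
made-up example data, and countPattern is shown as one possible shortcut
when only the number of matches is needed.

```r
library(Biostrings)

past_string <- BString("the quick brown fox")   # made-up example data
fut_string  <- "quick"                          # made-up example data

hits <- matchPattern(fut_string, past_string)

start(hits)                            # start position of each match
no_match <- length(start(hits)) == 0   # TRUE only when nothing matched

# countPattern() returns just the number of matches
n_hits <- countPattern(fut_string, past_string)
```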

When dropped into your function with the BString coercion, it replicated
your small example results and did not crash after a long period with
your larger example, so I then terminated it and inserted a "reporter"
line to monitor progress. With that reporter I got up into the 200's for
count_len without error. My laptop CPU was warming up the case and I was
getting sleepy, so I terminated the process. (I had no way of checking
for accuracy, even if I had let it proceed, since you did not offer a
"correct" answer.)

By the way, the construct ... grepl(. , .) != 1 ... is perhaps
inefficient. It could be expressed more compactly as ... !grepl(. , .),
which would avoid coercing the logicals to numeric for the comparison.
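
A small base R illustration that the two forms give the same answer. The
vector x is made-up example data:

```r
x <- c("abc", "xyz")                  # made-up example data
hit <- grepl("b", x, fixed = TRUE)    # logical vector, one element per string

compared <- hit != 1   # logical is coerced to numeric before the comparison
negated  <- !hit       # stays logical throughout

identical(compared, negated)
```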



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

