On Sep 5, 5:29 pm, per <perfr...@gmail.com> wrote: > On Sep 5, 7:07 pm, "Rhodri James" <rho...@wildebst.demon.co.uk> wrote: > > > > > On Sat, 05 Sep 2009 23:54:08 +0100, per <perfr...@gmail.com> wrote: > > > On Sep 5, 6:42 pm, "Rhodri James" <rho...@wildebst.demon.co.uk> wrote: > > >> On Sat, 05 Sep 2009 22:54:41 +0100, per <perfr...@gmail.com> wrote: > > >> > I'm trying to efficiently "split" strings based on what substrings > > >> > they are made up of. > > >> > i have a set of strings that are comprised of known substrings. > > >> > For example, a, b, and c are substrings that are not identical to each > > >> > other, e.g.: > > >> > a = "0" * 5 > > >> > b = "1" * 5 > > >> > c = "2" * 5 > > > >> > Then my_string might be: > > > >> > my_string = a + b + c > > > >> > i am looking for an efficient way to solve the following problem. > > >> > suppose i have a short > > >> > string x that is a substring of my_string. I want to "split" the > > >> > string x into blocks based on > > >> > what substrings (i.e. a, b, or c) chunks of s fall into. > > > >> > to illustrate this, suppose x = "00111". Then I can detect where x > > >> > starts in my_string > > >> > using my_string.find(x). But I don't know how to partition x into > > >> > blocks depending > > >> > on the substrings. What I want to get out in this case is: "00", > > >> > "111". If x were "001111122", > > >> > I'd want to get out "00","11111", "22". > > > >> > is there an easy way to do this? i can't simply split x on a, b, or c > > >> > because these might > > >> > not be contained in x. I want to avoid doing something inefficient > > >> > like looking at all substrings > > >> > of my_string etc. > > > >> > i wouldn't mind using regular expressions for this but i cannot think > > >> > of an easy regular > > >> > expression for this problem. I looked at the string module in the > > >> > library but did not see > > >> > anything that seemd related but i might have missed it. > > > >> I'm not sure I understand your question exactly. You seem to imply > > >> that the order of the substrings of x is consistent. If that's the > > >> case, this ought to help: > > > >> >>> import re > > >> >>> x = "001111122" > > >> >>> m = re.match(r"(0*)(1*)(2*)", x) > > >> >>> m.groups() > > > >> ('00', '11111', '22')>>> y = "00111" > > >> >>> m = re.match(r"(0*)(1*)(2*)", y) > > >> >>> m.groups() > > > >> ('00', '111', '') > > > >> You'll have to filter out the empty groups for yourself, but that's > > >> no great problem. > > > > The order of the substrings is consistent but what if it's not 0, 1, 2 > > > but a more complicated string? e.g. > > > > a = 1030405, b = 1babcf, c = fUUIUP > > > > then the substring x might be 4051ba, in which case using a regexp > > > with (1*) will not work since both a and b substrings begin with the > > > character 1. > > > Right. This looks approximately nothing like what I thought your > > problem was. Would I be right in thinking that you want to match > > substrings of your potential "substrings" against the string x? > > > I'm sufficiently confused that I think I'd like to see what your > > use case actually is before I make more of a fool of myself. > > > -- > > Rhodri James *-* Wildebeest Herder to the Masses > > it's exactly the same problem, except there are no constraints on the > strings. so the problem is, like you say, matching the substrings > against the string x. in other words, finding out where x "aligns" to > the ordered substrings abc, and then determine what chunk of x belongs > to a, what chunk belongs to b, and what chunk belongs to c. > > so in the example i gave above, the substrings are: a = 1030405, b = > 1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP > > given a substring like 4051ba, i'd want to split it into the chunks a, > b, and c. in this case, i'd want the result to be: ["405", "1ba"] -- > i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk > that belongs to be. in this case, there are no chunks of c. if x > instead were "4051babcffUU", the right output is: ["405", "1babcf", > "fUU"], which are the corresponding chunks of a, b, and c that make up > x respectively. > > i'm not sure how to approach this. any ideas/tips would be greatly > appreciated. thanks again.
a = "1030405" b = "1babcf" c = "fUUIUP" abc = "10304051babcffUUIUP" data = "4051babcffU" data_start = abc.find(data) b_start = abc.find(b) - data_start c_start = abc.find(c) - data_start print data[:b_start] print data[b_start:c_start] print data[c_start:] --output:-- 405 1babcf fU -- http://mail.python.org/mailman/listinfo/python-list