subhabangal...@gmail.com wrote:
> On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
>> Dear Group,
>>
>> I am Sri Subhabrata Banerjee trying to write from Gurgaon, India to
>> discuss some coding issues. If any one of this learned room can shower
>> some light I would be helpful enough.
>>
>> I got to code a bunch of documents which are combined together.
>> Like,
>>
>> 1)A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing. 2) The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection. 3) A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>>
>> The task is to separate the documents on the fly and to parse each of the
>> documents with a definite set of rules.
>>
>> Now, the way I am processing is:
>> I am clubbing all the documents together, as,
>>
>> A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing.The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection. A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>>
>> But they are separated by a tag set, like,
>> A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing.$ The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection.$ A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>>
>> To detect the document boundaries, I am splitting them into a bag of
>> words and using a simple for loop as,
>>     for i in range(len(bag_words)):
>>         if bag_words[i]=="$":
>>             print (bag_words[i],i)
>>
>> There is no issue. I am segmenting it nicely. I am using annotated corpus
>> so applying parse rules.
>>
>> The confusion comes next,
>>
>> As per my problem statement the size of the file (of documents combined
>> together) won’t increase on the fly. So, just to support all kinds of
>> combinations I am appending in a list the “I” values, taking its length,
>> and using slice. Works perfect. Question is, is there a smarter way to
>> achieve this, and a curious question if the documents are on the fly with
>> no preprocessed tag set like “$” how may I do it? From a bunch without
>> EOF isn’t it a classification problem?
>>
>> There is no question on parsing it seems I am achieving it independent of
>> length of the document.
>>
>> If any one in the group can suggest how I am dealing with the problem and
>> which portions should be improved and how?
>>
>> Thanking You in Advance,
>>
>> Best Regards,
>> Subhabrata Banerjee.
>
>
> Hi Steven, It is nice to see your post. They are nice and I learnt so many
> things from you. "I" is for index of the loop. Now my clarification I
> thought to do "import os" and process files in a loop but that is not my
> problem statement. I have to make a big lump of text and detect one chunk.
> Looping over the line number of file I am not using because I may not be
> able to take the slices-this I need. I thought to give re.findall a try
> but that is not giving me the slices. Slice spreads here. The power issue
> of string! I would definitely give it a try. Happy Day Ahead Regards,
> Subhabrata Banerjee.

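Then use re.finditer():

    import re

    start = 0
    for match in re.finditer(r"\$", data):
        end = match.start()      # index of the "$" separator
        print(start, end)
        print(data[start:end])   # the chunk before this separator
        start = match.end()      # next chunk begins after the "$"

This will omit the last text, because nothing follows the final chunk to
trigger a match. The simplest fix is to put another "$" separator at the
end of your data.

-- 
http://mail.python.org/mailman/listinfo/python-list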
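
If appending an extra "$" to the data is not convenient, the loop can also pick up the trailing text itself. The following is only a sketch: the function name split_documents and the sample string are made up for illustration, and it assumes the separator never occurs inside a document.

```python
import re

def split_documents(data, sep="$"):
    """Return a list of (start, end, text) slices, one per document.

    Hypothetical helper, not part of the original post. Unlike the plain
    finditer loop, it also returns the text after the last separator.
    """
    chunks = []
    start = 0
    # re.escape() is needed because "$" is a regex metacharacter.
    for match in re.finditer(re.escape(sep), data):
        end = match.start()
        chunks.append((start, end, data[start:end]))
        start = match.end()
    # Whatever is left after the final separator is the last document.
    if start < len(data):
        chunks.append((start, len(data), data[start:]))
    return chunks

# Made-up sample data in the same "$"-separated layout.
data = "First document.$ Second document.$ Third document."
for start, end, text in split_documents(data):
    print(start, end, repr(text.strip()))
```

Each tuple carries the slice boundaries, so data[start:end] can be reused later without keeping a separate list of index values.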