Vlastimil Brom wrote:
2008/12/8 Robocop <[EMAIL PROTECTED]>:
I'm having a little text parsing problem that i think would be really
quick to troubleshoot for someone more versed in python and Regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words (but ultimately every parsed string should be 50 characters,
...

Hi, I'm not sure I understand the task completely, but maybe some of
the variants below using re may help (depending on what should be done
further with the resulting text segments).
In the first two possibilities the resulting lines are 50 characters
long + 1 for "\n"; 49 could be used instead if the newline has to fit
within the 50.


import re

input_txt = """I'm having a little text parsing problem that i think
would be really
quick to troubleshoot for someone more versed in python and Regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words (but ultimately every parsed string should be 50 characters, so
adding in white spaces is necessary).  So i immediately came up with
something along the lines of:"""

# print re.sub(r"((?s).{1,50}\b)", lambda m: m.group().ljust(50) + "\n",
#              input_txt)  # re.sub using a function

I also thought of r"(.{1,50}\b)", but then I realised that there's a subtle problem: it says that the captured text should end on a word boundary, when, in fact, we just don't want it to split within a word. It would still be acceptable if it split between 2 non-word characters. Aargh! :-)
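
One way to say "don't split inside a word" directly, rather than "end on a word boundary", is to allow a cut wherever the character before it or the character after it is not a word character. The following is only a rough sketch in Python 3 syntax: the helper name split_chunks and its strict_boundary switch are made up here for the comparison and are not from the thread, and a single word longer than the width is not handled (such text would simply be skipped).

```python
import re

def split_chunks(text, width=50, strict_boundary=False):
    """Split text into chunks of at most `width` characters.

    strict_boundary=True  -> every chunk must end on \\b (the thread's pattern).
    strict_boundary=False -> a chunk may end anywhere except between two
                             word characters.
    """
    if strict_boundary:
        tail = r"\b"
    else:
        # Zero-width test: previous char is not \w, OR next char is not \w.
        tail = r"(?:(?<!\w)|(?!\w))"
    # (?s) is placed at the very start so "." also matches newlines.
    pattern = re.compile(r"(?s).{1,%d}%s" % (width, tail))
    return [m.group() for m in pattern.finditer(text)]

# A small width makes the difference visible.
print(split_chunks("ab, cd", 3, strict_boundary=True))
print(split_chunks("ab, cd", 3))
```

With the strict version the comma and space end up in a chunk of their own, because the position between "," and " " is not a word boundary; the relaxed version is free to cut there.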

# adjusting the matches via finditer
# for m in re.finditer(r"((?s).{1,50}\b)", input_txt):
#     print m.group().ljust(50)

print [chunk.ljust(50) for chunk in re.findall(r"((?s).{1,50}\b)",
input_txt)] # adjusting the matched parts in findall
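
If a regex is not a hard requirement, the standard library's textwrap module may be worth a look as an alternative. This is a different technique from the re variants above, shown only as a sketch in Python 3 syntax; the function name wrap_and_pad and the sample string are made up for illustration.

```python
import textwrap

def wrap_and_pad(text, width=50):
    """Wrap text at whitespace and pad every line to exactly `width` chars."""
    # break_on_hyphens=False keeps hyphenated words together.
    # Words longer than `width` are still cut, because break_long_words
    # defaults to True.
    lines = textwrap.wrap(text, width=width, break_on_hyphens=False)
    return [line.ljust(width) for line in lines]

sample = ("I need to write a simple script that parses some arbitrarily long "
          "string every 50 characters, and does not parse text in the middle "
          "of words")

for line in wrap_and_pad(sample):
    print(repr(line))
```

Each printed line is padded with spaces on the right, so the quotes from repr() show where the 50 characters end.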

--
http://mail.python.org/mailman/listinfo/python-list
