On Aug 6, 10:49 am, "Gabriel Genellina" <gagsl-...@yahoo.com.ar> wrote:
> On Fri, 06 Aug 2010 06:07:32 -0300, Muhammad Adeel <nawabad...@gmail.com>
> wrote:
>
> > Does anyone know how to tokenize a string in Python so that it returns
> > the tokens together with their byte offsets? Also, a sentence splitter
> > that returns the sentences and their byte offsets? And finally, n-grams
> > returned with byte offsets.
> >
> > Input:
> > This is a string.
> >
> > Output:
> > This 0
> > is 5
> > a 8
> > string. 10
>
> Like this?
>
> py> import re
> py> s = "This is a string."
> py> for g in re.finditer(r"\S+", s):
> ...     print(g.group(), g.start())
> ...
> This 0
> is 5
> a 8
> string. 10
>
> --
> Gabriel Genellina
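
A note on "byte offsets": on a Python 3 `str`, `g.start()` is a character index, which equals the byte offset only while the text is pure ASCII. The following is a minimal sketch, not something posted in the thread, of how UTF-8 byte offsets could be reported alongside the character offsets. The function name `tokens_with_byte_offsets` and the sample sentence are made up for illustration, and UTF-8 is an assumption about the input encoding.

```python
import re

def tokens_with_byte_offsets(text):
    """Yield (token, char_offset, byte_offset) for whitespace-separated tokens.

    Assumes the text will be stored as UTF-8; other encodings would need
    their own length calculation.
    """
    byte_pos = 0   # UTF-8 bytes that precede character index last_char
    last_char = 0  # character index up to which bytes have been counted
    for m in re.finditer(r"\S+", text):
        # Count the bytes between the previous token start and this one.
        byte_pos += len(text[last_char:m.start()].encode("utf-8"))
        last_char = m.start()
        yield m.group(), m.start(), byte_pos

# The "ñ" takes two bytes in UTF-8, so the offsets diverge after the first token.
for tok, char_off, byte_off in tokens_with_byte_offsets("Señor is a string."):
    print(tok, char_off, byte_off)
```

For the ASCII-only input from the original question both offsets are identical, so the simpler `re.finditer` loop is enough there.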
Hi,

Thanks. Could you please tell me how to do the same for n-grams and sentences as well?

--
http://mail.python.org/mailman/listinfo/python-list
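
One possible way to extend the `re.finditer` approach to n-grams and sentences is sketched below. This is not an answer from the thread: the helper names `ngrams` and `sentences` are hypothetical, Python 3 is assumed, and the sentence rule (a sentence ends at a run of `.`, `!` or `?`) is deliberately naive, so abbreviations such as "e.g." will be split incorrectly. The offsets are character offsets, which match byte offsets for ASCII text.

```python
import re

def ngrams(text, n):
    """Return (ngram, start) pairs, where each n-gram is a slice of the
    original text covering n consecutive whitespace-separated tokens."""
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    result = []
    for i in range(len(spans) - n + 1):
        start = spans[i][0]        # start of the first token in the window
        end = spans[i + n - 1][1]  # end of the last token in the window
        result.append((text[start:end], start))
    return result

def sentences(text):
    """Return (sentence, start) pairs using a naive rule: a sentence starts
    at a non-space character and runs to the next run of . ! ? (or to the
    end of the text if no terminator follows)."""
    pattern = r"[^\s.!?][^.!?]*(?:[.!?]+|$)"
    return [(m.group().rstrip(), m.start()) for m in re.finditer(pattern, text)]

s = "This is a string."
for gram, offset in ngrams(s, 2):
    print(gram, offset)
# This is 0
# is a 5
# a string. 8

for sent, offset in sentences("This is a string. Here is another one!"):
    print(sent, offset)
# This is a string. 0
# Here is another one! 18
```

Slicing the original string, rather than joining tokens with a single space, keeps each n-gram consistent with its reported offset even when tokens are separated by several spaces or newlines. For real sentence segmentation a dedicated tokenizer would likely be more robust than this regular expression.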