Re: Byte Offsets of Tokens, Ngrams and Sentences?

Muhammad Adeel Fri, 06 Aug 2010 03:18:28 -0700

On Aug 6, 10:49 am, "Gabriel Genellina" <[email protected]>
wrote:
> On Fri, 06 Aug 2010 06:07:32 -0300, Muhammad Adeel <[email protected]>
> wrote:
>
> > Does anyone know how to tokenize a string in Python so that it returns
> > the tokens with their byte offsets? Also, is there a sentence splitter
> > that returns the sentences with their byte offsets? Finally, I need
> > n-grams returned with byte offsets.
>
> > Input:
> > This is a string.
>
> > Output:
> > This     0
> > is       5
> > a        8
> > string.  10
>
> Like this?
>
> py> import re
> py> s = "This is a string."
> py> for g in re.finditer(r"\S+", s):
> ...   # g.start() is the offset of the token within s
> ...   print(g.group(), g.start())
> ...
> This 0
> is 5
> a 8
> string. 10
>
> --
> Gabriel Genellina


Hi,

Thanks. Can you please tell me how to do this for n-grams and sentences
as well?
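
For what it's worth, the re.finditer approach quoted above seems to
extend to both cases. What follows is only a minimal sketch under some
assumptions, not a definitive answer: the function names are made up for
illustration, the sentence splitter is a naive regex that ends a sentence
at every '.', '!' or '?' (so abbreviations such as "Dr." will be split
wrongly), and all offsets are character offsets. Those equal byte offsets
only for ASCII text; the hypothetical byte_offset helper converts them,
assuming UTF-8.

```python
import re


def ngrams_with_offsets(text, n):
    """Return (ngram, start) pairs built from whitespace-delimited tokens."""
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    result = []
    for i in range(len(spans) - n + 1):
        start = spans[i][0]          # start of the first token in the window
        end = spans[i + n - 1][1]    # end of the last token in the window
        result.append((text[start:end], start))
    return result


def sentences_with_offsets(text):
    """Return (sentence, start) pairs using a naive punctuation-based split."""
    # A sentence starts at a non-space character, runs up to the next
    # terminator, and includes any run of terminators that follows.
    pattern = r"\S[^.!?]*[.!?]*"
    return [(m.group().rstrip(), m.start()) for m in re.finditer(pattern, text)]


def byte_offset(text, char_offset, encoding="utf-8"):
    """Convert a character offset into a byte offset for the given encoding."""
    return len(text[:char_offset].encode(encoding))


if __name__ == "__main__":
    s = "This is a string. Here is another one! Done?"
    for gram, pos in ngrams_with_offsets("This is a string.", 2):
        print(gram, pos)
    for sent, pos in sentences_with_offsets(s):
        print(sent, pos, byte_offset(s, pos))
```

Slicing the original string between the first token's start and the last
token's end keeps the original spacing inside each n-gram, so the offset
and the text always agree. For real sentence splitting, a trained
tokenizer will likely do better than this regex.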
-- 
http://mail.python.org/mailman/listinfo/python-list