The following module was proposed for inclusion in the Module List:

  modid:       Lingua::EN::Tokenizer::Offsets
  DSLIP:       bdpfp
  description: Finds word (token) boundaries, and returns t
  userid:      ANDREFS (André Fernandes dos Santos)
  chapterid:   11 (String_Lang_Text_Proc)
  communities: http://github.com/andrefs/Lingua-EN-Sentence-Offsets/issues
  similar:     Lingua::FreeLing3::Tokenizer

  rationale:
    Tokenizer (word splitter) for English with a twist: it does for
    tokens what Lingua::EN::Sentence::Offsets does for sentences.

    Most tokenizers return either:
      - the original text with forced spacing between tokens, or
      - some kind of array holding the tokens.

    This module was primarily developed to instead return a list of
    start-end offset pairs, one pair per token. This makes it possible
    to know where each token starts and ends without actually
    splitting the text.

  enteredby:   ANDREFS (André Fernandes dos Santos)
  enteredon:   Sun Jun 3 00:51:05 2012 GMT

The resulting entry would be:

  Lingua::EN::Tokenizer::
  ::Offsets        bdpfp Finds word (token) boundaries, and returns t ANDREFS

Thanks for registering,
--
The PAUSE

PS: The following links are only valid for module list maintainers:

  Registration form with editing capabilities:
    https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d0b00000_09d9564b03957820&SUBMIT_pause99_add_mod_preview=1

  Immediate (one click) registration:
    https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d0b00000_09d9564b03957820&SUBMIT_pause99_add_mod_insertit=1

  Peek at the current permissions:
    https://pause.perl.org/pause/authenquery?pause99_peek_perms_by=me&pause99_peek_perms_query=Lingua%3A%3AEN%3A%3ATokenizer%3A%3AOffsets
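
[Editor's note: a minimal Perl sketch of the offsets idea described in the
rationale above. It is illustrative only; the function name
token_offsets_sketch and the \S+ token pattern are assumptions, not the
module's confirmed API or tokenization rules.]

    #!/usr/bin/env perl
    # Sketch: collect [start, end] character offsets for each token
    # instead of splitting the text. The original string is left intact;
    # tokens are recovered on demand with substr().
    use strict;
    use warnings;

    sub token_offsets_sketch {
        my ($text) = @_;
        my @offsets;
        while ($text =~ /\S+/g) {
            push @offsets, [ $-[0], $+[0] ];   # start/end of the match
        }
        return \@offsets;
    }

    my $text    = "Hello, offset-based world!";
    my $offsets = token_offsets_sketch($text);

    for my $pair (@$offsets) {
        my ($start, $end) = @$pair;
        printf "[%d,%d] %s\n", $start, $end,
               substr($text, $start, $end - $start);
    }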