If you want the input to be chunked by sentence, you can try splitting it at the period character. You can do this with the DelimitedInputFormat by setting its delimiter.
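To illustrate what changing the delimiter does, here is a minimal sketch in plain Java. It is not Flink code: the class `SentenceSplitSketch` and the method `splitSentences` are hypothetical names made up for this example, and `java.util.Scanner` stands in for the input format. It only shows that once records are cut at "." instead of at line breaks, a sentence that spans several lines comes out as one record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical illustration only; not part of Flink.
class SentenceSplitSketch {

    // Cuts the text into records at each '.' rather than at line breaks.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // The delimiter is a regex, so the period must be escaped.
            scanner.useDelimiter("\\.");
            while (scanner.hasNext()) {
                // Line breaks inside a sentence are folded into single spaces.
                String sentence = scanner.next().replaceAll("\\s+", " ").trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String corpus = "The quick brown fox\njumps over the dog. It then\nruns away. Done.";
        for (String sentence : splitSentences(corpus)) {
            System.out.println(sentence);
        }
    }
}
```

Keep in mind that a bare period also cuts at abbreviations and decimal numbers, so this is only a rough approximation of sentence boundaries.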
readTextFile actually uses a special case of the delimited input format that splits at line breaks.

Greetings,
Stephan

On Wed, May 20, 2015 at 2:57 PM, Felix Schüler <fschue...@posteo.de> wrote:

> Hi!
>
> We have implemented a transformer that computes a cooccurrence matrix
> for words within a given window.
> This matrix will then be used for unsupervised learning of vector
> representations for words (we basically implement this:
> http://nlp.stanford.edu/projects/glove/)
>
> Right now, we have implemented the computation of the cooccurrence
> matrix as a sliding window over lines that we get from
> env.readTextFile(...)
> Instead, it would be nice if we could do a sliding window over
> sentences. Until now, we could not figure out how to get sentences that
> (in the worst case) span multiple lines.
>
> Is this somehow possible or would we have to define our own input-format
> for this? The idea is to read a corpus and allow some kind of user
> defined parsing of the text documents (something like CorpusInputFormat
> maybe...?).
>
> Thanks!
> Felix