I was investigating this a little further, and on the JFlex mailing list I found [1].
I don't know much about flex / JFlex, but it seems that this guy resets zzBuffer to 16384 chars or less when setting the input for the lexer. Quoting shef <she...@ya...>:

    I set %buffer 0 in the options section, and then added this method to the
    lexer:

    /**
     * Set the input for the lexer. The size parameter really speeds things up,
     * because by default the lexer allocates an internal buffer of 16k. For
     * most strings this is unnecessarily large. If the size param is 0 or
     * greater than 16k, the buffer is set to 16k; if it is smaller, the
     * buffer is set to the exact size.
     * @param r the reader that provides the data
     * @param size the size of the data in the reader
     */
    public void reset(Reader r, int size) {
        if (size == 0 || size > 16384) {
            size = 16384;
        }
        zzBuffer = new char[size];
        yyreset(r);
    }

So maybe zzBuffer can be trimmed the same way(?). BTW, I will try to find out which is the "big token" in my dataset this afternoon. Thanks for the help.

For the time being I work around this memory problem by wrapping the IndexWriter in a class that periodically closes it and creates a new one, allowing the old one to be GCed, but it would be really good if either JFlex or Lucene could take care of this zzBuffer going berserk.

Again, thanks for the quick response.

/Rubén

[1] https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422...@web38901.mail.mud.yahoo.com

On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <ser...@gmail.com> wrote:
> If we could change the Flex file so that yyreset(Reader) checked the size
> of zzBuffer, we could trim it when it gets too big. But I don't think we
> have that kind of control when writing the flex syntax ... yyreset is
> generated by JFlex, and it's the only place I can think of to trim the
> buffer down when it exceeds a predefined threshold.
>
> Maybe what we can do is create our own method, called by StandardTokenizer
> after yyreset, something like trimBufferIfTooBig(int threshold), which
> reallocates zzBuffer if it exceeds the threshold. We can decide on a
> reasonable 64 KB threshold or something, or simply always cut back to
> 16 KB. As far as I understand, that buffer should never grow that much:
> in zzRefill, the only place where the buffer gets resized, there is an
> attempt to first move back characters that were already consumed, and only
> then is a bigger buffer allocated. That means the buffer is expanded only
> if a single token is larger than 16 KB (!?).
>
> A trimBuffer method might not be that bad as a protective measure. What do
> you think? Of course, JFlex can fix it on their own ... but until that
> happens ...
>
> Shai
>
> On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>
> > > I would also like to identify the problematic document. I have 10000
> > > or so; what would be the best way of identifying the one that is
> > > making zzBuffer grow without control?
> >
> > Don't index your documents; instead, pass them directly to the analyzer
> > and consume the token stream manually. Then check
> > TermAttribute.termLength() for each token.
>

--
/Rubén
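To make Uwe's suggestion concrete, here is a minimal sketch of consuming a token stream manually and recording the longest token, assuming the Lucene 3.x attribute API (TermAttribute, as quoted above); the class and method names are invented for the example:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class BigTokenFinder {

    // Returns the length of the longest token the analyzer produces for this
    // text, without indexing anything.
    public static int maxTokenLength(Analyzer analyzer, String field, String text)
            throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        int max = 0;
        while (ts.incrementToken()) {
            max = Math.max(max, term.termLength());
        }
        ts.end();
        ts.close();
        return max;
    }

    public static void main(String[] args) throws IOException {
        Analyzer a = new StandardAnalyzer(Version.LUCENE_30);
        System.out.println(maxTokenLength(a, "body", "a few short tokens"));
    }
}

Running each of the documents through something like this should point at the one with the oversized token that makes zzBuffer grow.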
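Shai's trimBufferIfTooBig idea might look roughly like the following inside the JFlex-generated lexer (StandardTokenizerImpl in Lucene's case). This is a hypothetical sketch, not generated code: zzBuffer and the 16 KB default come from JFlex, while the method and the DEFAULT_BUFFER_SIZE constant are assumptions.

// Hypothetical method added to the JFlex-generated lexer; zzBuffer is the
// generated buffer field, 16384 is JFlex's default buffer size.
private static final int DEFAULT_BUFFER_SIZE = 16384;

void trimBufferIfTooBig(int threshold) {
    if (zzBuffer.length > threshold) {
        // Reallocate so one huge token does not pin a huge buffer forever.
        zzBuffer = new char[DEFAULT_BUFFER_SIZE];
    }
}

StandardTokenizer could then call this right after yyreset(Reader), e.g. with the 64 KB threshold Shai mentions.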
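The IndexWriter-recycling workaround described above could be as simple as the following sketch; the class name and the recycle interval are invented, and it assumes the Lucene 3.0 constructor taking a Directory, an Analyzer, and a MaxFieldLength:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class RecyclingIndexWriter {
    private final Directory dir;
    private final Analyzer analyzer;
    private final int docsPerWriter; // invented knob: docs before recycling
    private IndexWriter writer;
    private int count;

    public RecyclingIndexWriter(Directory dir, Analyzer analyzer, int docsPerWriter)
            throws IOException {
        this.dir = dir;
        this.analyzer = analyzer;
        this.docsPerWriter = docsPerWriter;
        this.writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void addDocument(Document doc) throws IOException {
        writer.addDocument(doc);
        if (++count >= docsPerWriter) {
            // Close the current writer so whatever it references can be GCed,
            // then reopen a fresh one on the same directory (appending).
            writer.close();
            writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
            count = 0;
        }
    }

    public void close() throws IOException {
        writer.close();
    }
}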