If we could change the Flex file so that yyreset(Reader) checked the size of zzBuffer, we could trim it when it gets too big. But I don't think we have that kind of control when writing the flex syntax ... yyreset is generated by JFlex, and it's the only place I can think of to trim the buffer down when it exceeds a predefined threshold ...
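To illustrate, here is a minimal sketch of what such a trim step could look like. This is NOT a real JFlex method; the field name zzBuffer and the 16 KB initial size mirror JFlex-generated scanners, but the class and method here are hypothetical:

```java
// Hypothetical sketch of a buffer-trimming hook for a JFlex-style scanner.
// zzBuffer and the 16K default mimic JFlex output; the rest is assumed.
class BufferTrimExample {
    private static final int ZZ_BUFFERSIZE = 16384; // JFlex's default initial size

    // Package-private here only so the example is easy to exercise;
    // in real JFlex-generated code this field is private.
    char[] zzBuffer = new char[ZZ_BUFFERSIZE];

    // Intended to be called right after yyreset(Reader): at that point no
    // unconsumed characters remain, so the buffer can be shrunk safely.
    void trimBufferIfTooBig(int threshold) {
        if (zzBuffer.length > threshold) {
            zzBuffer = new char[ZZ_BUFFERSIZE]; // cut back to the default
        }
    }
}
```

The key design point is that the method only ever shrinks, never grows, so calling it unconditionally after every reset costs almost nothing in the common case.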
Maybe what we can do is create our own method, called by StandardTokenizer after yyreset, something like trimBufferIfTooBig(int threshold), which reallocates zzBuffer if it exceeds the threshold. We can decide on a reasonable threshold like 64 KB, or simply always cut back to 16 KB. As far as I understand, that buffer should never grow that much: in zzRefill, which is the only place where the buffer gets resized, there is an attempt to first move back characters that were already consumed, and only then allocate a bigger buffer. Which means the buffer is expanded only if there is a single token larger than 16 KB (!?). A trimBuffer method might not be that bad as a protective measure. What do you think?

Of course, JFlex can fix it on their own ... but until that happens ...

Shai

On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > I would also like to identify the problematic document. I have 10000 or so;
> > what would be the best way of identifying the one that is making zzBuffer
> > grow without control?
>
> Don't index your documents, but instead pass them directly to the analyzer
> and consume the token stream manually. Then visit TermAttribute.termLength()
> for each token.