Thanks Jim. After your patient explanation, I think I am clear about where the problem is.

I think that in the current C runtime target, a huge input file leads to a huge number of
tokens that all have to be stored in memory before parsing starts. That makes the ANTLR C
runtime unsuitable for very large inputs because of the memory it consumes. The best choice
is probably to hand-craft a lexer (with its own token buffer) and override the token stream.
I believe that would also be very useful in scenarios where the input cannot be read all at
once, such as socket-based input.
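Roughly what I have in mind for the driver side is the sketch below. It is completely
untested, and "StepLexer" is only a placeholder for my own generated lexer; the point is
just to pull tokens one at a time from the token source instead of letting
antlr3CommonTokenStreamSourceNew() buffer every token of the file up front.

    #include <antlr3.h>
    #include "StepLexer.h"    /* generated lexer; the name is just a placeholder */

    int main(int argc, char *argv[])
    {
        pANTLR3_INPUT_STREAM  input;
        pStepLexer            lex;
        pANTLR3_TOKEN_SOURCE  src;
        pANTLR3_COMMON_TOKEN  tok;

        /* 8-bit file input stream, as in the 3.1 C runtime examples */
        input = antlr3AsciiFileStreamNew((pANTLR3_UINT8)argv[1]);
        lex   = StepLexerNew(input);

        /* Pull tokens straight from the token source, one at a time,
           instead of building the buffered common token stream.      */
        src = TOKENSOURCE(lex);

        for (tok = src->nextToken(src);
             tok->getType(tok) != ANTLR3_TOKEN_EOF;
             tok = src->nextToken(src))
        {
            /* handle the token here, then forget about it */
        }

        lex->free(lex);
        input->close(input);
        return 0;
    }

The piece I would still have to hand-craft, as you described, is a lexer that reuses tokens
instead of pre-allocating a new one for every nextToken() call; otherwise the token factory
memory will still grow with the size of the file.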
2008/12/17 Jim Idle <j...@temporal-wave.com>

> On Tue, 16 Dec 2008 18:28:50 -0800, chain one <chain...@gmail.com> wrote:
>
> Thanks Jim,
>
> Your suggestion is very helpful to me.
> I have checked my grammar according to your advice and now SKIP some useless
> tokens. After doing this, the peak memory usage decreased from 660M to 480M.
> However, 480M is still a very large number to me.
>
> I also found some strange behavior in the lexer of the ANTLR3 C runtime.
>
> The first two rules of my grammar to be called when parsing are:
>
> syntax
>     : ISO_HEADER header_sec data_sec ISO_END
>     ;
> header_sec
>     : HEADER file_des file_name file_schema END_SEC
>     ;
>
> I placed breakpoints in both functions. The first parameter of
> "antlr3CommonTokenStreamSourceNew", which I think is used to adjust the token
> buffer size, is left at the default value ANTLR3_SIZE_HINT (1025):
>
>     tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
>
> I believed the number of tokens would never exceed 1025 while parsing, but
> what happened later makes me doubt this.
>
> When debugging, I found the breakpoint in the syntax rule (the top rule) was
> triggered right after the parser started to work. That is expected, because
> syntax() is called from the main function.
>
> What is strange is that the breakpoint in header_sec was not triggered until
> memory had grown to 478M! It looks as if the lexer keeps all the tokens
> recognized from the large input file in a buffer before the parser really
> starts to work. Am I right? If I am, what does ANTLR3_SIZE_HINT mean, and how
> do I adjust the buffer size in the C runtime?
>
> 2008/12/17 Jim Idle <j...@temporal-wave.com>
>
>> On Tue, 16 Dec 2008 15:54:58 -0800, chain one <chain...@gmail.com> wrote:
>>
>> > Still waiting for help.
>> > I just want to know: is the C runtime target suitable for large input?
>>
>> Yes.
>>
>> > On 12/16/08, chain one <chain...@gmail.com> wrote:
>> >> Hi,
>> >> These days I am writing a parser for a kind of data file using C++. The
>> >> format of the data file is simple, so the rules are simple.
>> >> But when I feed an approximately 20M-size data file to the parser, it eats
>> >> more than 600M of memory. I was surprised by this result, and I found that
>> >> most of the memory and time were consumed by the lexer.
>>
>> There is possibly something not quite right with this then.
>>
>> However, a 20M input file is going to generate a lot of tokens, and you need
>> all of the tokens in order to parse the input, hence you are using a lot of
>> memory - especially if a lot of your tokens are just a few characters. If all
>> your tokens were one character then you would need 20M tokens - that is the
>> worst case, and your case will be something less than this.
>>
>> One way to reduce the number of tokens is to use the SKIP(); macro on tokens
>> that you don't need the parser to see, such as ',' or ' ' and so on.
>> Otherwise they are sitting in your token stream for no reason. Only mark them
>> as hidden and keep them in the token stream if you will need to examine them
>> later. Otherwise SKIP them.
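>>
>> For instance (the token names here are just made up for illustration),
>> something along these lines:
>>
>>     COMMA  :  ','                          { SKIP(); } ;
>>     WS     :  (' ' | '\t' | '\r' | '\n')+  { SKIP(); } ;
>>
>> The characters are still matched by the lexer, but no token for them is kept
>> in the stream that the parser reads.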
>
> I think you are not understanding the interaction. The lexer will run first
> and will produce a collection of ALL the tokens in the file. Once it has done
> this, THEN the parser will run. Hence you have a HUGE number of tokens - many,
> many more than 1024. You don't really need to change the size hint; this is
> just how many tokens the lexer token factory pre-allocates in one chunk.
> Increasing it will not change anything much, as only the tokens are getting
> allocated anyway (it is just faster to allocate them in chunks like this).
>
> Looking at what you are parsing tends to make me think you would be better off
> with a filtering lexer, avoiding the parser altogether. Then you don't need to
> deal with any tokens, as you can SKIP them all. Or perhaps this language does
> not lend itself to this kind of lexer/parser interaction if you cannot deal
> with the huge memory.
>
> Here is another way though, and that is to hand-craft a lexer that does not
> pre-allocate tokens like that and override the token stream etc.
>
> Jim