[il-antlr-interest: 24086] [antlr-interest] Customizing token separators without recompiling
Hi everyone,

I'm new to the list and new to ANTLR. I have a specific problem I need to solve and I hope ANTLR can help.

Our client has several end-customers who all have slightly different document formats used for data interchange. All the documents are basically 'standard' EDI documents, meaning they have the same basic syntax. However, some customers will use a '+' to separate values, some will use '*', others will use '~', etc. (I'm reminded of the old saying, "The great thing about standards is that there are so many to choose from!")

So, basically, the following inputs are all basically the same, except for the character used to separate tokens:

    FST*4290*D*W*20070607
    FST+4290+D+W+20070607
    FST~4290~D~W~20070607

The thing is, we don't know ahead of time which separator characters might be used in the future, and we need to be able to tweak each end-customer's file format without re-compiling the lexer/parser. For example, a year from now there might be a customer who decides to use '_' or '$' or whatever, and we need to provide our client with a simple way (e.g. a per-customer configuration file) to customize the lexer/parser for such situations, without re-generating/re-compiling.

So, is this possible with ANTLR? How would I do this? Would it require a custom Lexer subclass with constructor parameters (e.g. new CustomLexer('_')) or something? How would this mesh with the generated lexer code from ANTLR?

I'm quite new to tools such as ANTLR (and parsers in general), so any help would be much appreciated. I really don't know where to start with this problem. For a hand-coded parser it's fairly simple, but I don't know enough about the workings of ANTLR to see where I would need to tweak it.

Thanks,

Rob

--~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "il-antlr-interest" group.
To post to this group, send email to il-antlr-interest@googlegroups.com To unsubscribe from this group, send email to il-antlr-interest+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/il-antlr-interest?hl=en -~--~~~~--~~--~--~--- List: http://www.antlr.org/mailman/listinfo/antlr-interest Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
[il-antlr-interest: 24089] Re: [antlr-interest] Customizing token separators without recompiling
Hi,

Sorry, I'm not following you. How would that work? E.g. A new customer comes along, they have their format that uses '_' (or whatever), and how do I get the lexer/parser to recognize their file format without re-generating/re-compiling the lexer/parser?

What would Perl operate on? The grammar? Wouldn't that require re-generating/re-compiling the lexer?

Rob

Date: Sun, 7 Jun 2009 12:48:50 -0700
From: jsrs...@yahoo.com
Subject: Re: [antlr-interest] Customizing token separators without recompiling
To: antlr-inter...@antlr.org; dukie_bander...@hotmail.com

Howdy,

I'm guessing there's more to the problem than just supporting arbitrary field separation tokens, because if that's all there is, just use something like perl and store the separator(s) in a config file...?

--S
[il-antlr-interest: 24092] Re: [antlr-interest] Customizing token separators without recompiling
"If you simply want to break apart a line of text based on an arbitrary delimiter, it would be much easier to write a program in Perl, Python, Java, etc. that split the text based on a configuration setting."

That's basically what I'm doing right now (in C#, by hand). Are you saying that ANTLR can't work at all with this?

At some level it becomes a parsing issue. Each line has a different meaning, and should perform a different action and/or gather different information.

It seems to me that these files would lend themselves very well to an intermediate AST form. For example, the style of document I showed you earlier was an Ansi 830 format. There is another format which is UN Edifact, which looks like this:

    DTM+2:20080523:102'
    QTY+1:1500:EA'
    SCC+1++D:ZZZ'

Although this looks totally different, it is logically the same information as the previous example I showed (FST*...).

I was hoping to use ANTLR to work on two different grammars to translate the raw text into tokens, which could further be translated into a generic command tree (basically to add records into a DB) that would be functionally equivalent whether it originally came from Ansi 830 or UN Edifact.

It seems to me that ANTLR would have been a good tool to use to do this translation. I'd rather not be forced to do the entire thing by hand just because of this token separator issue.

Is there a way I could perform the token splitting manually (as you suggest), but then feed the resulting tokens into an ANTLR-generated parser to do the rest of the work?

Thanks,

Rob

Date: Sun, 7 Jun 2009 15:02:09 -0700
From: jsrs...@yahoo.com
Subject: RE: [antlr-interest] Customizing token separators without recompiling
To: antlr-inter...@antlr.org; dukie_bander...@hotmail.com

Oh, I'm saying you wouldn't want to use a grammar at all. The problem you've described is lexical, not grammatical. If you simply want to break apart a line of text based on an arbitrary delimiter, it would be much easier to write a program in Perl, Python, Java, etc. that split the text based on a configuration setting.

If further parsing needs to happen on the newly-split fields, then you can attack that problem piecemeal on an individual basis.

Make sense?
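The "split on a configured delimiter" suggestion can be sketched roughly as follows. This is only an illustration in Java: the class name SegmentSplitter and the property key "separator" are made up for the example, and the per-customer config is assumed to be a simple key=value text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical helper: splits one segment on a separator read from config.
final class SegmentSplitter {
    private final Pattern separator;

    SegmentSplitter(char separator) {
        // Pattern.quote keeps regex metacharacters such as '+' and '*' literal.
        this.separator = Pattern.compile(Pattern.quote(String.valueOf(separator)));
    }

    // "separator" is an assumed property name, not part of any EDI standard.
    static SegmentSplitter fromConfig(String configText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("separator", "*");
        if (value.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        return new SegmentSplitter(value.charAt(0));
    }

    List<String> split(String segment) {
        // Limit -1 keeps trailing empty fields instead of silently dropping them.
        return Arrays.asList(separator.split(segment, -1));
    }
}
```

A new customer who uses '_' would then only need a config entry such as separator=_ and no regenerated code.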
[il-antlr-interest: 24094] Re: [antlr-interest] Customizing token separators without recompiling
Thanks, Steve, that looks very promising!

Rob

> Date: Mon, 8 Jun 2009 01:14:19 +0100
> Subject: Re: [antlr-interest] Customizing token separators without recompiling
> From: st...@stevecooper.org
> To: dukie_bander...@hotmail.com
> CC: jsrs...@yahoo.com; antlr-inter...@antlr.org
>
> I don't know if this is any closer, but I had this idea.
>
> Your problem seems to be in getting a lexer which will give you the
> right stream of tokens, and not in writing the parser that feeds off
> them. You could write your own lexer to do the splitting of the
> strings, and use ANTLR to write the parser. ANTLR parsers don't feed
> directly off a string, but off an ITokenSource object;
>
>     public interface ITokenSource
>     {
>         string SourceName { get; }
>         IToken NextToken();
>     }
>
> You could create your own token source which would do the separation
> by hand, and return a stream of tokens. Something like
>
>     public class UnEdifactLexer: ITokenSource
>     {
>         // token types
>         public const int EOF = -1;
>         public const int ID = 0;
>         public const int NUMBER = 1;
>         public const int COLON = 2;
>         ...
>
>         // all the tokens in the input
>         private Queue<IToken> tokens;
>
>         public UnEdifactLexer(string input, char userSeparator)
>         {
>             this.tokens = new Queue<IToken>();
>             foreach(var line in input.Split('\n'))
>             {
>                 foreach(var piece in CustomSplit(line, userSeparator))
>                 {
>                     // custom code to convert a line
>                     // into a set of tokens
>                     tokens.Enqueue(new Token(...));
>                 }
>             }
>         }
>
>         public IToken NextToken()
>         {
>             if (tokens.Count > 0)
>                 return tokens.Dequeue();
>             else
>                 return new Token(EOF,...);
>         }
>     }
>
> Then you write a parser grammar in ANTLR which does the parsing and
> tree-building.
>
> Anyway, the benefit of this approach is that you have full power over
> splitting up the strings and converting them into tokens. After that,
> the parser takes up the strain.
>
> Steve
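Steve's hand-written token source idea can be sketched in a self-contained way, here in Java. To keep the example runnable without the ANTLR runtime, Tok is a stand-in for the runtime's token type and every name and token-type number below is hypothetical; a real version would implement the runtime's token source interface instead.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Stand-in for the ANTLR token type; names and type numbers are hypothetical.
final class Tok {
    static final int EOF = -1, TEXT = 1, SEP = 2, NEWLINE = 3;
    final int type;
    final String text;
    Tok(int type, String text) { this.type = type; this.text = text; }
}

final class SeparatorTokenSource {
    private final Queue<Tok> tokens = new ArrayDeque<>();

    // All splitting happens up front; the separator is a plain runtime value.
    SeparatorTokenSource(String input, char separator) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == separator || c == '\n') {
                flush(text);
                tokens.add(new Tok(c == '\n' ? Tok.NEWLINE : Tok.SEP, String.valueOf(c)));
            } else if (c != '\r') {
                text.append(c);
            }
        }
        flush(text);
    }

    // Emits the pending text, if any, as one TEXT token.
    private void flush(StringBuilder text) {
        if (text.length() > 0) {
            tokens.add(new Tok(Tok.TEXT, text.toString()));
            text.setLength(0);
        }
    }

    // Drains the queue, then keeps answering EOF.
    Tok nextToken() {
        Tok next = tokens.poll();
        return next != null ? next : new Tok(Tok.EOF, "<EOF>");
    }
}
```

Because the separator is an ordinary constructor argument, supporting a new customer format does not touch the generated parser.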
[il-antlr-interest: 24108] Re: [antlr-interest] Customizing token separators without recompiling
Thanks Jim, that looks more like what I originally had in mind.

Rob

> From: j...@temporal-wave.com
> To: dukie_bander...@hotmail.com
> Subject: Re: [antlr-interest] Customizing token separators without recompiling
> Date: Sun, 7 Jun 2009 18:52:06 -0700
> CC: jsrs...@yahoo.com; antlr-inter...@antlr.org
>
> Hi,
>
> If the entire structure is just these lines then it is likely that a
> parser is overkill to be honest. However you can create a lexer rule
> that changes its definition at runtime, but you must be careful that
> the set of delimiters would never otherwise appear in the input.
>
> What you do is add a member method to the lexer that accepts
> the delimiter then use a gated predicate to select the token:
>
>     @lexer::members {
>         protected int delim;
>         public void setDelim(int d) {
>             delim = d;
>         }
>     }
>
>     DELIM : {input.LA(1) == delim}?=> . ;
>
> But note that by using this rule, you will always get DELIM for that
> character and so if you had:
>
>     SEMI : ';' ;
>
> But set the delimiter to ';' then you would no longer get SEMI.
>
> Perhaps it would be best to write a custom lexer.
>
> EDI is another good idea screwed up by design by committee where none
> of the members will give up their proprietary formats :(
>
> Jim
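The caveat in Jim's reply (a runtime delimiter shadows any fixed token that uses the same character) can be illustrated with a tiny hand-written classifier. This is a Java sketch of the idea only, not ANTLR-generated code; the class and constant names are invented for the example.

```java
// Hypothetical single-character classifier mimicking the rule order in the
// grammar fragment: the runtime delimiter check is tried before SEMI.
final class DelimLexerSketch {
    static final int DELIM = 1, SEMI = 2, OTHER = 3;

    private int delim = '*';

    void setDelim(int d) {
        delim = d;
    }

    int classify(int c) {
        if (c == delim) return DELIM;   // plays the role of the gated predicate
        if (c == ';') return SEMI;      // unreachable for ';' once delim == ';'
        return OTHER;
    }
}
```

With the default delimiter, ';' is still reported as SEMI; after setDelim(';') the same character comes back as DELIM, which is the situation Jim warns about.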
[il-antlr-interest: 24111] [antlr-interest] Bug in AntlrWorks debugger
Hi,

Is this the right place to post AntlrWorks bugs? I looked around but didn't find any other place.

It seems that AntlrWorks does not accept Tab characters (or backslashes, for that matter) in the Text field of the Input Text dialog box when you press the Debug button. The result was that the grammar was compiled, and prepared, but when it came to connecting to the debugger it timed out. Removing the Tab character from the input allowed the debugger to connect.

Originally, I thought for sure it was something to do with Windows Firewall, or my Java configuration, so I reinstalled Java and reset the Firewall exception for java.exe and javaw.exe. No change. It's definitely the Tab character in the input text.

I thought this was very strange, so I tried to use \t instead of a Tab character, however this also caused the debugger connection to time out. I also tried \\t and t, just in case it was some sort of escaping issue, but none of these worked either. Only when I remove Tabs and Backslashes does the debugger connect, and then debugging works fine.

I'm using Windows, and I wonder if this might have something to do with it. But still it seems strange. How is the input text transmitted to the debugger? Via command-line? That seems to me the only way a Tab or Backslash could interfere with connecting to the debugger.

Anyone else notice this?

Rob
[il-antlr-interest: 24125] Re: [antlr-interest] Bug in AntlrWorks debugger
> That could be. it's possible that the tab character means moveto the
> next field
> Ter

No, it was definitely a Tab char in the input box. However, I discovered what the real underlying problem was. It wasn't actually the Tab char that was the problem, but that my grammar didn't recognize Tab, and so it was causing an infinite loop (I mistakenly used a * quantifier in one of my lexer rules instead of a +). This in turn caused AntlrWorks to think it was failing to connect, when actually the real problem was that the lexer process was stuck in a loop.

Does the debugger perform a complete parse *before* attempting to connect to the process? That would probably explain it. I was expecting the debugger to first connect, and *then* attempt to parse. In that way, I could have debugged the infinite looping. However, all I saw was a mysterious "Failed to connect" message which sent me on a wild goose chase.

FYI here's a sample grammar and input that would cause this usability bug to occur:

    grammar Bug;

    file : line+ EOF ;
    line : field+ terminator ;
    field : SEP TEXT? ;
    terminator : (NEWLINE | EOF) ;

    NEWLINE: '\r'? '\n' ;
    TEXT : ('a'..'z'|'A'..'Z'|' ')* ; // NOTE: Does not match Tab, and also has a * quantifier
    SEP: '*' ;

Input:

    *a*a[Tab]a**

Note: Replace '[Tab]' with an actual Tab char. (In fact, even the '[' char will cause the bug to occur.)

Rob

> CC: antlr-inter...@antlr.org
> From: pa...@cs.usfca.edu
> To: dukie_bander...@hotmail.com
> Subject: Re: [antlr-interest] Bug in AntlrWorks debugger
> Date: Mon, 8 Jun 2009 13:22:13 -0700
>
> On Jun 8, 2009, at 12:14 PM, Dukie Banderjee wrote:
>
>> Hi,
>>
>> Is this the right place to post AntlrWorks bugs? I looked around but
>> didn't find any other place.
>>
>> It seems that AntlrWorks does not accept Tab characters (or
>> backslashes, for that matter) in the Text field of the Input Text
>> dialog box when you press the Debug button.
>
> That could be. it's possible that the tab character means moveto the
> next field
> Ter
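Why a lexer rule that can match nothing may stall on an unrecognized character can be shown with a small hand-written loop. This Java sketch only mirrors the TEXT and SEP rules of the sample grammar; it is an illustration of the failure mode under that assumption, not a description of how the ANTLR-generated lexer is implemented, and all names are invented.

```java
import java.util.ArrayList;
import java.util.List;

final class EmptyMatchDemo {
    // Length of the run of letters/spaces at pos. Can be 0, like TEXT with '*'.
    static int matchText(String in, int pos) {
        int end = pos;
        while (end < in.length() && isTextChar(in.charAt(end))) {
            end++;
        }
        return end - pos;
    }

    static boolean isTextChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == ' ';
    }

    static List<String> lex(String in) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            if (in.charAt(pos) == '*') {
                out.add("SEP");
                pos++;
                continue;
            }
            int len = matchText(in, pos);
            if (len == 0) {
                // Without this guard the loop would produce empty TEXT tokens
                // forever, because pos never advances past the bad character.
                throw new IllegalStateException("no progress at offset " + pos);
            }
            out.add("TEXT(" + in.substring(pos, pos + len) + ")");
            pos += len;
        }
        return out;
    }
}
```

Requiring at least one character per TEXT match (the '+' quantifier) makes the zero-length case impossible by construction.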
[il-antlr-interest: 24185] [antlr-interest] Keywords vs. freeform text
Hi,

Hope this isn't too much of a newbie question. I need to parse a format (EDI) which is basically delimited fields, but some fields must contain standardized code values whereas other fields can contain freeform text.

My question is related to lexing and/or parsing. Do I need to/want to have a lexer token for each possible code, or should I just accept a freeform TEXT token, and then later parse the actual text to determine if it's a valid code?

I currently have a grammar which handles *some* of the more important codes by specifying lexer tokens. E.g.:

    ST: 'ST' ;
    BFR: 'BFR' ;
    N1: 'N1' ;
    REF: 'REF' ;
    etc...

And a freeform TEXT token:

    TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+ ;

Then I use a parser rule for those possible fields where *any* text is allowed, even a code:

    fieldText: TEXT | code ;
    code: ST | BFR | N1 | REF etc...

This seems to be working okay for now, but I foresee problems as I'm trying to expand the grammar to work with all the various codes defined by the EDI standards. For example, some of the codes contain solely numeric characters, such as '09' or '01' or '12'. Later, I want to add checking for freeform numeric fields, such as those which might contain quantities or arbitrary integers. I think it will start to get ugly if I try to specify lexer tokens and parser rules like this:

    CODE_09: '09'
    NUMERIC: ('0'..'9')+
    numericField: NUMERIC | numericCode
    numericCode: CODE_09 | CODE_01 ... etc.

The core issue is that I need to *sometimes* treat certain fixed sequences of characters (e.g. 'ST' or '09') as special, and sometimes as merely freeform text or numeric values. I'm fairly new to ANTLR (and parsing/lexing), so I'm not really sure what's a good way to resolve this. Any tips/pointers?

Example input:

    ISA*00**00**01*812520286   // Here '01' is a special code which determines the type/format of the following field
    SE*01*1052                 // Here '01' is simply a numeric value which should be interpreted as an integer.

As a side topic, how do I write a lexer which properly handles both:

    a) freeform alphanumeric (and spaces) input such as ('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+
    b) freeform numeric input such as ('0'..'9')+

Is this doomed to be ambiguous? Should it be handled by the parser? Is there a way to handle it in the lexer?

Thanks

Rob
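One of the options raised in the question (lex every field as plain text and decide afterwards, by position, whether it must be a code) might look roughly like this in Java. The code table is invented purely for illustration and does not reflect the actual EDI code lists; FieldValidator and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

final class FieldValidator {
    // Hypothetical table: segment tag -> field index -> allowed code values.
    // The entries are made up for the example, not taken from the standard.
    private static final Map<String, Map<Integer, Set<String>>> CODES =
        Map.of("ISA", Map.of(5, Set.of("01", "ZZ")));

    // Splits a segment into raw text fields; index 0 is the segment tag.
    static String[] fields(String segment, char sep) {
        return segment.split(Pattern.quote(String.valueOf(sep)), -1);
    }

    // True when the position is free-form, or holds a code allowed there.
    static boolean isValid(String[] fields, int index) {
        Set<String> allowed = CODES.getOrDefault(fields[0], Map.of()).get(index);
        return allowed == null || allowed.contains(fields[index]);
    }
}
```

Under this approach '01' is never a special token: in the ISA line it is checked against the table for position 5, while in the SE line the same text has no table entry and can simply be parsed as an integer.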
[il-antlr-interest: 24223] [antlr-interest] Basic predicate question re: lexer
Hi,

I'm working on a parser for a file format that can contain text and delimiters. One of the delimiters is a ':', and you can escape the delimiter by following it with a '?' such as ':?'.

I'd like to have the lexer consider the ':?' as part of the TEXT token, and ':' match the SEPARATOR token. So I tried this:

    file: contents+ EOF ;
    contents: TEXT | CSEP ;

    CSEP: ':';
    TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|COLON)+ ;
    fragment COLON: ':?' ;

This didn't work, as the input "hello:" caused an error. I guess it was expecting to continue a TEXT token (next char would be '?'), and met with an EOF instead.

I imagine there's a way to do this, perhaps with predicates, but I'm not experienced enough to see the obvious solution. Can anyone help?

Thanks,

Rob
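The decision the lexer has to make here needs one character of lookahead: a ':' is text only if the next character is '?'. A hand-written Java sketch of that check follows, assuming the escape convention exactly as described in the question (':' followed by '?'); the class name and the token labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapeLexerSketch {
    static List<String> lex(String in) {
        List<String> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            // Look one character ahead before deciding what ':' means.
            boolean escaped = c == ':' && i + 1 < in.length() && in.charAt(i + 1) == '?';
            if (escaped) {
                text.append(":?");
                i++; // consume the '?' as well
            } else if (c == ':') {
                if (text.length() > 0) {
                    out.add("TEXT(" + text + ")");
                    text.setLength(0);
                }
                out.add("CSEP");
            } else {
                text.append(c);
            }
        }
        if (text.length() > 0) {
            out.add("TEXT(" + text + ")");
        }
        return out;
    }
}
```

The input "hello:" that failed in the grammar comes out here as a TEXT token followed by CSEP, because the lookahead sees end of input rather than '?'.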
[il-antlr-interest: 24224] [antlr-interest] CSharp2 code generation bug for ANTLRWorks 1.2.3 with -debug
The following grammar produces uncompilable code when generated from ANTLRWorks using -debug in the ANTLR options. I'm not sure which version of ANTLR is being used by ANTLRWorks. If it matters, I have ANTLR 3.1.3 on my machine.

    grammar EdifactDelfor;

    options { language = 'CSharp2' ; }
    tokens { }

    file: contents+ EOF ;
    contents: TEXT | SEP | WS | CSEP | EOL ;

    EOL: '\'';
    SEP: '+';
    CSEP: ':';
    TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|'?')+ ;
    WS: ('\r'? '\n')+ ;

Here is the culprit code that was generated (in the file() method):

    default:
        if (cnt1 >= 1) goto loop1;
        EarlyExitException eee1 = new EarlyExitException(1, input);
        dbg.RecognitionException(eee); // Note the missing '1': should be eee1
        throw eee1;

When I manually change the reference to eee1, the thing compiles. This bug does not appear when -debug is turned off.

Rob
[il-antlr-interest: 24225] [antlr-interest] Matching empty string
Hi,

My grammar needs to handle the following situation: A line can have multiple fields, separated by a delimiter. A field can have multiple components, separated by another delimiter. If a field or component is blank, it should be counted as a blank field or blank component.

For example with field delimiter '+' and component delimiter ':' :

    UNB++::+123

is a 'UNB' line with 3 fields. The first field is blank, the second field has 3 blank components, and the last field has a single component with the value '123'.

Here is my grammar so far:

    line : TEXT fields ;
    fields: field* terminator! ;
    field: SEP t=fieldText? -> ^(FIELD $t?) ;
    fieldText: comp (CSEP comp)* ;
    comp: t=TEXT -> ^(COMP $t) ;

When a field is blank, e.g. '++', this correctly generates a ^(FIELD) with no children. However, I don't know how to get similar behaviour for the components, because whereas a field starts with a SEP and optional TEXT, the component may or may not have a leading CSEP. When the input is '+::+', there are three components, but the first is entirely blank, an empty string.

What I would like is that '+::+' creates ^(FIELD ^(COMP) ^(COMP) ^(COMP)). How can I accomplish this?

Thanks,

Rob
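The counting rule in the example (a blank field has no components, while '::' has three blank ones) can be pinned down with a small hand-written splitter. This Java sketch is not an ANTLR solution to the rewrite question; it only makes the intended result concrete, and BlankAwareSplitter is an invented name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class BlankAwareSplitter {
    // Returns one list of components per field; the leading tag is dropped.
    static List<List<String>> parse(String segment, char fieldSep, char compSep) {
        String fieldRegex = Pattern.quote(String.valueOf(fieldSep));
        String compRegex = Pattern.quote(String.valueOf(compSep));
        // Limit -1 keeps empty strings, so blank fields are not lost.
        String[] parts = segment.split(fieldRegex, -1);
        List<List<String>> fields = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {   // parts[0] is the tag, e.g. "UNB"
            List<String> comps = new ArrayList<>();
            if (!parts[i].isEmpty()) {             // blank field: no components at all
                for (String comp : parts[i].split(compRegex, -1)) {
                    comps.add(comp);
                }
            }
            fields.add(comps);
        }
        return fields;
    }
}
```

For 'UNB++::+123' this gives three fields: an empty one, one with three empty components, and one holding '123', which matches the ^(FIELD), ^(FIELD ^(COMP) ^(COMP) ^(COMP)) shape asked for.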
[il-antlr-interest: 24246] [antlr-interest] CSharp2 -debug generation bug
The following grammar compiles fine under ANTLR 3.1.3 except if you use the -debug option, in which case it throws an exception during generation. Exception trace follows.

The culprit line is:

    message: unhSegment bgmSegment segment+ linLoop untSegment -> ^(MESSAGE unhSegment bgmSegment segment+ linLoop untSegment) ;

On this line, I added linLoop on both sides. linLoop in turn references segment+, which I suspect might be the problem. However, this grammar appears to generate okay when -debug is off. (The grammar is functionally flawed, but in any case it should not cause an exception for ANTLR to generate the parser from it.)

Thanks,

Rob

    grammar Test;

    options { language = 'CSharp2' ; output = AST ; }

    tokens {
        INTERCHANGE;
        GROUP;
        MESSAGE;
        LOOP;
        SECTION;
        SEGMENT;
        ELEMENT;
        COMPONENT;
    }

    file: interchange EOF -> interchange;

    interchange: unaSegment? unbSegment group unzSegment -> ^(INTERCHANGE unaSegment? unbSegment group unzSegment) ;
    group: ungSegment message+ uneSegment -> ^(GROUP ungSegment message+ uneSegment) ;
    message: unhSegment bgmSegment segment+ linLoop untSegment -> ^(MESSAGE unhSegment bgmSegment segment+ linLoop untSegment) ;
    linLoop: linSection+ -> ^(LOOP linSection+) ;
    linSection: linSegment segment+ -> ^(SECTION linSegment segment+ ) ;

    bgmSegment: tagBGM elements -> ^(SEGMENT tagBGM elements) ;
    linSegment: tagLIN elements -> ^(SEGMENT tagLIN elements) ;
    tagBGM: { input.LT(1).Text == "BGM" }? TEXT ;
    tagLIN: { input.LT(1).Text == "LIN" }? TEXT ;

    unaSegment: tagUNA elements -> ^(SEGMENT tagUNA elements) ;
    unbSegment: tagUNB elements -> ^(SEGMENT tagUNB elements) ;
    ungSegment: tagUNG elements -> ^(SEGMENT tagUNG elements) ;
    unhSegment: tagUNH elements -> ^(SEGMENT tagUNH elements) ;
    untSegment: tagUNT elements -> ^(SEGMENT tagUNT elements) ;
    uneSegment: tagUNE elements -> ^(SEGMENT tagUNE elements) ;
    unzSegment: tagUNZ elements -> ^(SEGMENT tagUNZ elements) ;

    tagUNA: { input.LT(1).Text == "UNA" }? TEXT ;
    tagUNB: { input.LT(1).Text == "UNB" }? TEXT ;
    tagUNG: { input.LT(1).Text == "UNG" }? TEXT ;
    tagUNH: { input.LT(1).Text == "UNH" }? TEXT ;
    tagUNT: { input.LT(1).Text == "UNT" }? TEXT ;
    tagUNE: { input.LT(1).Text == "UNE" }? TEXT ;
    tagUNZ: { input.LT(1).Text == "UNZ" }? TEXT ;

    segment: tag elements -> ^(SEGMENT tag elements);
    //{ Console.WriteLine("Found segment: " + $tag.text); }
    tag: TEXT ;

    ignoredLine: unknownDiscriminator! elements! ;
    unknownDiscriminator: TEXT;

    elements: element* terminator! ;
    element: ELEMENT_SEPARATOR t=components? -> ^(ELEMENT $t?) ;
    components: comp1 comp2* ;
    comp2: COMPONENT_SEPARATOR t=TEXT? -> ^(COMPONENT $t?) ;
    comp1: t=TEXT -> ^(COMPONENT $t)
         | COMPONENT_SEPARATOR t=TEXT? -> ^(COMPONENT) ^(COMPONENT $t?) ;

    terminator: SEGMENT_TERMINATOR ;
    //terminator: (EOL | WS)+ ;

    SEGMENT_TERMINATOR: '\'';
    ELEMENT_SEPARATOR: '+';
    COMPONENT_SEPARATOR: ':';
    //TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|'?')+ ;
    TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|ESCAPE)+ ;
    fragment ESCAPE : '?' . ;
    WS: ('\r'? '\n')+ { $channel = 99; } ;

error(10): internal error: <>.g : java.util.NoSuchElementException: no such attribute: description in template context [outputFile parser genericParser(...) cyclicDFA if(dfa.specialStateSTs)_subtemplate anonymous cyclicDFAState cyclicDFAEdge notPredicate evalPredicate(...)]
    org.antlr.stringtemplate.StringTemplate.checkNullAttributeAgainstFormalArguments(StringTemplate.java:1276)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:800)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
    org.antlr.stringtemplate.StringTemplate.getAttribute(StringTemplate.java:682)
    org.antlr.stringtemplate.language.ActionEvaluator.attribute(ActionEvaluator.java:360)
    org.antlr.stringtemplate.language.ActionEvaluator.expr(ActionEvaluator.java:136)
    org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:84)
    org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:148)
    org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:700)
    org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:722)
    org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:659)
    org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
    org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:148)
    org.a
[il-antlr-interest: 24287] [antlr-interest] Multi-phase tree rewriting question
Hi,

I'm considering how to solve the following problem:

The documents I'm parsing (EDI) are actually wrapped in 'envelopes'. Similar to XML, the envelopes have a beginning line and an ending line. The lines in between are the actual documents which are wrapped. However, different wrapped documents have different formats, so my current idea is to have a grammar for each document format, and an 'outer' grammar for the enveloping structure. The envelope grammar knows about its beginning lines and ending lines, and the fact that there are lines in between, but it does not parse the contained lines into a hierarchical AST; that job is for the specific document parser.

So, the idea is that I first parse a file into an envelope AST which has this basic structure:

^(ENVELOPE ...
    ^(GROUP ...
        ^(DOCUMENT type
            ^(LINE ...)    // Straight list of lines
            ^(LINE ...)
            ...
        )
        ...
    )
    ...
)

Then the next phase would be to rewrite the DOCUMENT AST to add some structure to it, based on the EDI standards:

^(DOCUMENT type
    ^(LINE ...)    // Straight list of lines
    ^(LINE ...)
    ...
)

becomes:

^(DOCUMENT type
    ^(HEADER_INFO ...)    // More-detailed AST
    ^(ITEMS
        ^(ITEM
            ^(DETAIL ...)    // Hierarchical structure added
            ^(DETAIL ...)
            ...
        )
        ...
    )
    ...
)

A major reason to go with this approach, I think, is that there are literally hundreds of different types of EDI documents (although I'm only concerned with parsing fewer than ten of these types), but the enveloping structure is exactly the same, regardless of the contained documents.

Also, there are different versions of the EDI standards, and so I would need slightly different parsers for different versions of the same document type. For example, an Edifact DELFOR document may be based on the 96a standard, or the 07b standard, or whatever, with slightly different (sometimes largely different) structures.
So, I could have Delfor96aParser and Delfor07bParser, and the information returned by the envelope grammar would allow me to select which version of EDI is being used, and therefore which document parser to invoke.

So, if I had this envelope parser, which can generate a document AST that is a straight list of lines, how would I then further process this AST to add some structure to it, using a document-specific parser?

I was looking at ANTLR's 'rewrite' option, but from what I've read it seems to only work with template output. I don't want to output text; I want to rewrite AST to AST, and then process the AST later. There is no text output needed. Would I use a tree-walker? If so, wouldn't this require creating a whole new copy of the document AST? Is this the right approach? Or is an in-place rewrite more appropriate? If so, how do I do it?

Any help would be appreciated, even if it's just pointing me to the right place to find the info.

Thanks,
Rob
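For illustration only, the two-phase idea in this message can be sketched outside ANTLR in a few lines of Python. This is not ANTLR's tree API and not an answer from the list: the `Node` class, the `RESTRUCTURERS` table, and the function names are all hypothetical, and the grouping rule (lines before the first LIN form the header, each LIN opens an item) is an assumption made for the example. The sketch regroups the existing LINE nodes under new parents and swaps the DOCUMENT node's children in place, so the lines themselves are not copied.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """Minimal stand-in for an AST node (hypothetical, not ANTLR's tree class)."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def sexpr(self) -> str:
        # Render in the ^(ROOT child ...) notation used in the message.
        if not self.children:
            return self.label
        inner = " ".join(child.sexpr() for child in self.children)
        return f"^({self.label} {inner})"


def line(tag: str) -> Node:
    # A LINE node whose first child holds the segment tag.
    return Node("LINE", [Node(tag)])


def restructure_by_lin(doc: Node) -> None:
    """Regroup a flat DOCUMENT in place (assumed rule, for illustration):
    lines before the first LIN go under HEADER_INFO; each LIN opens an ITEM,
    and the lines that follow it are wrapped in DETAIL nodes."""
    doc_type, lines = doc.children[0], doc.children[1:]
    header = Node("HEADER_INFO")
    items = Node("ITEMS")
    current: Optional[Node] = None
    for ln in lines:
        tag = ln.children[0].label
        if tag == "LIN":
            current = Node("ITEM", [ln])      # existing LINE node is reused
            items.children.append(current)
        elif current is None:
            header.children.append(ln)
        else:
            current.children.append(Node("DETAIL", [ln]))
    doc.children = [doc_type, header, items]  # in-place swap on the same node


# Hypothetical registry keyed by (document type, standard version).
RESTRUCTURERS: Dict[Tuple[str, str], Callable[[Node], None]] = {
    ("DELFOR", "96a"): restructure_by_lin,
    ("DELFOR", "07b"): restructure_by_lin,  # a real 07b pass would likely differ
}


def second_phase(doc: Node, version: str) -> None:
    # Select the document-specific pass from what the envelope phase reported.
    doc_type = doc.children[0].label
    RESTRUCTURERS[(doc_type, version)](doc)


doc = Node("DOCUMENT", [Node("DELFOR"), line("BGM"), line("DTM"),
                        line("LIN"), line("QTY"), line("LIN"), line("QTY")])
second_phase(doc, "96a")
print(doc.sexpr())
```

Keeping the envelope phase ignorant of document structure means a new document type or version only needs a new entry in the table; whether the same split is best expressed in ANTLR as a tree grammar or as hand-written code over the AST is left open here.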