[il-antlr-interest: 24086] [antlr-interest] Customizing token separators without recompiling

2009-06-07 Thread Dukie Banderjee

Hi everyone,

I'm new to the list and new to ANTLR. I have a specific problem I need to solve 
and I hope ANTLR can help.

Our client has several end-customers who all have slightly different document 
formats used for data interchange.

All the documents are basically 'standard' EDI documents, meaning they have the 
same basic syntax. However, some customers will use a '+' to separate values, 
some will use '*', others will use '~', etc. (I'm reminded of the old saying, 
"The great thing about standards is that there are so many to choose from!")

So, basically, the following inputs are all basically the same, except for the 
character used to separate tokens:
FST*4290*D*W*20070607
FST+4290+D+W+20070607
FST~4290~D~W~20070607

The thing is, we don't know ahead of time which separator characters might be 
used in the future, and we need to be able to tweak each end-customer's file 
format without re-compiling the lexer/parser. For example, a year from now 
there might be a customer who decides to use '_' or '$' or whatever, and we 
need to provide our client with a simple way (e.g. a per-customer configuration 
file) to customize the lexer/parser for such situations, without 
re-generating/re-compiling.
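
For reference, the hand-coded version of this requirement might look roughly like the following. It is a minimal Python sketch rather than ANTLR code, and the `CUSTOMER_SEPARATORS` table and customer names are hypothetical stand-ins for a per-customer configuration file.

```python
# Hypothetical per-customer settings; in practice these would be loaded
# from a configuration file rather than hard-coded.
CUSTOMER_SEPARATORS = {
    "customer_a": "*",
    "customer_b": "+",
    "customer_c": "~",
}


def split_segment(line, customer):
    """Split one segment into fields using the customer's separator."""
    separator = CUSTOMER_SEPARATORS[customer]
    return line.rstrip("\r\n").split(separator)


if __name__ == "__main__":
    print(split_segment("FST*4290*D*W*20070607", "customer_a"))
    print(split_segment("FST~4290~D~W~20070607", "customer_c"))
```

Adding a customer who uses '_' or '$' would then only mean adding one entry to the table, with no code change.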

So, is this possible with ANTLR? How would I do this? Would it require a custom 
Lexer subclass with constructor parameters (e.g. new CustomLexer('_')) or 
something? How would this mesh with the generated lexer code from ANTLR?

I'm quite new to tools such as ANTLR (and parsers in general), so any help 
would be much appreciated. I really don't know where to start with this 
problem. For a hand-coded parser it's fairly simple, but I don't know enough 
about the workings of ANTLR to see where I would need to tweak it.

Thanks,

Rob

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"il-antlr-interest" group.
To post to this group, send email to il-antlr-interest@googlegroups.com
To unsubscribe from this group, send email to 
il-antlr-interest+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/il-antlr-interest?hl=en
-~--~~~~--~~--~--~---


List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: 
http://www.antlr.org/mailman/options/antlr-interest/your-email-address


[il-antlr-interest: 24089] Re: [antlr-interest] Customizing token separators without recompiling

2009-06-07 Thread Dukie Banderjee

Hi,

Sorry, I'm not following you. How would that work? E.g. A new customer comes 
along, they have their format that uses '_' (or whatever), and how do I get the 
lexer/parser to recognize their file format without re-generating/re-compiling 
the lexer/parser? What would Perl operate on? The grammar? Wouldn't that 
require re-generating/re-compiling the lexer?

Rob

Date: Sun, 7 Jun 2009 12:48:50 -0700
From: jsrs...@yahoo.com
Subject: Re: [antlr-interest] Customizing token separators without recompiling
To: antlr-inter...@antlr.org; dukie_bander...@hotmail.com

Howdy,

I'm guessing there's more to the problem than just supporting arbitrary field 
separation tokens, because if that's all there is, just use something like perl 
and store the separator(s) in a config file...?

--S



[il-antlr-interest: 24092] Re: [antlr-interest] Customizing token separators without recompiling

2009-06-07 Thread Dukie Banderjee


"If you simply want to break apart a line of text based on an arbitrary
delimiter, it would be much easier to write a program in Perl, Python,
Java, etc. that split the text based on a configuration setting."

That's basically what I'm doing right now (in C#, by hand). Are you saying that 
ANTLR can't work at all with this?

At some level it becomes a parsing issue. Each line has a different meaning, 
and should perform a different action and/or gather different information.

It seems to me that these files would lend themselves very well to an 
intermediate AST form. For example, the style of document I showed you earlier 
was an ANSI 830 format. There is another format, UN/EDIFACT, which 
looks like this:
DTM+2:20080523:102'
QTY+1:1500:EA'
SCC+1++D:ZZZ'

Although this looks totally different, it is logically the same information as 
the previous example I showed (FST*...).

I was hoping to use ANTLR to work on two different grammars to translate the 
raw text into tokens, which could further be translated into a generic command 
tree (basically to add records into a DB) that would be functionally equivalent 
whether it originally came from ANSI 830 or UN/EDIFACT.

It seems to me that ANTLR would have been a good tool to use to do this 
translation. I'd rather not be forced to do the entire thing by hand just 
because of this token separator issue.

Is there a way I could perform the token splitting manually (as you suggest), 
but then feed the resulting tokens into an ANTLR-generated parser to do the 
rest of the work?

Thanks,

Rob

Date: Sun, 7 Jun 2009 15:02:09 -0700
From: jsrs...@yahoo.com
Subject: RE: [antlr-interest] Customizing token separators without recompiling
To: antlr-inter...@antlr.org; dukie_bander...@hotmail.com

Oh, I'm saying you wouldn't want to use a grammar at all.  The problem you've 
described is lexical, not grammatical.  If you simply want to break apart a 
line of text based on an arbitrary delimiter, it would be much easier to write 
a program in Perl, Python, Java, etc. that split the text based on a 
configuration setting.

If further parsing needs to happen on the newly-split fields, then you can 
attack that problem piecemeal on an individual basis.

Make sense?





[il-antlr-interest: 24094] Re: [antlr-interest] Customizing token separators without recompiling

2009-06-07 Thread Dukie Banderjee


Thanks, Steve, that looks very promising!

Rob


> Date: Mon, 8 Jun 2009 01:14:19 +0100
> Subject: Re: [antlr-interest] Customizing token separators without recompiling
> From: st...@stevecooper.org
> To: dukie_bander...@hotmail.com
> CC: jsrs...@yahoo.com; antlr-inter...@antlr.org
>
> I don't know if this is any closer, but I had this idea.
>
> Your problem seems to be in getting a lexer which will give you the
> right stream of tokens, and not in writing the parser that feeds off
> them. You could write your own lexer to do the splitting of the
> strings, and use ANTLR to write the parser. ANTLR parsers don't feed
> directly off a string, but off an ITokenSource object;
>
> public interface ITokenSource
> {
> string SourceName { get; }
> IToken NextToken();
> }
>
> You could create your own token source which would do the separation
> by hand, and return a stream of tokens. Something like
>
> public class UnEdifactLexer : ITokenSource
> {
>     // token types (these must match the types the generated parser expects)
>     public const int EOF = -1;
>     public const int ID = 0;
>     public const int NUMBER = 1;
>     public const int COLON = 2;
>     ...
>
>     // all the tokens in the input
>     private Queue<IToken> tokens;
>
>     public UnEdifactLexer(string input, char userSeparator)
>     {
>         this.tokens = new Queue<IToken>();
>         foreach (var line in input.Split('\n'))
>         {
>             // split the current line on the user-supplied separator
>             foreach (var piece in CustomSplit(line, userSeparator))
>             {
>                 // custom code to convert a line
>                 // into a set of tokens
>                 tokens.Enqueue(new Token(...));
>             }
>         }
>     }
>
>     public IToken NextToken()
>     {
>         if (tokens.Count > 0)
>             return tokens.Dequeue();
>         else
>             return new Token(EOF, ...);
>     }
> }
>
> Then you write a parser grammar in ANTLR which does the parsing and
> tree-building.
>
> Anyway, the benefit of this approach is that you have full power over
> splitting up the strings and converting them into tokens. After that,
> the parser takes up the strain.
>
> Steve
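
The shape of Steve's suggestion can be tried out without the ANTLR runtime. The following Python sketch is only an illustration of the idea (a hand-written token source with a runtime separator); the class name `SplittingTokenSource`, the `Token` tuple, and the token type numbers are all made up for this example and are not part of any ANTLR API.

```python
from collections import deque, namedtuple

# Hypothetical token type numbers; with a real ANTLR parser they would
# have to match the parser's generated token vocabulary.
EOF, TEXT, SEP = -1, 4, 5

Token = namedtuple("Token", ["type", "text"])


class SplittingTokenSource:
    """Hand-written token source: splits each line on a runtime separator."""

    def __init__(self, text, separator):
        self.tokens = deque()
        for line in text.splitlines():
            for index, piece in enumerate(line.split(separator)):
                if index > 0:
                    # One SEP token between every pair of adjacent pieces.
                    self.tokens.append(Token(SEP, separator))
                if piece:
                    self.tokens.append(Token(TEXT, piece))

    def next_token(self):
        """Return the next queued token, or an EOF token once drained."""
        if self.tokens:
            return self.tokens.popleft()
        return Token(EOF, "<EOF>")
```

A parser would then pull tokens by calling `next_token()` repeatedly, and the separator character never has to appear in the grammar.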




[il-antlr-interest: 24108] Re: [antlr-interest] Customizing token separators without recompiling

2009-06-08 Thread Dukie Banderjee


Thanks Jim, that looks more like what I originally had in mind.

Rob


> From: j...@temporal-wave.com
> To: dukie_bander...@hotmail.com
> Subject: Re: [antlr-interest] Customizing token separators without recompiling
> Date: Sun, 7 Jun 2009 18:52:06 -0700
> CC: jsrs...@yahoo.com; antlr-inter...@antlr.org
>
> Hi,
>
> If the entire structure is just these lines then it is likely that a
> parser is overkill, to be honest. However, you can create a lexer rule
> that changes its definition at runtime, but you must be careful that
> the set of delimiters would never otherwise appear in the input.
>
> What you do is add a member method to the lexer that accepts
> the delimiter then use a gated predicate to select the token:
>
> @lexer::members {
>     protected int delim;
>     public void setDelim(int d) {
>         delim = d;
>     }
> }
>
> DELIM : {input.LA(1) == delim}?=> . ;
>
> But note that by using this rule, you will always get DELIM for that
> character and so if you had:
>
> SEMI : ';' ;
>
> But set the delimiter to ';' then you would no longer get SEMI.
>
> Perhaps it would be best to write a custom lexer.
>
> EDI is another good idea screwed up by design by committee where none
> of the members will give up their proprietary formats :(
>
> Jim
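
The shadowing effect Jim warns about (a runtime delimiter taking over a character that another rule would otherwise match) can be seen in a toy lexer. This Python sketch is an assumption-laden stand-in, not ANTLR output: `RuntimeDelimLexer` and its three token kinds are invented for illustration, and the delimiter check is simply placed first to mimic the gated predicate.

```python
class RuntimeDelimLexer:
    """Toy lexer whose DELIM character is chosen at run time."""

    def __init__(self, text, delim):
        self.text = text
        self.delim = delim  # plays the role of setDelim(...)

    def tokens(self):
        result = []
        pos = 0
        while pos < len(self.text):
            ch = self.text[pos]
            if ch == self.delim:
                # Checked first, like the gated predicate on DELIM.
                result.append(("DELIM", ch))
                pos += 1
            elif ch == ";":
                result.append(("SEMI", ch))
                pos += 1
            else:
                # Everything up to the next DELIM or ';' is one TEXT token.
                start = pos
                while pos < len(self.text) and self.text[pos] not in (self.delim, ";"):
                    pos += 1
                result.append(("TEXT", self.text[start:pos]))
        return result
```

With `delim="*"` the input `a*b;c` still produces a SEMI token, but with `delim=";"` every ';' comes out as DELIM and SEMI is never produced.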



[il-antlr-interest: 24111] [antlr-interest] Bug in AntlrWorks debugger

2009-06-08 Thread Dukie Banderjee


Hi,

Is this the right place to post AntlrWorks bugs? I looked around but didn't 
find any other place.

It seems that AntlrWorks does not accept Tab characters (or backslashes, for 
that matter) in the Text field of the Input Text dialog box when you press the 
Debug button. 

The result was that the grammar was compiled, and prepared, but when it came to 
connecting to the debugger it timed out. Removing the Tab character from the 
input allowed the debugger to connect.

Originally, I thought for sure it was something to do with Windows Firewall, or 
my Java configuration, so I reinstalled Java and reset the Firewall exception 
for java.exe and javaw.exe. No change. It's definitely the Tab character in the 
input text.

I thought this was very strange, so I tried to use \t instead of a Tab 
character, however this also caused the debugger connection to time out. I also 
tried \\t and t, just in case it was some sort of escaping issue, but none 
of these worked either. Only when I remove Tabs and Backslashes does the 
debugger connect, and then debugging works fine.

I'm using Windows, and I wonder if this might have something to do with it. But 
still it seems strange. How is the input text transmitted to the debugger? Via 
command-line? That seems to me the only way a Tab or Backslash could interfere 
with connecting to the debugger.

Anyone else notice this?

Rob




[il-antlr-interest: 24125] Re: [antlr-interest] Bug in AntlrWorks debugger

2009-06-08 Thread Dukie Banderjee


> That could be. It's possible that the tab character means move to the
> next field
> Ter

No, it was definitely a Tab char in the input box. However, I discovered what 
the real underlying problem was.

It wasn't actually the Tab char that was the problem, but that my grammar 
didn't recognize Tab, and so it was causing an infinite loop (I mistakenly used 
a * quantifier in one of my lexer rules instead of a +). This in turn caused 
AntlrWorks to think it was failing to connect, when actually the real problem 
was that the lexer process was stuck in a loop.

Does the debugger perform a complete parse *before* attempting to connect to 
the process? That would probably explain it. I was expecting the debugger to 
first connect, and *then* attempt to parse. In that way, I could have debugged 
the infinite looping. However, all I saw was a mysterious "Failed to connect" 
message which sent me on a wild goose chase.

FYI here's a sample grammar and input that would cause this usability bug to 
occur:

grammar Bug;

file   : line+ EOF ;
line   : field+ terminator ;
field  : SEP TEXT? ;
terminator : (NEWLINE | EOF) ;

NEWLINE: '\r'? '\n' ;
// NOTE: TEXT does not match Tab, and also has a * quantifier
TEXT   : ('a'..'z'|'A'..'Z'|' ')* ;
SEP: '*' ;

Input:
*a*a[Tab]a**

Note: Replace '[Tab]' with an actual Tab char. (In fact, even the '[' char will 
cause the bug to occur.)
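
The difference between the two quantifiers can be shown outside ANTLR. The sketch below uses Python's `re` module as a stand-in for the TEXT rule (an assumption made purely for illustration; ANTLR does not use Python regexes) and the helper name `consumed_at` is made up.

```python
import re

TEXT_STAR = re.compile(r"[A-Za-z ]*")  # like TEXT with '*': may match nothing
TEXT_PLUS = re.compile(r"[A-Za-z ]+")  # like TEXT with '+': needs one char


def consumed_at(pattern, text, pos):
    """Return how many characters the rule consumes at pos, or None if it fails."""
    match = pattern.match(text, pos)
    return None if match is None else match.end() - pos


sample = "a\ta"  # 'a', Tab, 'a'
tab_pos = 1

# The '*' version "succeeds" on the Tab while consuming nothing, so a lexer
# that keeps asking for the next token would never move past it.
print(consumed_at(TEXT_STAR, sample, tab_pos))  # 0
# The '+' version fails outright, which lets the lexer report an error.
print(consumed_at(TEXT_PLUS, sample, tab_pos))  # None
```

A rule that can succeed on zero characters gives the lexer no reason to advance, which matches the stuck process described above.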

Rob


> CC: antlr-inter...@antlr.org
> From: pa...@cs.usfca.edu
> To: dukie_bander...@hotmail.com
> Subject: Re: [antlr-interest] Bug in AntlrWorks debugger
> Date: Mon, 8 Jun 2009 13:22:13 -0700
>
>
> That could be. It's possible that the tab character means move to the
> next field
> Ter




[il-antlr-interest: 24185] [antlr-interest] Keywords vs. freeform text

2009-06-12 Thread Dukie Banderjee

Hi, 
Hope this isn't too much of a newbie question.

I need to parse a format (EDI) which is basically delimited fields, but some 
fields must contain standardized code values whereas other fields can contain 
freeform text.

My question is related to lexing and/or parsing. Do I need to/want to have a 
lexer token for each possible code, or should I just accept a freeform TEXT 
token, and then later parse the actual text to determine if it's a valid code?

I currently have a grammar which handles *some* of the more important codes by 
specifying lexer tokens. E.g.:

ST: 'ST' ;
BFR: 'BFR' ;
N1: 'N1' ;
REF: 'REF' ;
etc...

And a freeform TEXT token:
TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+ ;

Then I use a parser rule for those possible fields where *any* text is allowed, 
even a code:
fieldText: TEXT | code ;

code: ST
| BFR
| N1
| REF
etc...

This seems to be working okay for now, but I foresee problems as I'm trying to 
expand the grammar to work with all the various codes defined by the EDI 
standards. For example, some of the codes contain solely numeric characters, 
such as '09' or '01' or '12'. Later, I want to add checking for freeform 
numeric fields, such as those which might contain quantities or arbitrary 
integers. I think it will start to get ugly if I try to specify lexer tokens 
and parser rules like this:

CODE_09: '09' ;

NUMERIC: ('0'..'9')+ ;

numericField: NUMERIC | numericCode ;

numericCode: CODE_09 | CODE_01 ... etc.

The core issue is that I need to *sometimes* treat certain fixed sequences of 
characters (e.g. 'ST' or '09') as special, and sometimes as merely freeform 
text or numeric values.

I'm fairly new to ANTLR (and parsing/lexing), so I'm not really sure what's a 
good way to resolve this. Any tips/pointers?

Example input:
ISA*00**00**01*812520286  // Here '01' is a special code which determines the type/format of the following field
SE*01*1052   // Here '01' is simply a numeric value which should be interpreted as an integer.
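
One way to picture the "lex everything as text, decide later" option is to interpret each field by its position within the segment. The Python sketch below is only an illustration of that idea: the `SCHEMA` table, its field positions, and its list of allowed codes are hypothetical and not taken from the EDI standard.

```python
# Hypothetical schema: for each segment tag, which field positions hold a
# code from a fixed list and which hold a plain integer. The positions and
# code values below are illustrative, not taken from the EDI standard.
SCHEMA = {
    "ISA": {5: ("code", {"01", "09", "12"})},
    "SE": {1: ("int", None)},
}


def interpret(line, separator="*"):
    """Split a segment, then interpret each field by its position."""
    fields = line.split(separator)
    rules = SCHEMA.get(fields[0], {})
    result = []
    for index, value in enumerate(fields):
        kind, allowed = rules.get(index, ("text", None))
        if kind == "code":
            if value not in allowed:
                raise ValueError(f"{fields[0]}[{index}]: unknown code {value!r}")
            result.append(("code", value))
        elif kind == "int":
            result.append(("int", int(value)))
        else:
            result.append(("text", value))
    return result
```

Here the same characters '01' come out as a code in the ISA segment and as the integer 1 in the SE segment, without the lexer needing a token per code.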

As a side topic, how do I write a lexer which properly handles both:
a) freeform alphanumeric (and spaces) input such as 
('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+
b) freeform numeric input such as ('0'..'9')+
Is this doomed to be ambiguous? Should it be handled by the parser? Is there a 
way to handle it in the lexer?
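
Since every numeric run is also a valid alphanumeric run, one possible way around the overlap is to match a single kind of run and classify it afterwards. The following Python sketch shows that idea outside ANTLR; the function names `classify` and `lex_fields` are invented for this example.

```python
import re


def classify(run):
    """Label a run NUMERIC when it is all ASCII digits, otherwise TEXT."""
    if re.fullmatch(r"[0-9]+", run):
        return "NUMERIC"
    return "TEXT"


def lex_fields(line, separator="*"):
    """Return (type, text) pairs for the non-empty fields of one segment."""
    return [(classify(field), field) for field in line.split(separator) if field]
```

Whether a NUMERIC token is then accepted where TEXT is expected becomes a decision for the parser (or later validation) rather than the lexer.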

Thanks

Rob



[il-antlr-interest: 24223] [antlr-interest] Basic predicate question re: lexer

2009-06-14 Thread Dukie Banderjee

Hi,
I'm working on a parser for a file format that can contain text and delimiters. 
One of the delimiters is a ':', and you can escape the delimiter by following 
it with a '?' such as ':?'.

I'd like to have the lexer consider the ':?' as part of the TEXT token, and a 
lone ':' match the separator token (CSEP). So I tried this:
file: contents+ EOF ;

contents: TEXT
| CSEP
;

CSEP:  ':';
TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|COLON)+ ;
fragment COLON: ':?' ;

This didn't work, as the input "hello:" caused an error. I guess it was 
expecting to continue a TEXT token (next char would be '?'), and met with an 
EOF instead.
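
The intended behaviour (treat ':?' as text, but fall back to a separator when the ':' is not followed by '?') can be sketched with a backtracking regex. This is a Python illustration of the desired token stream, not an ANTLR solution; the `TOKEN` pattern and `tokenize` function are assumptions made for the example.

```python
import re

# TEXT: one or more ordinary characters or the two-character sequence ':?'.
# CSEP: a ':' that is not followed by '?'.
TOKEN = re.compile(r"(?P<TEXT>(?:[A-Za-z0-9 \-,./]|:\?)+)|(?P<CSEP>:)")


def tokenize(text):
    """Return (type, text) pairs; raise on characters no rule accepts."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For the input `hello:` this yields a TEXT token followed by a CSEP token, because the regex engine gives the ':' back when no '?' follows it. That kind of lookahead is what the lexer rule above would need to express.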

I imagine there's a way to do this, perhaps with predicates, but I'm not 
experienced enough to see the obvious solution. Can anyone help?

Thanks,

Rob



[il-antlr-interest: 24224] [antlr-interest] CSharp2 code generation bug for ANTLRWorks 1.2.3 with -debug

2009-06-14 Thread Dukie Banderjee

The following grammar produces uncompilable code when generated from ANTLRWorks 
using -debug in the ANTLR options. I'm not sure which version of ANTLR is being 
used by ANTLRWorks. If it matters, I have ANTLR 3.1.3 on my machine.



grammar EdifactDelfor;

options {
  language = 'CSharp2' ;
}

tokens {
}

file: contents+ EOF ;

contents: TEXT
| SEP
| WS
| CSEP
| EOL
;
EOL: '\'';
SEP: '+';
CSEP: ':';
TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|'?')+ ;
WS: ('\r'? '\n')+  ;



Here is the culprit code that was generated (in the file() method):
default:
if (cnt1 >= 1) goto loop1;
EarlyExitException eee1 =
new EarlyExitException(1, input);
dbg.RecognitionException(eee);   // Note the missing '1': should be eee1

throw eee1;

When I manually change the reference to eee1, the thing compiles.

This bug does not appear when -debug is turned off.

Rob



[il-antlr-interest: 24225] [antlr-interest] Matching empty string

2009-06-14 Thread Dukie Banderjee

Hi,
My grammar needs to handle the following situation: A line can have multiple 
fields, separated by a delimiter. A field can have multiple components, 
separated by another delimiter.
If a field or component is blank, it should be counted as a blank field or 
blank component. For example with field delimiter '+' and component delimiter 
':' :
UNB++::+123
is a 'UNB' line with 3 fields. The first field is blank, the second field has 3 
blank components, and the last field has a single component with the value 
'123'.

Here is my grammar so far:

line : TEXT fields ;
fields: field* terminator! ;
field: SEP t=fieldText? -> ^(FIELD $t?) ;
fieldText: comp (CSEP comp)* ;
comp: t=TEXT -> ^(COMP $t) ;

When a field is blank, e.g. '++', this correctly generates a ^(FIELD) with no 
children. However, I don't know how to get similar behaviour for the 
components, because whereas a field starts with a SEP and optional TEXT, the 
component may or may not have a leading CSEP. 

When the input is '+::+', there are three components, but the first is entirely 
blank, an empty string. 

What I would like is that '+::+' creates ^(FIELD ^(COMP) ^(COMP) ^(COMP)). How 
can I accomplish this?
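
To pin down the target structure, here is the same rule written as plain string splitting. It is a minimal Python sketch of the desired result only (the function name `parse_segment` is made up), and it does not show how to express this in the grammar.

```python
def parse_segment(line, field_sep="+", comp_sep=":"):
    """Return (tag, fields), where each field is a list of component strings.

    A completely blank field becomes an empty list, mirroring ^(FIELD) with
    no children; otherwise every component is kept, blank ones included.
    """
    tag, *raw_fields = line.split(field_sep)
    fields = [raw.split(comp_sep) if raw else [] for raw in raw_fields]
    return tag, fields
```

For `UNB++::+123` this gives the tag `UNB` and three fields: an empty one, one with three blank components, and one with the single component `123`.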

Thanks,

Rob



[il-antlr-interest: 24246] [antlr-interest] CSharp2 -debug generation bug

2009-06-15 Thread Dukie Banderjee

The following grammar compiles fine under ANTLR 3.1.3 except if you use the 
-debug option, in which case it throws an exception during generation. 
Exception trace follows.

The culprit line is:
message: unhSegment bgmSegment segment+ linLoop untSegment -> ^(MESSAGE 
unhSegment bgmSegment segment+ linLoop untSegment) ;

On this line, I added linLoop on both sides. linLoop in turn references 
segment+, which I suspect might be the problem. However, this grammar appears 
to generate okay when -debug is off. (The grammar is functionally flawed, but 
in any case it should not cause an exception for ANTLR to generate the parser 
from it.)


Thanks,
Rob



grammar Test;

options {
  language = 'CSharp2' ;
  output = AST ;
}

tokens {
INTERCHANGE;
GROUP;
MESSAGE;
LOOP;
SECTION;
SEGMENT;
ELEMENT;
COMPONENT;
}

file: interchange EOF -> interchange;

interchange: unaSegment? unbSegment group unzSegment -> ^(INTERCHANGE 
unaSegment? unbSegment group unzSegment) ;
group: ungSegment message+ uneSegment -> ^(GROUP ungSegment message+ 
uneSegment) ;
message: unhSegment bgmSegment segment+ linLoop untSegment -> ^(MESSAGE 
unhSegment bgmSegment segment+ linLoop untSegment) ;

linLoop: linSection+ -> ^(LOOP linSection+) ;
linSection: linSegment segment+ -> ^(SECTION linSegment segment+ ) ;

bgmSegment: tagBGM elements -> ^(SEGMENT tagBGM elements) ;
linSegment: tagLIN elements -> ^(SEGMENT tagLIN elements) ;

tagBGM: { input.LT(1).Text == "BGM" }? TEXT ;
tagLIN: { input.LT(1).Text == "LIN" }? TEXT ;


unaSegment: tagUNA elements -> ^(SEGMENT tagUNA elements) ;
unbSegment: tagUNB elements -> ^(SEGMENT tagUNB elements) ;
ungSegment: tagUNG elements -> ^(SEGMENT tagUNG elements) ;
unhSegment: tagUNH elements -> ^(SEGMENT tagUNH elements) ;
untSegment: tagUNT elements -> ^(SEGMENT tagUNT elements) ;
uneSegment: tagUNE elements -> ^(SEGMENT tagUNE elements) ;
unzSegment: tagUNZ elements -> ^(SEGMENT tagUNZ elements) ;

tagUNA: { input.LT(1).Text == "UNA" }? TEXT ;
tagUNB: { input.LT(1).Text == "UNB" }? TEXT ;
tagUNG: { input.LT(1).Text == "UNG" }? TEXT ;
tagUNH: { input.LT(1).Text == "UNH" }? TEXT ;
tagUNT: { input.LT(1).Text == "UNT" }? TEXT ;
tagUNE: { input.LT(1).Text == "UNE" }? TEXT ;
tagUNZ: { input.LT(1).Text == "UNZ" }? TEXT ;

segment: tag elements -> ^(SEGMENT tag elements);
//{ Console.WriteLine("Found segment: " + $tag.text); } 
tag:TEXT
;

ignoredLine: unknownDiscriminator! elements! ;

unknownDiscriminator: TEXT;


elements: element* terminator! ;
element: ELEMENT_SEPARATOR t=components? -> ^(ELEMENT $t?) ;
components: comp1 comp2* ;

comp2: COMPONENT_SEPARATOR t=TEXT? -> ^(COMPONENT $t?)
;

comp1: t=TEXT -> ^(COMPONENT $t)
| COMPONENT_SEPARATOR t=TEXT? -> ^(COMPONENT) ^(COMPONENT $t?) ;

terminator: SEGMENT_TERMINATOR ;

//terminator: (EOL | WS)+ ;

SEGMENT_TERMINATOR: '\'';
ELEMENT_SEPARATOR: '+';
COMPONENT_SEPARATOR:  ':';
//TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|'?')+ ;
TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'-'|','|'.'|'/'|ESCAPE)+ ;
fragment ESCAPE : '?' . ;

WS: ('\r'? '\n')+ { $channel = 99; }  ;


error(10):  internal error: <>.g : java.util.NoSuchElementException: no such attribute: description in template context [outputFile parser genericParser(...) cyclicDFA if(dfa.specialStateSTs)_subtemplate anonymous cyclicDFAState cyclicDFAEdge notPredicate evalPredicate(...)]
org.antlr.stringtemplate.StringTemplate.checkNullAttributeAgainstFormalArguments(StringTemplate.java:1276)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:800)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.get(StringTemplate.java:798)
org.antlr.stringtemplate.StringTemplate.getAttribute(StringTemplate.java:682)
org.antlr.stringtemplate.language.ActionEvaluator.attribute(ActionEvaluator.java:360)
org.antlr.stringtemplate.language.ActionEvaluator.expr(ActionEvaluator.java:136)
org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:84)
org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:148)
org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:700)
org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:722)
org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:659)
org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:148)
org.a

[il-antlr-interest: 24287] [antlr-interest] Multi-phase tree rewriting question

2009-06-19 Thread Dukie Banderjee

Hi,

I'm considering how to solve the following problem: The documents I'm parsing 
(EDI) are actually wrapped in 'envelopes'. Similar to XML, the envelopes have a 
beginning line and an ending line. The lines in between are the actual 
documents which are wrapped.

However, different wrapped documents have different formats, so my current idea 
is to have a grammar for each document format, and an 'outer' grammar for the 
enveloping structure.

The envelope grammar knows about its beginning lines and ending lines, and the 
fact that there are lines in between, but it does not parse the contained lines 
into a hierarchical AST; that job is for the specific document parser.

So, the idea is that I first parse a file into an envelope AST which has this 
basic structure:

^(ENVELOPE ...
  ^(GROUP ...
    ^(DOCUMENT type
      ^(LINE ...)  // Straight list of lines
      ^(LINE ...)
      ...
    )
    ...
  )
  ...
)
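
For illustration, the first phase might be sketched as follows, outside ANTLR and in Python for brevity. The Node class and the parse_envelope function are hypothetical names. The sketch assumes the input is already split into segments, that '+' separates elements, and that UNG/UNE and UNH/UNT mark group and document boundaries as they do in Edifact. Lines inside a document are stored as-is, without being interpreted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                      # "ENVELOPE", "GROUP", "DOCUMENT" or "LINE"
    text: str = ""                                 # raw segment text, if any
    children: list = field(default_factory=list)


def parse_envelope(segments):
    """Build ENVELOPE > GROUP > DOCUMENT > LINE; document lines stay a flat list."""
    envelope = Node("ENVELOPE")
    group = None       # GROUP currently open, if any
    document = None    # DOCUMENT currently open, if any
    for seg in segments:
        tag = seg.split("+", 1)[0]
        if tag == "UNG":                           # group header
            group = Node("GROUP", seg)
            envelope.children.append(group)
        elif tag == "UNE":                         # group trailer
            group = None
        elif tag == "UNH":                         # document header
            document = Node("DOCUMENT", seg)
            parent = group if group is not None else envelope
            parent.children.append(document)
        elif tag == "UNT":                         # document trailer
            document = None
        elif document is not None:                 # body line: kept uninterpreted
            document.children.append(Node("LINE", seg))
        # anything else (e.g. UNB/UNZ) is ignored in this sketch
    return envelope
```

The DOCUMENT node keeps the raw UNH segment as its text, so a later phase can read the document type and version from it.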

Then the next phase would be to rewrite the DOCUMENT AST to add some structure 
to it, based on the EDI standards:

^(DOCUMENT type
  ^(LINE ...)  // Straight list of lines
  ^(LINE ...)
  ...
)

becomes:

^(DOCUMENT type
  ^(HEADER_INFO ...)  // More-detailed AST
  ^(ITEMS
    ^(ITEM
      ^(DETAIL ...)   // Hierarchical structure added
      ^(DETAIL ...)
      ...
    )
    ...
  )
  ...
)
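
As a rough sketch of that second phase, independent of ANTLR's own tree API, the regrouping could be done in place on a simple node type. Node and add_structure are hypothetical names, and the rule that a "LIN" line opens a new ITEM is only an assumption for the example; real documents would need the rules of the relevant standard. Existing LINE nodes are reused and relabelled rather than copied.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def add_structure(document, item_tag="LIN"):
    """Regroup a DOCUMENT's flat LINE children in place and return the same node."""
    header = Node("HEADER_INFO")
    items = Node("ITEMS")
    current = None                                 # ITEM being filled, if any
    for line in document.children:
        tag = line.text.split("+", 1)[0]
        if tag == item_tag:                        # assumed: this tag opens an ITEM
            current = Node("ITEM", line.text)
            items.children.append(current)
        elif current is None:                      # before the first item: header
            header.children.append(line)
        else:                                      # inside an item: reuse the node
            line.kind = "DETAIL"
            current.children.append(line)
    document.children[:] = [header, items]         # swap the child list in place
    return document
```

Only the new interior nodes (HEADER_INFO, ITEMS, ITEM) are allocated; the DOCUMENT node and its original line nodes keep their identity.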


A major reason to go with this approach, I think, is that there are literally 
hundreds of different types of EDI documents (although I'm only concerned with 
parsing fewer than ten of these types), but the enveloping structure is exactly 
the same, regardless of the contained documents.

Also, there are different versions of the EDI standards, and so I would need 
slightly different parsers for different versions of the same document type. 
For example, an Edifact DELFOR document may be based on the 96a standard, or 
the 07b standard, or whatever, with slightly different (sometimes largely 
different) structures. So, I could have Delfor96aParser, and Delfor07bParser, 
and the information returned by the envelope grammar would allow me to select 
which version of EDI is being used, and therefore which document-parser to 
invoke.
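
The selection step could look roughly like this. It is a sketch only: parse_delfor_96a and parse_delfor_07b are placeholders standing in for generated parsers such as Delfor96aParser and Delfor07bParser, and DOCUMENT_PARSERS and select_parser are hypothetical names. The document type and version are assumed to have been extracted by the envelope phase already.

```python
def parse_delfor_96a(lines):
    # Placeholder for a real 96a document parser.
    return ("DELFOR", "96A", list(lines))


def parse_delfor_07b(lines):
    # Placeholder for a real 07b document parser.
    return ("DELFOR", "07B", list(lines))


# Registry keyed by (document type, standard version).
DOCUMENT_PARSERS = {
    ("DELFOR", "96A"): parse_delfor_96a,
    ("DELFOR", "07B"): parse_delfor_07b,
}


def select_parser(doc_type, version):
    """Return the parser registered for this document type and version."""
    key = (doc_type.upper(), version.upper())
    try:
        return DOCUMENT_PARSERS[key]
    except KeyError:
        raise ValueError(f"no parser registered for {doc_type} {version}") from None
```

Supporting another document type or version then means adding one entry to the registry, with the envelope handling left untouched.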

So, if I had this envelope parser, which generates a document AST that is a 
straight list of lines, how would I then further process this AST to add some 
structure to it, using a document-specific parser?

I was looking at ANTLR's 'rewrite' option, but from what I've read it seems to 
only work with template output. I don't want to output text, I want to rewrite 
AST to AST, and then process the AST later. There is no text output needed.

Would I use a tree-walker? If so, wouldn't this require creating a whole new 
copy of the document-AST? Is this the right approach? Or is an in-place rewrite 
more appropriate? If so, how do I do it?

Any help would be appreciated, even if it's just pointing me to the right place 
to find the info.

Thanks,
Rob
