Re: Elementary string-parsing

Steve Holden Tue, 05 Feb 2008 05:22:49 -0800

Dennis Lee Bieber wrote:
> On Tue, 05 Feb 2008 04:03:04 GMT, Odysseus
> <[EMAIL PROTECTED]> declaimed the following in
> comp.lang.python:
> 
>> Sorry, translation problem: I am acquainted with Python's "for" -- if 
>> far from fluent with it, so to speak -- but the PS operator that's most 
>> similar (traversing a compound object, element by element, without any 
>> explicit indexing or counting) is called "forall". PS's "for" loop is 
>> similar to BASIC's (and ISTR Fortran's):
>>
>> start_value increment end_value {procedure} for
>>
>> I don't know the proper generic term -- "indexed loop"? -- but at any 
>> rate it provides a counter, unlike Python's command of the same name.
>>
>       The convention in Python is to use range() (or xrange()) to
> generate a sequence of "index" values for the for statement to loop
> over:
> 
>       for i in range([start], end, [step]):
> 
> with the caveat that "end" will not be one of the values, start defaults
> to 0, so if you supply range(4) the values become 0, 1, 2, 3 [ie, 4
> values starting at 0].
>  
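A short sketch of the call forms described above. It is written to run on Python 3 as well, where range() is lazy, hence the list() calls; the variable names are purely illustrative:

```python
# range(end): start defaults to 0, and end itself is never produced
first_four = list(range(4))          # [0, 1, 2, 3]

# range(start, end): counts from start up to, but not including, end
two_to_five = list(range(2, 6))      # [2, 3, 4, 5]

# range(start, end, step): a negative step counts downwards
evens = list(range(0, 10, 2))        # [0, 2, 4, 6, 8]
countdown = list(range(5, 0, -1))    # [5, 4, 3, 2, 1]

# The usual "indexed loop": i takes the values 1, 2, 3
total = 0
for i in range(1, 4):
    total += i
```
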
If you have a sequence of values s and you want to associate each with 
its index value as you loop over the sequence, the easiest way to do this 
is to use the enumerate built-in function:


 >>> for x in enumerate(['this', 'is', 'a', 'list']):
...   print x
...
(0, 'this')
(1, 'is')
(2, 'a')
(3, 'list')

It's usually (though not always) much more convenient to bind the index 
and the value to separate names, as in

 >>> for i, v in enumerate(['this', 'is', 'a', 'list']):
...   print i, v
...
0 this
1 is
2 a
3 list

[...]
>       The whole idea behind the SGML parser is that YOU add methods to
> handle each tag type you need... Also, FYI, there IS an HTML parser (in
> module htmllib) that is already derived from sgmllib.
> 
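> from sgmllib import SGMLParser
>
> class PageParser(SGMLParser):
>       def __init__(self):
>               #need to call the parent __init__, and then
>               #initialize any needed attributes -- like someplace to collect
>               #the parsed out cell data
>               SGMLParser.__init__(self)
>               #state flags tested by the handlers below
>               self.inTable = self.inRow = self.inCell = False
>               self.cellCount = 0
>               self.row = {}
>               self.all_data = []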
> 
>       def start_table(self, attrs):
>               self.inTable = True
>               .....
> 
>       def end_table(self):
>               self.inTable = False
>               .....
> 
>       def start_tr(self, attrs):
>               if self.inRow:
>                       #unclosed row!
>                       self.end_tr()
>               self.inRow = True
>               self.cellCount = 0
>               ...
> 
>       def end_tr(self):
>               self.inRow = False
>               # add/append collected row data to master stuff
>               self.all_data.append(self.row)
>               # start a fresh dict so successive rows don't share one object
>               self.row = {}
>               ...
> 
>       def start_td(self, attrs):
>               if self.inCell:
>                       self.end_td()
>               self.inCell = True
>               ...
> 
>       def end_td(self):
>               self.inCell = False
>               self.cellCount = self.cellCount + 1
>               ...
> 
>       def handle_data(self, text):
>               if self.inTable and self.inRow and self.inCell:
>                       if self.cellCount == 0:
>                               #first column stuff
>                               self.row["Epoch1"] = convert_if_needed(text)
>                       elif self.cellCount == 1:
>                               #second column stuff
>               ...
> 
> 
>       Hope you don't have nested tables -- it could get ugly as this style
> of parser requires the start_tag()/end_tag() methods to set instance
> attributes for the purpose of tracking state needed in later methods
> (notice the complexity of the handle_data() method just to ensure that
> the text is from a table cell, and not some random text).
> 
There is, of course, nothing to stop you building a recursive data 
structure, so that encountering a new opening tag such as <table> adds 
another level to some stack-like object, and the corresponding closing 
tag pops it off again, but this *does* add to the complexity somewhat.

It seems natural that more complex input possibilities lead to more 
complex parsers.
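To make the stack idea concrete, here is a minimal sketch rather than a drop-in replacement for the parser quoted above. It assumes Python 3, where sgmllib no longer exists, so it uses html.parser.HTMLParser instead; that class dispatches every tag through handle_starttag()/handle_endtag() rather than per-tag start_xxx() methods. The class name NestedTableParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class NestedTableParser(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []    # one entry per currently open <table>, innermost last
        self.tables = []   # finished tables, in the order they were closed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # push a new level; its state is independent of the enclosing table
            self.stack.append({"rows": [], "in_cell": False})
            return
        if not self.stack:
            return
        top = self.stack[-1]
        if tag == "tr":
            top["rows"].append([])
            top["in_cell"] = False
        elif tag == "td" and top["rows"]:
            top["rows"][-1].append("")
            top["in_cell"] = True

    def handle_endtag(self, tag):
        if not self.stack:
            return
        if tag == "table":
            # pop the finished level; the enclosing table's state is back on top
            self.tables.append(self.stack.pop()["rows"])
        elif tag == "td":
            self.stack[-1]["in_cell"] = False

    def handle_data(self, text):
        # only text inside a cell of the innermost open table is kept
        if self.stack and self.stack[-1]["in_cell"]:
            self.stack[-1]["rows"][-1][-1] += text.strip()


# Hypothetical input: a table nested inside the first cell of another table
sample = ("<table><tr><td>outer<table><tr><td>inner</td></tr></table></td>"
          "<td>b</td></tr></table>")
parser = NestedTableParser()
parser.feed(sample)
parser.close()
parsed = parser.tables
```

Because the per-table flags live on the stack rather than on the parser instance, closing the inner table automatically restores the outer table's "in a cell" state, which is the extra bookkeeping the flat inTable/inRow/inCell attributes cannot express.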

>       And somewhere before you close the parser, get a handle on the
> collected data...
> 
> 
>       parsed_data = parser.all_data
>       parser.close()
>       return parsed_data
> 
> 
>> Why wouldn't one use a dictionary for that?
>>
>       The overhead may not be needed... Tuples can also be used as the
> keys /in/ a dictionary.
>  
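As a small illustration of tuples used as dictionary keys (the (row, column) layout and the stored values are made up for the example):

```python
# Hypothetical layout: cell values keyed by (row, column) position
cells = {}
cells[(0, 0)] = "first"
cells[(0, 1)] = "second"
cells[(1, 0)] = "third"

# Look a value up by its tuple key
value = cells[(1, 0)]

# The parentheses are optional inside a subscript
same_value = cells[1, 0]

# A list cannot be used this way: it is mutable and therefore unhashable
try:
    cells[[1, 0]] = "oops"
    list_key_rejected = False
except TypeError:
    list_key_rejected = True
```
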
regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/

-- 
http://mail.python.org/mailman/listinfo/python-list
