RE: regular expression extracting groups

Edwin . Madari Sun, 10 Aug 2008 07:13:38 -0700

If this is *NOT* an exercise in re, and the input is a bunch of lines between '{' 
and '}' where each line is a key=value pair, I would not go near re. Instead, 
simply parse the keys and their lists of values into a dictionary and process 
them from the dictionary, as below. The key 'option' correctly ends up with 2 
entries, 'value' and '7', in the right order. This will work with any input of 
that shape...
 
# assuming variable s has the string......
s = """{
option=value
foo=bar
another=42
option=7
}"""


>>> data = {}
>>> for line in s.splitlines():    # split on lines so values may contain spaces
...     ix = line.find('=')
...     if ix >= 0:                # skips the '{' and '}' lines
...         key = line[:ix]
...         val = line[ix + 1:]
...         # start a new list for an unseen key, then append to it
...         data.setdefault(key, []).append(val)
... 
>>> for k, v in data.items():
...     print('key=%s val=%s' % (k, v))
... 
key=foo val=['bar']
key=option val=['value', '7']
key=another val=['42']

With a second dictionary that maps each key to be processed to a function that 
handles the values for that key, it's just a matter of iterating over the keys.
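
A minimal sketch of that idea, assuming the data dictionary built above. The 
handler functions and the handlers dictionary are made-up names for 
illustration only:

```python
# data as produced by the parsing loop above
data = {'option': ['value', '7'], 'foo': ['bar'], 'another': ['42']}

def handle_option(values):
    # hypothetical handler: join all values seen for 'option'
    return ','.join(values)

def handle_another(values):
    # hypothetical handler: convert each value to an integer
    return [int(v) for v in values]

# maps a key to the function that processes its list of values
handlers = {'option': handle_option, 'another': handle_another}

results = {}
for key, func in handlers.items():
    if key in data:                  # skip keys missing from this message
        results[key] = func(data[key])

print(results)
```

Keys without a handler ('foo' here) are simply left alone.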

hope that simplifies and helps.. 
thx Edwin



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]
On Behalf Of [EMAIL PROTECTED]
Sent: Sunday, August 10, 2008 8:30 AM
To: python-list@python.org
Subject: regular expression extracting groups


Hi list,

I'm trying to use regular expressions to help me quickly extract the
contents of messages that my application will receive. I have worked
out most of the regex but the last section of the message has me
stumped. This is mostly because I want to pull the content out into
regex groups that I can easily access later. I have a regex to extract
the key/value pairs but it ends up with only the contents of the last
key/value pair encountered.

An example of the section of the message that is troubling me appears
like this:

{
option=value
foo=bar
another=42
option=7
}

So it's basically a bunch of lines. Every line is terminated with a
'\n' character. The number of key/value fields changes depending on
the particular message. Also notice that there are two 'option' keys.
This is allowable and I need to cater for it.


A couple of example messages are:
xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=*\n}\nhbeat.basic\n{\ninterval=10\n}\n

xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=vendor-device.instance\n}\nconfig.list\n{\nreconf=newconf\noption=interval\noption=group[16]\noption=filter[16]\n}\n


As all messages follow the same pattern, I'm hoping to develop one
generic regex that can pull a message from a received packet, instead
of one regex for each message kind (there are many).



The regex I came up with looks like this:
import re

# This should match any xPL message

GROUP_MESSAGE_TYPE = 'message_type'
GROUP_HOP = 'hop'
GROUP_SOURCE = 'source'
GROUP_TARGET = 'target'
GROUP_SRC_VENDOR_ID = 'source_vendor_id'
GROUP_SRC_DEVICE_ID = 'source_device_id'
GROUP_SRC_INSTANCE_ID = 'source_instance_id'
GROUP_TGT_VENDOR_ID = 'target_vendor_id'
GROUP_TGT_DEVICE_ID = 'target_device_id'
GROUP_TGT_INSTANCE_ID = 'target_instance_id'
GROUP_IDENTIFIER_TYPE = 'identifier_type'
GROUP_SCHEMA = 'schema'
GROUP_SCHEMA_CLASS = 'schema_class'
GROUP_SCHEMA_TYPE = 'schema_type'
GROUP_OPTION_KEY = 'key'
GROUP_OPTION_VALUE = 'value'


XplMessageGroupsRe = r'''(?P<%s>xpl-(cmnd|stat|trig))\n                 # message type
   \{\n                                                                 #
   hop=(?P<%s>[1-9]{1})\n                                               # hop count
   source=(?P<%s>(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16}))\n  # source identifier
   target=(?P<%s>(\*|(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16})))\n  # target identifier
   \}\n                                                                 #
   (?P<%s>(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,8}))\n               # schema
   \{\n                                                                 #
   (?:(?P<%s>[a-z0-9\-]{1,16})=(?P<%s>[\x20-\x7E]{0,128})\n){1,64}      # key/value pairs
   \}\n''' % (GROUP_MESSAGE_TYPE,
              GROUP_HOP,
              GROUP_SOURCE,
              GROUP_SRC_VENDOR_ID,
              GROUP_SRC_DEVICE_ID,
              GROUP_SRC_INSTANCE_ID,
              GROUP_TARGET,
              GROUP_TGT_VENDOR_ID,
              GROUP_TGT_DEVICE_ID,
              GROUP_TGT_INSTANCE_ID,
              GROUP_SCHEMA,
              GROUP_SCHEMA_CLASS,
              GROUP_SCHEMA_TYPE,
              GROUP_OPTION_KEY,
              GROUP_OPTION_VALUE)

XplMessageGroups = re.compile(XplMessageGroupsRe, re.VERBOSE |
re.DOTALL)


If I pass the second example message through this regex, the 'key'
group ends up containing 'option' and the 'value' group ends up
containing 'filter[16]', which is the last key/value pair in that
message.

So the problem I have lies in the key/value regex extraction section.
It handles multiple occurrences of the pattern and writes the content
into the single key/value group hence I can't extract and access all
fields.
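
A minimal reproduction of this behaviour, using only the key/value part
of the regex above with the group names written out literally (pattern,
body, last_key and last_value are names chosen for this sketch):

```python
import re

# repeated named groups: each repetition overwrites the previous capture
pattern = re.compile(
    r'(?:(?P<key>[a-z0-9\-]{1,16})=(?P<value>[\x20-\x7E]{0,128})\n){1,64}')

# body of the second example message, without the surrounding braces
body = 'reconf=newconf\noption=interval\noption=group[16]\noption=filter[16]\n'

m = pattern.match(body)
last_key = m.group('key')        # only the final repetition survives
last_value = m.group('value')
print(last_key, last_value)
```

The whole body is matched, but the match object exposes a single
key/value pair.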

Is there some other way to do this which allows me to store all the
key/value pairs into the regex match object for later retrieval?
Perhaps using the standard unnamed number groups?
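
Python's re module keeps only the last capture of a repeated group, so
one possible workaround is a two-step approach: first pull out each
'{ ... }' block, then run findall over a block to collect every pair.
This is only a sketch under assumptions: BODY_RE and PAIR_RE are
illustrative names, and the block pattern is a simplification rather
than the full message regex above.

```python
import re

# hypothetical helper patterns, not part of the regex above
BODY_RE = re.compile(r'\{\n((?:[^\n]*\n)*?)\}\n')    # text between '{' and '}'
PAIR_RE = re.compile(r'^([a-z0-9\-]{1,16})=([\x20-\x7E]{0,128})$',
                     re.MULTILINE)                   # one key=value line

msg = ('xpl-stat\n{\nhop=1\nsource=vendor-device.instance\n'
       'target=vendor-device.instance\n}\nconfig.list\n{\nreconf=newconf\n'
       'option=interval\noption=group[16]\noption=filter[16]\n}\n')

blocks = BODY_RE.findall(msg)          # header block first, then schema body
pairs = PAIR_RE.findall(blocks[-1])    # every (key, value) in the last block

for key, value in pairs:
    print('%s -> %s' % (key, value))
```

pairs is a list of tuples in message order, so repeated 'option' keys
are all retained.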

Thanks,
Chris
--
http://mail.python.org/mailman/listinfo/python-list



