How to efficiently extract information from structured text file

2010-02-16 Thread Imaginationworks
Hi,

I am trying to read object information from a text file (approx.
30,000 lines) with the format shown below; each line shown corresponds
to a line in the text file.  Currently, I read the whole file into a
list of strings using readlines(), then use a for loop to search for
"= {" and "};" to determine the Object, SubObject, and SubSubObject. My
questions are:

1) Is there an efficient method to search the whole string
list to find the locations of the tokens (such as '= {' or '};')?

2) Is there an efficient way to extract the object information that
you would suggest?

Thanks,

- Jeremy



= Structured text file =
Object1 = {

...

SubObject1 = {


SubSubObject1 = {
...
};
};

SubObject2 = {


SubSubObject21 = {
...
};
};

SubObjectN = {


SubSubObjectN = {
...
};
};
};
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to efficiently extract information from structured text file

2010-02-17 Thread Imaginationworks
On Feb 16, 7:14 pm, Gary Herron wrote:
> Imaginationworks wrote:
> > Hi,
>
> > I am trying to read object information from a text file (approx.
> > 30,000 lines) with the format shown below; each line shown corresponds
> > to a line in the text file.  Currently, I read the whole file into a
> > list of strings using readlines(), then use a for loop to search for
> > "= {" and "};" to determine the Object, SubObject, and SubSubObject. My
> > questions are:
>
> > 1) Is there an efficient method to search the whole string
> > list to find the locations of the tokens (such as '= {' or '};')?
>
> Yes.  Read the *whole* file into a single string using the file.read()
> method, and then search through the string using string methods (for
> simple things) or re, the regular expression module (for more
> complex searches).
>
> Note:  There is a point where a file becomes large enough that reading
> the whole file into memory at once (either as a single string or as a
> list of strings) is foolish.    However, 30,000 lines doesn't push that
> boundary.
>
> > 2) Is there an efficient way to extract the object information that
> > you would suggest?
>
> Again, the re module has nice ways to find a pattern and parse
> out pieces of it.  Building a good regular expression takes time,
> experience, and a bit of black magic...  To do so for this case, we
> might need more knowledge of your format.  Also, regular expressions have
> their limits.  For instance, if the sub objects can nest to any level,
> then regular expressions alone can't solve the whole problem,
> and you'll need a more robust parser.

Gary and Rhodri, thank you for the suggestions.
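
For reference, here is a minimal sketch of the approach Gary describes:
read the file into one string, locate the "= {" and "};" tokens with re,
and keep a stack so that nesting depth is tracked outside the regular
expression.  The names TOKEN and parse_objects are made up for this
illustration, and the sketch assumes that every "Name = {" and "};" in
the file really is an object delimiter (none hidden inside the "..."
content).

```python
import re

# Matches either an opening "Name = {" or a closing "};" token.
TOKEN = re.compile(r"(?P<name>\w+)\s*=\s*\{|\}\s*;")


def parse_objects(text):
    """Return the top-level objects in text as nested dicts.

    Each dict holds the object name, its child objects, and the
    start/end offsets of the object within text.
    """
    root = {"name": None, "children": [], "start": 0, "end": len(text)}
    stack = [root]  # stack[-1] is the object currently being read
    for m in TOKEN.finditer(text):
        if m.group("name"):
            # Opening token: create a node and descend into it.
            node = {"name": m.group("name"), "children": [],
                    "start": m.start(), "end": None}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Closing token: finish the current object and go back up.
            if len(stack) == 1:
                raise ValueError("unmatched '};' at offset %d" % m.start())
            stack.pop()["end"] = m.end()
    if len(stack) != 1:
        raise ValueError("unclosed object %r" % stack[-1]["name"])
    return root["children"]


# Small stand-in for the real file; with a real file one would use
# text = open(filename).read() instead.
sample = """Object1 = {
...
SubObject1 = {
SubSubObject1 = {
...
};
};
SubObject2 = {
};
};
"""

objects = parse_objects(sample)
first = objects[0]
print(first["name"], [child["name"] for child in first["children"]])
```

The start and end offsets let you slice the raw text of any object out
of the original string (text[node["start"]:node["end"]]) if you want to
process its "..." contents separately.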


Re: How to efficiently extract information from structured text file

2010-02-17 Thread Imaginationworks
On Feb 17, 1:40 pm, Paul McGuire wrote:
> On Feb 16, 5:48 pm, Imaginationworks wrote:
>
> > Hi,
>
> > I am trying to read object information from a text file (approx.
> > 30,000 lines) with the format shown below; each line shown corresponds
> > to a line in the text file.  Currently, I read the whole file into a
> > list of strings using readlines(), then use a for loop to search for
> > "= {" and "};" to determine the Object, SubObject, and SubSubObject.
>
> If you open(filename).read() this file into a variable named data, the
> following pyparsing parser will pick out your nested brace
> expressions:
>
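> from pyparsing import *
>
> # Punctuation is matched but suppressed from the parsed results.
> EQ,LBRACE,RBRACE,SEMI = map(Suppress,"={};")
> ident = Word(alphas, alphanums)
> # contents is recursive, so declare it before defining it.
> contents = Forward()
> defn = Group(ident + EQ + Group(LBRACE + contents + RBRACE + SEMI))
>
> # A body is any mix of nested definitions and non-brace words.
> contents << ZeroOrMore(defn | ~(LBRACE|RBRACE) + Word(printables))
>
> results = defn.parseString(data)
>
> print(results)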
>
> Prints:
>
> [
>  ['Object1',
>    ['...',
>     ['SubObject1',
>       ['',
>         ['SubSubObject1',
>           ['...']
>         ]
>       ]
>     ],
>     ['SubObject2',
>       ['',
>        ['SubSubObject21',
>          ['...']
>        ]
>       ]
>     ],
>     ['SubObjectN',
>       ['',
>        ['SubSubObjectN',
>          ['...']
>        ]
>       ]
>     ]
>    ]
>  ]
> ]
>
> -- Paul

Wow, that is great! Thanks
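
For anyone wanting to look objects up by name afterwards, here is a
rough sketch of walking a nested list shaped like the output above.  It
assumes the ParseResults has first been converted to plain Python lists
(pyparsing provides asList() for that); the helper names to_tree and
find, and the hand-written parsed list standing in for real parser
output, are made up for this illustration.

```python
def to_tree(defn):
    """Convert one [name, body] pair into a nested dict."""
    name, body = defn
    node = {"name": name, "tokens": [], "children": []}
    for item in body:
        if isinstance(item, str):
            node["tokens"].append(item)  # plain word inside the braces
        else:
            node["children"].append(to_tree(item))  # nested definition
    return node


def find(node, path):
    """Follow a sequence of child names down from node.

    Returns the matching node, or None if any name is missing.
    """
    for name in path:
        node = next((c for c in node["children"] if c["name"] == name), None)
        if node is None:
            return None
    return node


# Hand-written stand-in for the parser output, in the same shape as the
# nested list printed above (shortened to two sub-objects).
parsed = [
    ["Object1",
     ["...",
      ["SubObject1", ["", ["SubSubObject1", ["..."]]]],
      ["SubObject2", ["", ["SubSubObject21", ["..."]]]],
      ]]
]

tree = to_tree(parsed[0])
hit = find(tree, ["SubObject2", "SubSubObject21"])
print(hit["name"], hit["tokens"])
```

Separating the plain tokens from the child objects keeps lookups by
name simple, at the cost of losing the original interleaving order
between the two.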