How to efficiently extract information from structured text file
Hi,

I am trying to read object information from a text file (approx. 30,000 lines) in the format shown below; each line of the sample corresponds to one line in the text file. Currently, I read the whole file into a list of strings using readlines(), then use a for loop to search for "= {" and "};" to determine the Object, SubObject, and SubSubObject. My questions are:

1) Is there an efficient way to search the whole string list to find the locations of the tokens (such as '= {' or '};')?

2) Is there an efficient way to extract the object information that you would suggest?

Thanks,

- Jeremy

= Structured text file =
Object1 = {
    ...
    SubObject1 = {
        SubSubObject1 = {
            ...
        };
    };
    SubObject2 = {
        SubSubObject21 = {
            ...
        };
    };
    SubObjectN = {
        SubSubObjectN = {
            ...
        };
    };
};

-- http://mail.python.org/mailman/listinfo/python-list
Re: How to efficiently extract information from structured text file
On Feb 16, 7:14 pm, Gary Herron wrote:
> Imaginationworks wrote:
> > Hi,
> >
> > I am trying to read object information from a text file (approx.
> > 30,000 lines) with the following format, each line corresponds to a
> > line in the text file. Currently, the whole file was read into a
> > string list using readlines(), then use for loop to search the "= {"
> > and "};" to determine the Object, SubObject,and SubSubObject. My
> > questions are
> >
> > 1) Is there any efficient method that I can search the whole string
> > list to find the location of the tokens(such as '= {' or '};'
>
> Yes. Read the *whole* file into a single string using the file.read()
> method, and then search through the string using string methods (for
> simple things) or use re, the regular expression module (for more
> complex searches).
>
> Note: There is a point where a file becomes large enough that reading
> the whole file into memory at once (either as a single string or as a
> list of strings) is foolish. However, 30,000 lines doesn't push that
> boundary.
>
> > 2) Is there any efficient ways to extract the object information you
> > may suggest?
>
> Again, the re module has nice ways to find a pattern and parse
> out pieces of it. Building a good regular expression takes time,
> experience, and a bit of black magic... To do so for this case, we
> might need more knowledge of your format. Also, regular expressions have
> their limits. For instance, if the sub objects can nest to any level,
> then in fact, regular expressions alone can't solve the whole problem,
> and you'll need a more robust parser.
>
> > Thanks,
> >
> > - Jeremy
> >
> > = Structured text file =
> > Object1 = {
> >     ...
> >     SubObject1 = {
> >         SubSubObject1 = {
> >             ...
> >         };
> >     };
> >     SubObject2 = {
> >         SubSubObject21 = {
> >             ...
> >         };
> >     };
> >     SubObjectN = {
> >         SubSubObjectN = {
> >             ...
> >         };
> >     };
> > };

Gary and Rhodri,

Thank you for the suggestions.
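For readers following along, here is a minimal sketch of the approach Gary describes: hold the text in one string and let the re module locate the "= {" and "};" tokens. The sample text, the TOKEN pattern, and the find_tokens helper are illustrative assumptions, not code posted in the thread; with a real file, data would come from open(filename).read().

```python
import re

# Hypothetical sample in the format described in the thread.
data = """Object1 = {
    ...
    SubObject1 = {
        SubSubObject1 = {
            ...
        };
    };
};
"""

# One pattern for both tokens: "name = {" opens a block, "};" closes it.
# Assumption: block names consist of word characters only.
TOKEN = re.compile(r"(?P<name>\w+)\s*=\s*\{|(?P<close>\};)")

def find_tokens(text):
    """Yield (kind, name, line_number) for each opening or closing token."""
    line_no, last = 1, 0
    for m in TOKEN.finditer(text):
        # Count only the newlines since the previous match, so the
        # whole scan stays a single pass over the text.
        line_no += text.count("\n", last, m.start())
        last = m.start()
        if m.group("name"):
            yield ("open", m.group("name"), line_no)
        else:
            yield ("close", None, line_no)

tokens = list(find_tokens(data))
for kind, name, line_no in tokens:
    print(line_no, kind, name or "")
```

This only locates the tokens. As Gary notes, matching each opening token to its closing token at arbitrary nesting depth needs more than a regular expression alone.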
Re: How to efficiently extract information from structured text file
On Feb 17, 1:40 pm, Paul McGuire wrote:
> On Feb 16, 5:48 pm, Imaginationworks wrote:
> > Hi,
> >
> > I am trying to read object information from a text file (approx.
> > 30,000 lines) with the following format, each line corresponds to a
> > line in the text file. Currently, the whole file was read into a
> > string list using readlines(), then use for loop to search the "= {"
> > and "};" to determine the Object, SubObject,and SubSubObject.
>
> If you open(filename).read() this file into a variable named data, the
> following pyparsing parser will pick out your nested brace
> expressions:
>
> from pyparsing import *
>
> EQ,LBRACE,RBRACE,SEMI = map(Suppress,"={};")
> ident = Word(alphas, alphanums)
> contents = Forward()
> defn = Group(ident + EQ + Group(LBRACE + contents + RBRACE + SEMI))
>
> contents << ZeroOrMore(defn | ~(LBRACE|RBRACE) + Word(printables))
>
> results = defn.parseString(data)
>
> print(results)
>
> Prints:
>
> [
>  ['Object1',
>    ['...',
>      ['SubObject1',
>        ['',
>          ['SubSubObject1',
>            ['...']
>          ]
>        ]
>      ],
>      ['SubObject2',
>        ['',
>          ['SubSubObject21',
>            ['...']
>          ]
>        ]
>      ],
>      ['SubObjectN',
>        ['',
>          ['SubSubObjectN',
>            ['...']
>          ]
>        ]
>      ]
>    ]
>  ]
> ]
>
> -- Paul

Wow, that is great! Thanks
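If adding a third-party dependency is not an option, the same nesting can be recovered with a plain stack in one pass over the lines. The sketch below is a different technique from Paul's pyparsing grammar, not a port of it. It assumes every "name = {" and every "};" sits on its own line and treats all other lines as opaque content. The parse_blocks function and the dictionary layout are hypothetical names chosen for illustration.

```python
def parse_blocks(lines):
    """Build a nested tree from the lines using a stack of open blocks."""
    root = {"name": None, "content": [], "children": []}
    stack = [root]  # stack[-1] is the innermost block still open
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith("= {"):
            # Opening token: attach a new block to the current one, then descend.
            node = {"name": line[:-3].strip(), "content": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line == "};":
            # Closing token: return to the enclosing block.
            if len(stack) == 1:
                raise ValueError("unmatched '};'")
            stack.pop()
        else:
            # Anything else is kept verbatim as content of the current block.
            stack[-1]["content"].append(line)
    if len(stack) != 1:
        raise ValueError("unclosed block: %s" % stack[-1]["name"])
    return root["children"]

# Hypothetical sample in the format described in the thread.
sample = """Object1 = {
    ...
    SubObject1 = {
        SubSubObject1 = {
            ...
        };
    };
    SubObject2 = {
        SubSubObject21 = {
            ...
        };
    };
};
"""

tree = parse_blocks(sample.splitlines())
print(tree[0]["name"], [child["name"] for child in tree[0]["children"]])
```

Because the stack depth follows the nesting depth, this handles any level of nesting, which is the case a regular expression alone cannot cover. It works equally well on the list returned by readlines() or on an open file object iterated line by line.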