Thanks Andrea. I was thinking that too, but I was wondering if there were any other clever ways of doing this. I also thought I could build a filesystem structure keyed on the __time field. So, for January 01, 2011, I would create /tmp/data/20110101/data. This way I have a fast index into the data, and the next time I read through the file I can skip everything from Jan 01, 2011.
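Roughly like this sketch (untested; it assumes the __time field begins with a YYYYMMDD date, which may not match my real format, and /tmp/data is just an example root):

import os

# Rough sketch of the directory-per-day idea. Big assumptions: the
# third whitespace-split field starts with a YYYYMMDD date, and the
# root directory is writable.
def partition_by_day(logfile, root="/tmp/data"):
    handles = {}  # day -> open output file, so each is opened only once
    for row in open(logfile):
        if "INFO" not in row:
            continue
        day = row.split()[2][:8]  # e.g. "20110101"
        if day not in handles:
            daydir = os.path.join(root, day)
            if not os.path.isdir(daydir):
                os.makedirs(daydir)
            handles[day] = open(os.path.join(daydir, "data"), "a")
        handles[day].write(row)
    for fh in handles.values():
        fh.close()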
On Sat, Feb 26, 2011 at 10:29 AM, Andrea Crotti <andrea.crott...@gmail.com> wrote:

> On 26 Feb 2011, at 06:45, Rita wrote:
>
> > I have a large text file (4GB) which I am parsing.
> >
> > I am reading the file to collect stats on certain items.
> >
> > My approach has been simple:
> >
> > for row in open(file):
> >     if "INFO" in row:
> >         line = row.split()
> >         user = line[0]
> >         host = line[1]
> >         __time = line[2]
> >         ...
> >
> > I was wondering if there is a framework or a better algorithm to read
> > such a large file and collect stats according to its content. Also, are
> > there any libraries, data structures or functions which can be helpful?
> > I was told about the 'collections' container. Here are some stats I am
> > trying to get:
> >
> > * Number of unique users
> > * Breakdown of each user's visits according to time, t0 to t1
> > * Which user came from which host
> > * What time had the most users?
> >
> > (There are about 15 different things I want to query.)
> >
> > I understand most of these are redundant, but it would be nice to have
> > a framework or even an object-oriented way of doing this instead of
> > loading it into a database.
> >
> > Any thoughts or ideas?
>
> Not an expert, but maybe it might be good to push the data into a
> database, and then you can tweak the DBMS and write smart queries to
> get all the statistics you want from it.
>
> It might take a while (maybe splitting with a regexp is faster) but
> it's done only once and then you work with DB tools.
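For the counting side, the 'collections' module I was told about does seem to cover most of those questions in one pass. A rough, untested sketch, assuming the same three whitespace-split fields as my snippet above; "big.log" stands in for the real file, and Counter needs Python 2.7+:

from collections import Counter, defaultdict

visits = Counter()        # user -> number of INFO lines (visit count)
hosts = defaultdict(set)  # user -> set of hosts that user came from
when = defaultdict(list)  # user -> list of __time values, for t0-t1 breakdowns
busy = Counter()          # __time value -> number of lines at that time

for row in open("big.log"):
    if "INFO" not in row:
        continue
    fields = row.split()
    user, host, t = fields[0], fields[1], fields[2]
    visits[user] += 1
    hosts[user].add(host)
    when[user].append(t)
    busy[t] += 1

print(len(visits))          # number of unique users
print(busy.most_common(1))  # the time with the most activity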
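And if I do end up going the database route you suggest, the stdlib sqlite3 module looks like the least setup. Another untested sketch; the table and column names are made up, and "big.log" again stands in for the real file:

import sqlite3

conn = sqlite3.connect("/tmp/stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS visits (username TEXT, host TEXT, t TEXT)")

def info_rows(logfile):
    # yield (username, host, time) for every INFO line in the log
    for row in open(logfile):
        if "INFO" in row:
            fields = row.split()
            yield fields[0], fields[1], fields[2]

conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", info_rows("big.log"))
conn.commit()

# each of the ~15 questions then becomes a query, e.g. unique users:
print(conn.execute("SELECT COUNT(DISTINCT username) FROM visits").fetchone()[0])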
--
--- Get your facts first, then you can distort them as you please.

--
http://mail.python.org/mailman/listinfo/python-list