Yes, yes :-). I was using awk to do all of this. It does work, but I find myself repeatedly re-reading the same data because awk does not support complex data structures. Plus the code is getting ugly.
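
One way to avoid re-reading the data might be a single pass in Python that fills several containers from the standard collections module at once. The following is only a minimal sketch under stated assumptions: lines are whitespace-separated with user, host and time as the first three fields (as in the loop from the original post), and the time field looks like HH:MM:SS. The function name collect_stats and the per-hour bucketing are hypothetical choices for illustration, not something known about the real log format.

```python
from collections import Counter, defaultdict


def collect_stats(lines):
    """Gather several statistics in one pass over an iterable of log lines.

    Assumed (hypothetical) layout: user, host and time are the first three
    whitespace-separated fields, and time looks like HH:MM:SS.
    """
    visits = Counter()                 # user -> number of matching lines
    hosts_by_user = defaultdict(set)   # user -> hosts that user came from
    users_by_hour = defaultdict(set)   # "HH" -> users seen during that hour

    for row in lines:
        if "INFO" not in row:
            continue
        fields = row.split()
        if len(fields) < 3:
            continue  # skip lines that lack the assumed three fields
        user, host, timestamp = fields[:3]
        visits[user] += 1
        hosts_by_user[user].add(host)
        users_by_hour[timestamp.split(":")[0]].add(user)

    # Hour with the most distinct users, or None if no line matched.
    busiest_hour = max(
        users_by_hour,
        key=lambda hour: len(users_by_hour[hour]),
        default=None,
    )
    return {
        "unique_users": len(visits),
        "visits": visits,
        "hosts_by_user": hosts_by_user,
        "busiest_hour": busiest_hour,
    }
```

Passing an open file object, e.g. collect_stats(open("big.log")) with a hypothetical file name, iterates line by line, so the 4GB file itself is never held in memory; only the per-user and per-hour containers grow.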
I was told about Orange (http://orange.biolab.si/). Does anyone have experience with it?

On Sat, Feb 26, 2011 at 10:53 AM, Martin Gregorie <martin@address-in-sig.invalid> wrote:

> On Sat, 26 Feb 2011 16:29:54 +0100, Andrea Crotti wrote:
>
> > On 26 Feb 2011, at 06:45, Rita wrote:
> >
> >> I have a large text (4GB) which I am parsing.
> >>
> >> I am reading the file to collect stats on certain items.
> >>
> >> My approach has been simple,
> >>
> >> for row in open(file):
> >>     if "INFO" in row:
> >>         line=row.split()
> >>         user=line[0]
> >>         host=line[1]
> >>         __time=line[2]
> >>         ...
> >>
> >> I was wondering if there is a framework or a better algorithm to read
> >> such a large file and collect its stats according to content. Also, are
> >> there any libraries, data structures or functions which can be helpful?
> >> I was told about the 'collections' container. Here are some stats I am
> >> trying to get:
> >>
> >> * Number of unique users
> >> * Break down each user's visits according to time, t0 to t1
> >> * What user came from what host
> >> * What time had the most users?
> >>
> >> (There are about 15 different things I want to query)
> >>
> >> I understand most of these are redundant but it would be nice to have a
> >> framework or even an object-oriented way of doing this instead of
> >> loading it into a database.
> >>
> >>
> >> Any thoughts or ideas?
> >
> > Not an expert, but maybe it might be good to push the data into a
> > database, and then you can tweak the DBMS and write smart queries to get
> > all the statistics you want from it.
> >
> > It might take a while (maybe with regexp splitting is faster) but it's
> > done only once and then you work with DB tools.
> >
> This is the sort of job that is best done with awk.
>
> Awk processes a text file line by line, automatically splitting each line
> into an array of words. It uses regexes to recognise lines and trigger
> actions on them. For example, building a list of visitors: assume there's
> a line containing "username logged on", you could build a list of users
> and count their visits with this statement:
>
> /logged on/ { user[$1] += 1 }
>
> where the regex, /logged on/, triggers the action, in curly brackets, for
> each line it matches. "$1" is a symbol for the first word in the line.
>
>
> --
> martin@   | Martin Gregorie
> gregorie. | Essex, UK
> org       |
> --
> http://mail.python.org/mailman/listinfo/python-list
>

-- 
--- Get your facts first, then you can distort them as you please.--