BTW, 1) is quite achievable in a breeze, as long as you are not selecting from multiple tables (i.e. no db.dogs.name together with db.person.name):

from gluon.storage import Storage
raw_rowset = db.executesql("thequery", as_dict=True)
myrowset = [Storage(row) for row in raw_rowset]
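For instance, a minimal usage sketch (assuming a "dogs" table with id and name columns; the query string is only illustrative):

from gluon.storage import Storage

# as_dict=True makes executesql return plain dicts keyed by the bare
# column names reported by the driver, so attribute access works once
# each dict is wrapped in Storage; with a join, dogs.name and person.name
# would clash on the same "name" key, hence the single-table caveat above.
raw_rowset = db.executesql("SELECT id, name FROM dogs;", as_dict=True)
rows = [Storage(r) for r in raw_rowset]
for row in rows:
    print row.id, row.name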
No time to patch DAL's executesql and test the improvement right now, but I could provide it if needed (a rough sketch follows after the quoted message).

On Saturday, June 23, 2012 12:14:02 AM UTC+2, Niphlod wrote:
>
> @pbreit : if I'm not mistaken, the "parse()" function in DAL handles a lot... basically it reads every value returned from the database into the correct type (blob, integer, double, datetime, ..., list:*, just to name a few), handles virtual fields and prepares the Rows object, making the Storage(). The whole resultset is fetched, "parsed()" line by line, column by column, and returned as a single list. Even if operations were kept to a minimum, building a simple list of 1 million records takes some time (and memory).
> Obviously the first shortcut is to use limitby. The second is selecting only the needed columns (best practice anyway, to limit the amount of wire transfer). No other shortcuts.
>
> @all: Given that a million records are hardly ever displayed, SQLTABLEd, SQLFORM.gridded, pickled, jsoned, etc., I think the "need" here is having a Storage() object instead of a tuple, so values can be accessed as row.id instead of row[0], etc., as in the current executesql.
> I see two roads here (to be considered as 1 AND 2 and also as 1 OR 2), but I don't know the implications of having to deal with SQLFORM.grid, for instance.
> 1) skip the parse() and return a quasi-Rows object "a la executesql", just turning the list of tuples into a list of row objects, skipping virtual fields, references, etc.
> 2) approach the returned values as a generator, and optionally as a forward-only iterator "a la web.py iterbetter()" (https://github.com/webpy/webpy/blob/master/web/db.py#L646), i.e. you can fetch the results and cycle through them only once.
>
> Downsides for method 1):
> - some types could not be converted at all, references would be unavailable (but for a million records you should have done a join anyway), no virtual fields
> - building a list of a million records still requires time
> - for large datasets, having to work on them only "after" the whole list is prepared could take some time
> Downsides for method 2):
> - may work as expected only if the underlying db driver supports it
> - if you need to cycle through the resultset a second time, another query is required, or you have to come up with your own memoization (i.e. prepare an empty list and append the results you're interested in cycling through again)
>
> I used web.py a lot, and even for millions of records it is very fast at returning results.
> NB: the "time cut" here is the one from "the db is ready to return results" to "the queryset is done and prepared for me to work on it".
> The assumption behind 2) is that you rarely need to access a million records randomly - i.e. myrows[31289], myrows[129459] - and what you can do in 2 loops over the same dataset is usually "accomplishable" (does that word even exist? :D) in a single one.
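Just to make the offer concrete, here is a minimal sketch of what option 2) above could build on: a forward-only generator over the raw DB-API cursor, loosely modeled on web.py's iterbetter(). iterrows is a hypothetical helper (not part of DAL), and it assumes the underlying driver supports iterating its cursor:

from gluon.storage import Storage

def iterrows(cursor):
    # Build the field names once from the DB-API cursor description,
    # then yield one Storage() per raw tuple, lazily, so no huge list
    # is ever materialized; the resultset can only be walked once.
    fields = [d[0] for d in cursor.description]
    for raw in cursor:  # relies on the driver exposing cursor iteration
        yield Storage(zip(fields, raw))

# usage sketch (the adapter internals here are an assumption, not a public API):
# db._adapter.execute("SELECT id, name FROM dogs;")
# for row in iterrows(db._adapter.cursor):
#     print row.id, row.name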