BTW, 1) is achievable in a breeze if you're not selecting from multiple 
tables (i.e. no print db.dogs.name, db.person.name):
from gluon.storage import Storage
# as_dict=True makes executesql return a list of dicts keyed by column name
raw_rowset = db.executesql("thequery", as_dict=True)
# wrapping each dict in Storage lets you read values as row.colname
myrowset = [Storage(row) for row in raw_rowset]

I have no time to patch DAL's executesql and benchmark the improvement 
right now, but if needed I could provide it.
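
A rough sketch of what that patch could boil down to (untested; 
executesql_storage is just a made-up helper name, not a DAL method):

from gluon.storage import Storage

def executesql_storage(db, query, *args, **kwargs):
    # force dict rows so columns are keyed by name, then wrap each in
    # Storage so values can be read as row.colname instead of row[0]
    kwargs['as_dict'] = True
    return [Storage(row) for row in db.executesql(query, *args, **kwargs)]

# usage: rows = executesql_storage(db, "SELECT id, name FROM person")
#        print rows[0].name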

On Saturday, June 23, 2012 12:14:02 AM UTC+2, Niphlod wrote:
>
>
> @pbreit: if I'm not mistaken, the "parse()" function in DAL handles a 
> lot... basically it reads every value returned from the database and 
> casts it to the correct type (blob, integer, double, datetime, ..., 
> list:*, just to name a few), handles virtualfields and prepares the Rows 
> object by building the Storage(). The whole resultset is fetched, 
> "parsed()" line by line, column by column, and returned as a whole list. 
> Even if operations were kept to a minimum, building a simple list of 1 
> million records would take some time (and memory).
> Obviously the first shortcut is to use limitby. The second is to select 
> only the needed columns (a best practice anyway, since it limits the 
> amount of wire transfer). No other shortcuts.
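> Both shortcuts in one select, for reference (db.person is just an example 
> table):
>
> # only two columns, only the first 1000 rows
> rows = db(db.person.id > 0).select(db.person.id, db.person.name,
>                                    limitby=(0, 1000))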
>
> @all: given that a million records are hardly ever displayed, SQLTABLEd, 
> SQLFORM.gridded, pickled, jsoned, etc., I think the "need" here is having 
> a Storage() object instead of a tuple, so values can be accessed as 
> row.id instead of row[0], etc., as with the current executesql.
> I see two roads here (to be considered as 1 AND 2, and also as 1 OR 2), 
> but I don't know the implications of having to deal with SQLFORM.grid, 
> for instance.
> 1) skip parse() and return a quasi-Rows object "a la executesql", just 
> turning the list of tuples into a list of row objects, skipping virtual 
> fields, references, etc.
> 2) approach the returned values as a generator, and optionally as a 
> forward-only iterator "a la web.py iterbetter()" (
> https://github.com/webpy/webpy/blob/master/web/db.py#L646), i.e. you can 
> fetch the results and cycle through them only once (see the sketch right 
> below)
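>
> A rough sketch of what 2) could look like (untested; it assumes a raw 
> DB-API cursor from the adapter, and iterselect_raw is a made-up name):
>
> from gluon.storage import Storage
>
> def iterselect_raw(cursor, sql):
>     # forward-only: each row is yielded as it is fetched, nothing is kept
>     cursor.execute(sql)
>     colnames = [d[0] for d in cursor.description]
>     row = cursor.fetchone()
>     while row is not None:
>         yield Storage(dict(zip(colnames, row)))
>         row = cursor.fetchone()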
>
> Downsides for method 1):
> - some types would not be converted at all, references would be 
> unavailable (but for a million records you should have done a join 
> anyway), no virtualfields
> - building a list of a million records still takes time
> - for large datasets, having to work on them only "after" the whole list 
> is prepared could take some time
> Downsides for method 2):
> - it may work as expected only if the underlying db driver supports it
> - if you need to cycle through the resultset a second time, another 
> query is required, or you have to come up with your own memoization 
> (i.e. prepare an empty list and append the results you're interested in 
> cycling through again, as in the snippet below)
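>
> Something like this (rows_iter being the generator sketched above; 
> do_something and row.status are just placeholders):
>
> interesting = []
> for row in rows_iter:           # the one and only pass on the resultset
>     do_something(row)
>     if row.status == 'active':  # keep only what needs to be revisited
>         interesting.append(row)
> for row in interesting:         # second loop touches only the kept rows
>     do_something_else(row)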
>
> I used web.py a lot, and even for millions of records it is very fast at 
> returning results.
> NB: the "time cut" here is the time that passes from "the db is ready to 
> return results" to "the queryset is done and prepared for me to work on 
> it".
> The assumption in 2) is that you rarely need random access into a million 
> records - i.e. myrows[31289], myrows[129459] - and what you can do in 2 
> loops on the same dataset is usually "accomplishable" (does that word 
> even exist? :D) in a single one.
>
>
