Do take a look in the fairly recent archives of this list for a big discussion of groupby -- it kind of petered out but there were a couple options on the table.
-CHB On Sun, Feb 10, 2019 at 9:23 PM Kyle Lahnakoski <[email protected]> wrote: > > On 2019-02-10 18:30, Steven D'Aprano wrote: > > > > Can you post a simplified example of how you would do it in SQL, > > compared to what you would have to do in standard Python? > > Can I do the same in standard Python? If I did, then I would use Pandas: > it has groupby, and some primitive joining, and window functions may > come naturally because of its imperative nature, but I have not tried > it. If I can not use Pandas, then I would write the groupby and window > functions and call them in sequence. This is similar to what you see in > my code now: a number of properties who's values get dispatched to > Python functions. My code is more complicated only because those > structures can be dispatched to translators for databases too. > > I am certain there are many variations of groupby out in the wild, and > it would be nice to have the concept standardized when/if Python has > vector operations. Join would be nice to have too, but I do not use it > much; dictionary lookup seems to fill that need. Window functions > (which are like mini queries) are powerful, but like Pandas, may end up > end up being free because Python is imperative. > > My code I pointed to has two parts. Here is the first part in SQL (well, > an approximation of SQL since I did not test this, and now I am rusty). > A detailed description is below > > | WITH time_range AS ( > | SELECT > | num > | FROM > | all_integers > | WHERE > | num % 60 =0 AND > | num >= floor(<<now>>/60/60)*60*60-<<start_Of_history>> AND > | num < floor(<<now>>/60/60) + 60*60 > | ) > | SELECT > | availability_zone, > | instance_type, > | time_range.num AS time > | MAX(price) as PRICE, > | COUNT(1) AS `COUNT`, > | LAST(current_price) OVER ( > | PARTITION BY > | availability_zone, > | instance_type > | ORDER BY > | timestamp > | ) AS current_price > | FROM > | ( > | SELECT > | *, > | COALESCE(LAG(timestampvalue, 1), <<end_of_day>>) OVER ( > | PARTITION BY > | availability_zone, > | instance_type > | ORDER BY > | timestamp > | ) AS expire, > | timestamp-<<expected_uptime>> AS effective > | FROM > | prices > | ) temp > | RIGHT JOIN > | time_range ON time_range.num BETWEEN temp.effective AND temp.expire > | GROUP BY > | availability_zone, > | instance_type, > | time_range.num AS time > | WHERE > | expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>> > > > Now, for the same, with description: > > This WITH clause is not real SQL; it is meant to stand in for a > temporary table that contains all hours of the time range I am > interested. Definitely easier to do in Python. All time is assumed to be > in seconds since epoch. > > | WITH time_range AS ( > | SELECT > | num > | FROM > | all_integers > | WHERE > | num % 60 =0 AND > | num >= floor(<<now>>/60/60)*60*60-<<start_of_history>> AND > | num < floor(<<now>>/60/60) + 60*60 > | ) > > We will select the three dimensions we are interested in (see GROUP BY > below), along with the MAX price we have seen in the given hour, and the > current_price for any (availability_zone, instance_type) pair. > > | SELECT > | availability_zone, > | instance_type, > | time_range.num AS time > | MAX(price) as PRICE, > | COUNT(1) AS `COUNT`, > | LAST(current_price) OVER ( > | PARTITION BY > | availability_zone, > | instance_type > | ORDER BY > | timestamp > | ) AS current_price > | FROM > > The prices coming from Amazon only have a timestamp for when that price > is effective; so this sub-query adds an `effective` start time, and an > `expire` time so the rest of the query need only deal with ranges. The > timestamp-<<expected_uptime>> is putting the start time back further > into the past so the past can "see" future pricing. > > | ( > | SELECT > | *, > | COALESCE(LAG(timestamp, 1), <<end_of_day>>) OVER ( > | PARTITION BY > | availability_zone, > | instance_type > | ORDER BY > | timestamp > | ) AS expire, > | timestamp-<<expected_uptime>> AS effective > | FROM > | prices > | ) temp > > This is the point where we use the time_range from above and find every > hour a price is effective. This could have been a sub-query, but I am > rusty at SQL > > | RIGHT JOIN > | time_range ON time_range.num BETWEEN temp.effective AND temp.expire > > These are the three dimensions we are interested in > > | GROUP BY > | availability_zone, > | instance_type, > | time_range.num AS time > > and we are only interested in calculating back to a certain point > > | WHERE > | expire > floor(<<now>>/60/60)*60*60 - <<start_Of_history>> > > > > > > > _______________________________________________ > Python-ideas mailing list > [email protected] > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
_______________________________________________ Python-ideas mailing list [email protected] https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
