On 19/10/17 20:04, Israel Brewster wrote:
>> then loop through the records, find the schedule for that record (if any,
>> if not move on as mentioned earlier), compare t1 and t2 against the
>> schedule, and increment the appropriate bin counts using a bunch of if
>> statements. Functional, if ugly. But then I got to thinking: I keep
>> hearing about all these fancy numerical analysis tools for python like
>> pandas and numpy - could something like that help? Might there be a way to
>> simply set up a table with "rules" for the columns and rows, and drop my
>> records into the table, having them automatically counted into the proper
>> bins or something? Or am I over thinking this, and the "simple", if ugly
>> approach is best?
The numerical packages can do a lot of things; in this kind of case, mostly
hiding loops in C code. You'll have to be the judge of whether that's
helpful. I'll just try to give you an idea of what you might do.

If you had two pandas/numpy series/arrays of numpy datetimes called
"scheduled" and "actual", you could get all the delays

    import pandas as pd
    import numpy as np

    delays = pd.Series(actual - scheduled)

then index those by the actual time

    delays.index = actual

and select bits of it as you please, e.g.

    from pandas.tseries.offsets import Week

    today = pd.to_datetime('today')
    wtd_delays = delays[today - Week():]   # delays from the last week

You can construct boolean mask arrays for certain conditions

    minute_delta = pd.Timedelta(minutes=1)
    wtd_is_between_1_and_5_min = ((wtd_delays >= 1*minute_delta)
                                  & (wtd_delays < 5*minute_delta))

and either get all the affected data points

    wtd_delays[wtd_is_between_1_and_5_min]

or count them

    np.count_nonzero(wtd_is_between_1_and_5_min)

If you have a larger table in a pandas DataFrame with more information, and
you want to get those rows which fit a particular requirement, you can do
that too. Something like

    some_table = dataframe_conjured_out_of_thin_air()
    delay = some_table['actual_time'] - some_table['proper_civilised_time']
    mtd = today - pd.DateOffset(months=1)
    data_mtd_5min_or_later = some_table[mtd:][delay[mtd:] >= 5*minute_delta]

Or something like that. If you do a lot of this kind of stuff (sifting
through largish datasets), learning how to use pandas might be an excellent
idea, but it will of course involve a fair amount of scrolling through
documentation, googling, and reading Stack Overflow posts. Some pointers:

http://pandas.pydata.org/pandas-docs/stable/
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html#boolean-or-mask-index-arrays

> I suppose I should mention: my data source is the results of a psycopg2
> query, so a "record" is a tuple or dictionary (depending on how I want to
> set up the cursor)

In this case, if you want speed, you're far better off doing most of the
work in SQL rather than Python! If you want clarity, maybe what you're doing
now is already good enough. Or maybe using more complex SQL would actually
be clearer.

Pandas sits somewhere in between and, IMHO, if your data is already in a
database, only gives you significant added value if you want to do some form
of statistical analysis, some other kind of more complex computation, or if
you want to create plots.
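If you go the SQL route, a rough sketch of what the bin counting might look
like when pushed into PostgreSQL via psycopg2 (the table name "flights", the
columns "scheduled" and "actual", the bin edges, and the connection string
are just placeholders):

    import psycopg2

    # Count the week-to-date records into delay bins inside the database.
    # All identifiers below are hypothetical placeholders.
    query = """
        SELECT
            count(*) FILTER (WHERE actual - scheduled < interval '1 minute')
                AS under_1_min,
            count(*) FILTER (WHERE actual - scheduled >= interval '1 minute'
                               AND actual - scheduled < interval '5 minutes')
                AS one_to_5_min,
            count(*) FILTER (WHERE actual - scheduled >= interval '5 minutes')
                AS over_5_min
        FROM flights
        WHERE actual >= date_trunc('week', now())
    """

    with psycopg2.connect("dbname=mydb") as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            under_1, one_to_5, over_5 = cur.fetchone()

The FILTER clause needs PostgreSQL 9.4 or later; on older versions the same
counts can be written as sum(CASE WHEN ... THEN 1 ELSE 0 END).

--
Thomas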