Seb wrote:
> Hello,
>
> Given a list of files:
>
> In [81]: ec_files[0:10]
> Out[81]:
>
> [u'EC_20160604002000.csv',
>  u'EC_20160604010000.csv',
>  u'EC_20160604012000.csv',
>  u'EC_20160604014000.csv',
>  u'EC_20160604020000.csv']
>
> where the numbers are a timestamp with format %Y%m%d%H%M%S, I'd like
> to generate a list of matching files for each 2-hr period in a 2-hr
> frequency time series. Ultimately I'm using Pandas to read and handle
> the data in each group of files. For the task of generating the files
> for each 2-hr period, I've done the following:
>
> beg_tstamp = pd.to_datetime(ec_files[0][-18:-4],
>                             format="%Y%m%d%H%M%S")
> end_tstamp = pd.to_datetime(ec_files[-1][-18:-4],
>                             format="%Y%m%d%H%M%S")
> tstamp_win = pd.date_range(beg_tstamp, end_tstamp, freq="2H")
>
> So tstamp_win is the 2-hr frequency time series spanning the
> timestamps of the files in ec_files.
>
> I've generated the list of matching files for each tstamp_win entry
> using a comprehension:
>
> win_files = []
> for i, w in enumerate(tstamp_win):
>     nextw = w + pd.Timedelta(2, "h")
>     ifiles = [x for x in ec_files if
>               pd.to_datetime(x[-18:-4], format="%Y%m%d%H%M%S") >= w and
>               pd.to_datetime(x[-18:-4], format="%Y%m%d%H%M%S") < nextw]
>     win_files.append(ifiles)
>
> However, this is proving very slow, and I was wondering whether
> there's a better/faster way to do this. Any tips would be appreciated.
Is win_files huge? Then it might help to avoid going over the entire
list for every interval. Instead you can sort the list and then keep
adding to the current group while you are still below nextw. My pandas
doesn't seem to have Timedelta (probably it's too old), so here's a
generic solution using only the stdlib:

$ cat group_2hours.py
import itertools
import datetime
import pprint


def filename_to_time(filename):
    # Extract the %Y%m%d%H%M%S timestamp embedded in the filename.
    return datetime.datetime.strptime(filename[-18:-4], "%Y%m%d%H%M%S")


def make_key(delta_t):
    upper_bound = None

    def key(filename):
        # Return the (exclusive) upper bound of the window this file
        # falls into; consecutive files that share an upper bound end
        # up in the same itertools.groupby() group.
        nonlocal upper_bound
        if upper_bound is None:
            upper_bound = filename_to_time(filename) + delta_t
        else:
            t = filename_to_time(filename)
            while t >= upper_bound:  # needs work if there are large gaps
                upper_bound += delta_t
        return upper_bound

    return key


ec_files = [
    u'EC_20160604002000.csv',
    u'EC_20160604010000.csv',
    u'EC_20160604012000.csv',
    u'EC_20160604014000.csv',
    u'EC_20160604020000.csv',
    u'EC_20160604050000.csv',
    u'EC_20160604060000.csv',
    u'EC_20160604070000.csv',
]
ec_files.sort()  # ensure filenames are in ascending order

TWO_HOURS = datetime.timedelta(hours=2)
win_files = [
    list(group)
    for _key, group in itertools.groupby(ec_files, key=make_key(TWO_HOURS))
]
pprint.pprint(win_files)
$ python3 group_2hours.py
[['EC_20160604002000.csv',
  'EC_20160604010000.csv',
  'EC_20160604012000.csv',
  'EC_20160604014000.csv',
  'EC_20160604020000.csv'],
 ['EC_20160604050000.csv', 'EC_20160604060000.csv'],
 ['EC_20160604070000.csv']]
$

PS: If the files' prefixes differ you cannot sort by name. Sort by
timestamp instead:

    ec_files.sort(key=filename_to_time)

PPS: There is probably a way to do this by converting the list to a
pandas dataframe; it might be worthwhile to ask in a specialised forum.
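A minimal sketch of that pandas idea, assuming a pandas recent enough
to have Timedelta (the original code uses it, so that seems safe): parse
every timestamp once, label each file with the start of its 2-hour
window, and group on the label. The names tstamps, files and labels are
illustrative, not from the thread:

import pandas as pd

ec_files = [
    u'EC_20160604002000.csv',
    u'EC_20160604010000.csv',
    u'EC_20160604012000.csv',
    u'EC_20160604014000.csv',
    u'EC_20160604020000.csv',
    u'EC_20160604050000.csv',
    u'EC_20160604060000.csv',
    u'EC_20160604070000.csv',
]

# Rough sketch of the pandas approach; variable names are illustrative.
# Parse all timestamps in one vectorised call instead of once per window.
tstamps = pd.to_datetime([f[-18:-4] for f in ec_files],
                         format="%Y%m%d%H%M%S")
files = pd.Series(ec_files, index=tstamps).sort_index()

# Label each file with the start of its 2-hour window, measured from
# the first timestamp, then collect the filenames per label.
beg = files.index[0]
two_hours = pd.Timedelta(hours=2)
labels = beg + (files.index - beg) // two_hours * two_hours
win_files = [list(group) for _label, group in files.groupby(labels)]

Note that, unlike the explicit loop over tstamp_win, groupby() only
yields non-empty windows; if you need an entry for every 2-hour slot,
collect the groups into a dict keyed by the window label and look each
slot up.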