On Wed, Jul 30, 2014 at 5:57 PM, Skip Montanaro <skip.montan...@gmail.com> wrote:
> > df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE', low_memory=False)
> > col_init = list(df.columns.values)
> > keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
> >             'MED2', 'MED3', 'MED4', 'MED5']
> > for col in col_init:
> >     if col not in keep_col:
> >         del df[col]
>
> I'm no pandas expert, but a couple things come to mind. First, where is
> your code slow (profile it, even with a few well-placed prints)? If it's in
> read_csv there might be little you can do unless you load those data
> repeatedly, and can save a pickled data frame as a caching measure. Second,
> you loop over columns deciding one by one whether to keep or toss a column.
> Instead try
>
> df = df[keep_col]
>
> Third, if deleting those other columns is costly, can you perhaps just
> ignore them?
>
> Can't be more investigative right now. I don't have pandas on Android. :-)

So df = df[keep_col] is not fast, but it is not that slow either. You made me
think of a solution to that part: just slice and copy. The only gotcha is
that the names in keep_col have to actually exist:

keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2', 'MED3',
            'MED4', 'MED5']
df = df[keep_col]

The really slow part seems to be:

for n in drugs:
    df[n] = df[['MED1', 'MED2', 'MED3', 'MED4', 'MED5']].isin([drugs[n]]).any(1)

Vincent Davis
720-301-3003
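[Editor's note: a minimal sketch of Skip's "just ignore the other columns" suggestion. Rather than loading the whole file and then deleting or slicing columns, read_csv's usecols parameter parses only the listed columns, so the unwanted ones are never built in memory. The in-memory CSV here is a stand-in for the real nhamcsopd2010.csv, and the column names are taken from the thread.]

```python
import io
import pandas as pd

# Stand-in for the real file; in practice this would be
# pd.read_csv('nhamcsopd2010.csv', ...).
csv_data = io.StringIO(
    "PATCODE,PATWT,VDAYR,VMONTH,MED1,MED2,MED3,MED4,MED5,EXTRA\n"
    "1,10.5,2,7,111,222,0,0,0,junk\n"
    "2,11.0,3,8,333,111,0,0,0,junk\n"
)

keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH',
            'MED1', 'MED2', 'MED3', 'MED4', 'MED5']

# usecols restricts parsing to these columns; EXTRA is never loaded.
df = pd.read_csv(csv_data, index_col='PATCODE', usecols=keep_col)
print(list(df.columns))
```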
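[Editor's note: for the slow drugs loop, one possible vectorization is to compare all MED cells against all drug codes in a single NumPy broadcast instead of calling isin once per drug. This is a sketch that assumes drugs is a dict mapping an output column name to a single drug code, which the thread suggests but does not confirm; the toy frame below stands in for the real data.]

```python
import numpy as np
import pandas as pd

# Toy data standing in for the NHAMCS frame.
df = pd.DataFrame({
    'MED1': [111, 333, 0],
    'MED2': [222, 111, 0],
    'MED3': [0,   0,   555],
    'MED4': [0,   0,   0],
    'MED5': [0,   0,   0],
})
# Assumed shape of drugs: {new column name: drug code}.
drugs = {'aspirin': 111, 'ibuprofen': 555}

meds = df[['MED1', 'MED2', 'MED3', 'MED4', 'MED5']].values
codes = np.array(list(drugs.values()))

# (rows, 5, 1) == (ncodes,) broadcasts to (rows, 5, ncodes);
# any() over the MED axis yields one boolean column per drug.
hits = (meds[:, :, None] == codes).any(axis=1)

df = df.join(pd.DataFrame(hits, columns=list(drugs.keys()), index=df.index))
print(df[['aspirin', 'ibuprofen']])
```

This replaces len(drugs) passes over the frame with a single comparison, at the cost of a temporary (rows x 5 x ncodes) boolean array.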
-- https://mail.python.org/mailman/listinfo/python-list