Re: speed up pandas calculation

2014-07-30 Thread Vincent Davis
On Wed, Jul 30, 2014 at 5:57 PM, Skip Montanaro wrote:
> > df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE', low_memory=False)
> > col_init = list(df.columns.values)
> > keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
> >             'MED2', 'MED3', 'MED4', 'MED5']
> > for col in …

Re: speed up pandas calculation

2014-07-30 Thread Rustom Mody
On Thursday, July 31, 2014 7:58:59 AM UTC+5:30, Skip Montanaro wrote:
> As I am learning (often painfully) with pandas and JavaScript+(d3 or
> jQuery), loops are the enemy. You want to operate on large chunks of
> data simultaneously. In pandas, those chunks are thinly disguised
> numpy arrays. In …
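Skip's loops-are-the-enemy point can be made concrete with a small sketch; the column names and values below are hypothetical stand-ins, not the real NHAMCS data:

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
df = pd.DataFrame({'MED1': [10, 20, 30], 'MED2': [20, 99, 10]})

# Loop version: Python-level iteration, row by row (slow on large frames).
flags_loop = [any(v == 20 for v in row) for row in df.to_numpy()]

# Vectorized version: one call over the whole underlying numpy block.
flags_vec = df.isin([20]).any(axis=1)

print(flags_loop)          # [True, True, False]
print(flags_vec.tolist())  # same answer, without the Python-level loop
```

Both give the same booleans; on a 30,000-row frame the vectorized form does the work inside numpy instead of the interpreter.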

Re: speed up pandas calculation

2014-07-30 Thread Skip Montanaro
On Wed, Jul 30, 2014 at 8:11 PM, Chris Kaynor wrote:
> Another way to write this, using a list expression (untested):
> new_df = [col for col in df if col.value in keep_col]

As I am learning (often painfully) with pandas and JavaScript+(d3 or jQuery), loops are the enemy. You want to operate on large chunks of data simultaneously. In pandas, those chunks are thinly disguised numpy arrays. …
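As written, Chris's untested expression would iterate over column labels (plain strings, which have no .value attribute) and build a list rather than a DataFrame. A corrected sketch, with hypothetical columns, keeps the list-comprehension idea but indexes once to get a frame:

```python
import pandas as pd

df = pd.DataFrame({'PATCODE': [1], 'PATWT': [5], 'EXTRA': [9]})
keep_col = ['PATCODE', 'PATWT']

# Iterating a DataFrame yields column labels, so filter those labels,
# then index once to get a new frame holding only the kept columns.
new_df = df[[col for col in df if col in keep_col]]

print(list(new_df.columns))  # ['PATCODE', 'PATWT']
```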

Re: speed up pandas calculation

2014-07-30 Thread Skip Montanaro
(Now that I'm on a real keyboard, more complete responses are a bit easier.) Regarding the issue of missing columns from keep_col, you could create sets of what you have and what you want, and toss the rest:

toss_these = list(set(df.columns) - set(keep_col))
del df[toss_these]

Or something to th…
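One caveat with the snippet above: in pandas, `del df[toss_these]` does not accept a list of labels, while `drop` does. A minimal sketch of the same set-difference idea in current pandas (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'PATCODE': [1], 'PATWT': [2], 'JUNK1': [3], 'JUNK2': [4]})
keep_col = ['PATCODE', 'PATWT']

# The set difference yields every column we don't want, in one expression.
toss_these = list(set(df.columns) - set(keep_col))

# del df[toss_these] fails for a list of labels; drop() takes the whole list,
# and silently skips nothing -- every label in toss_these exists in df here.
df = df.drop(columns=toss_these)

print(sorted(df.columns))  # ['PATCODE', 'PATWT']
```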

Re: speed up pandas calculation

2014-07-30 Thread Steven D'Aprano
On Wed, 30 Jul 2014 18:57:15 -0600, Vincent Davis wrote:
> On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis wrote:
>> The real slow part seems to be
>> for n in drugs:
>>     df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)
>
> I was wrong, this is fast; it was …

Re: speed up pandas calculation

2014-07-30 Thread Chris Kaynor
On Wed, Jul 30, 2014 at 5:57 PM, Vincent Davis wrote:
> On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis wrote:
>> The real slow part seems to be
>> for n in drugs:
>>     df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)
>
> I was wrong, this is fast; it was selecting the columns that was slow.

Re: speed up pandas calculation

2014-07-30 Thread Steven D'Aprano
On Wed, 30 Jul 2014 17:04:04 -0600, Vincent Davis wrote:
> I know this is a general python list and I am asking about pandas, but
> this question is probably not great for asking on stackoverflow. I have
> a list of files (~80 files, ~30,000 rows) I need to process; with my
> current code it takes…

Re: speed up pandas calculation

2014-07-30 Thread Vincent Davis
On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis wrote:
> The real slow part seems to be
> for n in drugs:
>     df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)

I was wrong, this is fast; it was selecting the columns that was slow. Using keep_col = ['PATCODE', 'PATWT'…
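Vincent's per-drug flag pattern can be sketched end to end; the drugs mapping and code values here are hypothetical, the five MED columns are reduced to two for brevity, and any(axis=1) is the spelled-out form of .any(1):

```python
import pandas as pd

# Hypothetical stand-in rows for the survey file.
df = pd.DataFrame({
    'PATCODE': [1, 2, 3],
    'MED1': [100, 200, 300],
    'MED2': [0, 100, 0],
})
drugs = {'aspirin': 100}  # hypothetical drug-name -> medication-code mapping
med_cols = ['MED1', 'MED2']

for n in drugs:
    # True wherever any MED column holds this drug's code: one vectorized
    # isin/any pass per drug instead of a Python loop over 30,000 rows.
    df[n] = df[med_cols].isin([drugs[n]]).any(axis=1)

print(df['aspirin'].tolist())  # [True, True, False]
```

The outer loop runs once per drug, which is cheap; the row-level work stays vectorized, which is why Vincent found this part fast.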

Re: speed up pandas calculation

2014-07-30 Thread Skip Montanaro
> df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE', low_memory=False)
> col_init = list(df.columns.values)
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
>             'MED2', 'MED3', 'MED4', 'MED5']
> for col in col_init:
>     if col not in keep_col:
>         del df[col]

I'm n…

Re: speed up pandas calculation

2014-07-30 Thread Mark Lawrence
On 31/07/2014 00:04, Vincent Davis wrote:
> I know this is a general python list and I am asking about pandas, but
> this question is probably not great for asking on stackoverflow. I have
> a list of files (~80 files, ~30,000 rows) I need to process; with my
> current code it takes minutes for each file…

speed up pandas calculation

2014-07-30 Thread Vincent Davis
I know this is a general python list and I am asking about pandas, but this question is probably not great for asking on stackoverflow. I have a list of files (~80 files, ~30,000 rows) I need to process; with my current code it takes minutes for each file. Any suggestions for a faster way? I am trying to…