grahamdic...@gmail.com wrote:
Hi
I have an Excel file (saved as CSV) that is read into Python (8000 rows):
from csv import reader
incsv = reader(open(MY_FILE), dialect='excel')
keys = next(incsv)  # header row
The datatypes are mixed. The last column contains a cumulative frequency, running in order from 0.0000 to 1.0000 over the 8000 rows.
In a loop of 100,000 iterations I want to draw a new random number each time and find the first row whose cumulative frequency is bigger than it.
Here's my current (pseudo)code:
import random

rows = list(incsv)  # a csv reader can only be scanned once

for _ in range(100000):
    myRand = random.random()
    for line in rows:
        if float(line[-1]) > myRand:
            # Convert each field to int, then float, else leave as a string.
            resline = []
            for item in line:
                try:
                    i = int(item)
                except ValueError:
                    try:
                        i = float(item)
                    except ValueError:
                        i = item
                resline.append(i)
            # Here we construct a dict of pair values: {'ID': 18, ...}
            res = dict(zip(keys, resline))
            break
    # do some stuff with res
I'm scanning over each line of the csv to decide which row to select, 100k times, and this is just not very efficient. How can I improve this code? The costly part is:
for line in rows:
    if float(line[-1]) > random.random():
I can use numpy etc., whatever works.
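Since numpy is an option: `numpy.searchsorted` does this lookup for all 100,000 random draws in one vectorized call. A sketch, using a small made-up cumulative-frequency array standing in for the real 8000-entry column:

```python
import numpy as np

# Hypothetical cumulative-frequency column (sorted ascending, ending at 1.0),
# standing in for the last column of the real file.
cumfreq = np.array([0.25, 0.50, 0.75, 1.00])

# Draw all 100,000 random numbers at once...
rands = np.random.random(100000)

# ...and find, for each one, the index of the first cumulative
# frequency strictly greater than it (side='right' matches the
# `float(line[-1]) > myRand` test in the loop above).
indices = np.searchsorted(cumfreq, rands, side='right')
```

`indices` can then be used to look up the corresponding pre-built row dicts, so the per-draw cost drops from a linear scan to a binary search.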
Here's a suggestion:
Construct the dicts for all the rows, stored in a list.
Construct a list of just the cumulative frequencies.
For each random value, use the bisect module to search for it in the
cumulative-frequencies list. This returns an index which you can
then use on your list of dicts.
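That suggestion might look like this; the sample rows below are made up to stand in for the dicts built from the real 8000-row file:

```python
import bisect
import random

# Hypothetical rows, as the dicts built from dict(zip(keys, resline));
# the last column's cumulative frequencies are assumed sorted ascending.
rows = [
    {'ID': 1, 'name': 'a', 'cumfreq': 0.25},
    {'ID': 2, 'name': 'b', 'cumfreq': 0.50},
    {'ID': 3, 'name': 'c', 'cumfreq': 0.75},
    {'ID': 4, 'name': 'd', 'cumfreq': 1.00},
]

# Built once: just the cumulative frequencies, in row order.
cumfreqs = [row['cumfreq'] for row in rows]

def pick_row(rand):
    # bisect_right returns the index of the first entry strictly
    # greater than rand, matching the original `> myRand` test.
    i = bisect.bisect_right(cumfreqs, rand)
    return rows[i]

for _ in range(100000):
    res = pick_row(random.random())
    # do some stuff with res
```

Because `random.random()` is always < 1.0 and the last cumulative frequency is 1.0, the bisect index is always in range. Each lookup is O(log n) instead of an O(n) scan, so the 100k draws become cheap once the rows and the frequency list are built.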
--
http://mail.python.org/mailman/listinfo/python-list