paul.scipi...@aps.com wrote:
Hello,
I'm a newbie to Python. I wrote a Python script which connects to my
Geodatabase (ESRI ArcGIS File Geodatabase), retrieves the records, then
proceeds to evaluate which ones are duplicated. I do this using lists.
Someone suggested I use arrays instead. Below is the content of my
script. Anyone have any ideas on how an array can improve performance?
Right now the script takes 2.5 minutes to run on a recordset of 79k+
records:
from __future__ import division
import sys, string, os, arcgisscripting, time
from time import localtime, strftime
def writeMessage(myMsg):
    print myMsg
--- delete start ---
    global log
    log = open(logFile, 'a')
--- delete end ---
Open the log file once in the main part of the script.
    log.write(myMsg + "\n")
logFile = "c:\\temp\\" + str(strftime("%Y%m%d %H%M%S", localtime())) + ".log"
log = open(logFile, 'a')
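Here's a self-contained sketch of that open-once pattern; the log path uses tempfile and Python 3's print() purely for illustration (the original script targets Python 2 and writes a timestamped file under c:\temp):

```python
import os
import tempfile

# Hypothetical log path; stands in for the timestamped c:\temp file.
logFile = os.path.join(tempfile.gettempdir(), 'unique_values_demo.log')
log = open(logFile, 'w')

def writeMessage(myMsg):
    # The file handle is opened once at module level and reused here.
    print(myMsg)
    log.write(myMsg + "\n")

writeMessage('begin unique values test')
log.close()
```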
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' begin unique values test')
# Create the Geoprocessor object
gp = arcgisscripting.create(9.3)
oid_list = []
dup_list = []
tmp_list = []
myWrkspc = "c:\\temp\\TVM Geodatabase GDIschema v6.0.2 PilotData.gdb"
myFtrCls = "_\\Landbase\\T_GroundContour_"
writeMessage(' ')
writeMessage('gdb: ' + myWrkspc)
writeMessage('ftr: ' + myFtrCls)
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' retrieving recordset...')
rows = gp.SearchCursor(myWrkspc + myFtrCls,"","","GDI_OID")
row = rows.Next()
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' processing recordset...')
--- delete start ---
while row:
    if row.GDI_OID in oid_list:
        tmp_list.append(row.GDI_OID)
    oid_list.append(row.GDI_OID)
    row = rows.Next()
--- delete end ---
You're doing a linear search of a list, so each `in` test gets slower as
the list grows. Build a dict of how many times you've seen each item instead.
oid_count = {}
while row:
    oid_count[row.GDI_OID] = oid_count.get(row.GDI_OID, 0) + 1
    oid_list.append(row.GDI_OID)
    row = rows.Next()
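For what it's worth, the same counting can be done with collections.Counter; here's a self-contained sketch with made-up GDI_OID values standing in for the cursor rows:

```python
from collections import Counter

# Hypothetical GDI_OID values; the real script reads these from the cursor.
oids = [101, 102, 103, 102, 104, 103, 102]

# One pass builds the same oid -> occurrence-count mapping as the loop above.
oid_count = Counter(oids)
```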
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' generating statistics...')
--- delete start ---
dup_count = len(tmp_list)
tmp_list = list(set(tmp_list))
tmp_list.sort()
--- delete end ---
You now want to know which ones occurred more than once.
tmp_list = [oid for oid, count in oid_count.iteritems() if count > 1]
tmp_list.sort()
dup_count = len(tmp_list)
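A runnable illustration of that count > 1 filter, with hypothetical counts (written for Python 3, which uses items() rather than iteritems()):

```python
# oid -> occurrence count, as built by the counting loop (sample values).
oid_count = {101: 1, 102: 3, 103: 2, 104: 1}

# Keep only the OIDs seen more than once; sort for stable output.
tmp_list = sorted(oid for oid, count in oid_count.items() if count > 1)
dup_count = len(tmp_list)
```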
--- delete start ---
for oid in tmp_list:
    a = str(oid) + ' '
    while len(a) < 20:
        a = a + ' '
    dup_list.append(a + '(' + str(oid_list.count(oid)) + ')')
--- delete end ---
And here you're scanning the entire list _for every item_; if there are
'n' items then it's being scanned 'n' times!
The number of times each item occurred is now stored in oid_count.
for oid in tmp_list:
    a = str(oid) + ' '
    while len(a) < 20:
        a = a + ' '
    dup_list.append(a + '(' + str(oid_count[oid]) + ')')
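As an aside, the character-by-character padding loop can be collapsed into str.ljust, which produces the same string:

```python
# Hypothetical oid and count, as one iteration of the loop would see them.
oid, count = 102, 3

# str.ljust pads with spaces out to 20 characters, exactly like the
# 'while len(a) < 20' loop above.
entry = (str(oid) + ' ').ljust(20) + '(' + str(count) + ')'
```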
for dup in dup_list:
    writeMessage(dup)
writeMessage(' ')
writeMessage('records : ' + str(len(oid_list)))
writeMessage('duplicates : ' + str(dup_count))
writeMessage('% errors : ' + str(round(dup_count / len(oid_list), 4)))
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' unique values test complete')
log.close()
--- delete start ---
del dup, dup_count, dup_list, gp, log, logFile, myFtrCls, myWrkspc
del oid, oid_list, row, rows, tmp_list
exit()
--- delete end ---
Not necessary: those names are all cleaned up when the script exits anyway.
--
http://mail.python.org/mailman/listinfo/python-list