paul.scipi...@aps.com wrote:
Hello,
I'm a newbie to Python. I wrote a Python script which connects to my
Geodatabase (ESRI ArcGIS File Geodatabase), retrieves the records, then
proceeds to evaluate which ones are duplicated. I do this using lists.
Someone suggested I use arrays instead. Below is the content of my
script. Anyone have any ideas on how an array can improve performance?
Right now the script takes 2.5 minutes to run on a recordset of 79k+
records:
from __future__ import division
import sys, string, os, arcgisscripting, time
from time import localtime, strftime
def writeMessage(myMsg):
    print myMsg
--- delete start ---
    global log
    log = open(logFile, 'a')
--- delete end ---
Open the log file once in the main part of the script.
    log.write(myMsg + "\n")
logFile = "c:\\temp\\" + str(strftime("%Y%m%d %H%M%S", localtime())) + ".log"
log = open(logFile, 'a')
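Here's a self-contained sketch of that open-once pattern; the log path uses tempfile and Python 3's print() purely for illustration (the original script targets Python 2 and writes a timestamped file under c:\temp):

```python
import os
import tempfile

# Hypothetical log path; stands in for the timestamped c:\temp file.
logFile = os.path.join(tempfile.gettempdir(), 'unique_values_demo.log')
log = open(logFile, 'w')

def writeMessage(myMsg):
    # The file handle is opened once at module level and reused here.
    print(myMsg)
    log.write(myMsg + "\n")

writeMessage('begin unique values test')
log.close()
```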
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' begin unique values test')
# Create the Geoprocessor object
gp = arcgisscripting.create(9.3)
oid_list = []
dup_list = []
tmp_list = []
myWrkspc = "c:\\temp\\TVM Geodatabase GDIschema v6.0.2 PilotData.gdb"
myFtrCls = "_\\Landbase\\T_GroundContour_"
writeMessage(' ')
writeMessage('gdb: ' + myWrkspc)
writeMessage('ftr: ' + myFtrCls)
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' retrieving recordset...')
rows = gp.SearchCursor(myWrkspc + myFtrCls,"","","GDI_OID")
row = rows.Next()
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' processing recordset...')
--- delete start ---
while row:
    if row.GDI_OID in oid_list:
        tmp_list.append(row.GDI_OID)
    oid_list.append(row.GDI_OID)
    row = rows.Next()
--- delete end ---
You're doing a linear search of a list, so each `in` test gets slower as
the list grows. Build a dict of how many times you've seen each item instead.
oid_count = {}
while row:
    oid_count[row.GDI_OID] = oid_count.get(row.GDI_OID, 0) + 1
    oid_list.append(row.GDI_OID)
    row = rows.Next()
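For what it's worth, the same counting can be done with collections.Counter; here's a self-contained sketch with made-up GDI_OID values standing in for the cursor rows:

```python
from collections import Counter

# Hypothetical GDI_OID values; the real script reads these from the cursor.
oids = [101, 102, 103, 102, 104, 103, 102]

# One pass builds the same oid -> occurrence-count mapping as the loop above.
oid_count = Counter(oids)
```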
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' generating statistics...')
--- delete start ---
dup_count = len(tmp_list)
tmp_list = list(set(tmp_list))
tmp_list.sort()
--- delete end ---
You now want to know which ones occurred more than once.
tmp_list = [oid for oid, count in oid_count.iteritems() if count > 1]
tmp_list.sort()
dup_count = len(tmp_list)
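A runnable illustration of that count > 1 filter, with hypothetical counts (written for Python 3, which uses items() rather than iteritems()):

```python
# oid -> occurrence count, as built by the counting loop (sample values).
oid_count = {101: 1, 102: 3, 103: 2, 104: 1}

# Keep only the OIDs seen more than once; sort for stable output.
tmp_list = sorted(oid for oid, count in oid_count.items() if count > 1)
dup_count = len(tmp_list)
```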
--- delete start ---
for oid in tmp_list:
    a = str(oid) + ' '
    while len(a) < 20:
        a = a + ' '
    dup_list.append(a + '(' + str(oid_list.count(oid)) + ')')
--- delete end ---
And here you're scanning the entire list _for every item_; if there are
'n' items then it's being scanned 'n' times!
The number of times each item occurred is now stored in oid_count.
for oid in tmp_list:
    a = str(oid) + ' '
    while len(a) < 20:
        a = a + ' '
    dup_list.append(a + '(' + str(oid_count[oid]) + ')')
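As an aside, the character-by-character padding loop can be collapsed into str.ljust, which produces the same string:

```python
# Hypothetical oid and count, as one iteration of the loop would see them.
oid, count = 102, 3

# str.ljust pads with spaces out to 20 characters, exactly like the
# 'while len(a) < 20' loop above.
entry = (str(oid) + ' ').ljust(20) + '(' + str(count) + ')'
```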
for dup in dup_list:
    writeMessage(dup)
writeMessage(' ')
writeMessage('records : ' + str(len(oid_list)))
writeMessage('duplicates : ' + str(dup_count))
writeMessage('% errors : ' + str(round(dup_count / len(oid_list), 4)))
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' unique values test complete')
log.close()
--- delete start ---
del dup, dup_count, dup_list, gp, log, logFile, myFtrCls, myWrkspc
del oid, oid_list, row, rows, tmp_list
exit()
--- delete end ---
Not necessary: those names are all cleaned up when the script exits anyway.
--
http://mail.python.org/mailman/listinfo/python-list