Hello, I've got several versions of code here to generate a histogram-esque structure from rows in a CSV file.
The basic approach is to use a dict as a bucket collection to count instances of data items. I don't like the try/except(KeyError) idiom for dealing with new bucket names, because it describes the initial state of a key's value _after_ you've just described what to do with the existing value, so I've come up with a few other methods.

Which seems like the most reasonable approach? Do you have any other ideas? Is the try/except(KeyError) idiom really the best?

In the code below you will see several 4-line groups of code. The n-th line of each group represents one solution to the problem. (Cases 1 & 2 do differ from cases 3 & 4 in the final outcome.)

Thank you :)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from collections import defaultdict
from csv import DictReader
from pprint import pprint

dataFile = open("sampledata.csv")
dataRows = DictReader(dataFile)

catagoryStats = defaultdict(lambda : {'leaf' : '', 'count' : 0})
#catagoryStats = {}
#catagoryStats = defaultdict(int)
#catagoryStats = {}

for row in dataRows:
    catagoryRaw = row['CATEGORIES']
    catagoryLeaf = catagoryRaw.split('|').pop()

    ## csb => Catagory Stats Bucket
    ## multi-statement lines are used for ease of method switching.
    csb = catagoryStats[catagoryRaw]; csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #csb = catagoryStats.setdefault(catagoryRaw, {'leaf' : '', 'count' : 0}); csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #catagoryStats[catagoryRaw] += 1
    #catagoryStats[catagoryRaw] = catagoryStats.get(catagoryRaw, 0) + 1

catagoryStatsSorted = catagoryStats.items()

catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)

pprint(catagoryStatsSorted, indent=4, width=60)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sampledata.csv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CATEGORIES,SKU
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|not_at_all_fun",12345

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
output: (in case #1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In [1]: %run catstat.py
[   (   'computers|servers|accessories',
        {'count': 5, 'leaf': 'accessories'}),
    (   'toys|really|super_fun',
        {'count': 3, 'leaf': 'super_fun'}),
    (   'computers|laptops|accessories',
        {'count': 3, 'leaf': 'accessories'}),
    (   'toys|really|not_at_all_fun',
        {'count': 1, 'leaf': 'not_at_all_fun'})]
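
For comparison, here is a minimal sketch of the try/except(KeyError) idiom I mentioned, doing a plain count as in cases 3 & 4. The helper name count_categories and the inline sample list (a few values from the CATEGORIES column) are only there for illustration; they are not part of catstat.py.

```python
def count_categories(categories):
    """Count occurrences of each category string using try/except KeyError."""
    stats = {}
    for raw in categories:
        try:
            # What to do with an existing bucket is stated first...
            stats[raw] += 1
        except KeyError:
            # ...and the initial state of a new bucket only comes afterwards.
            stats[raw] = 1
    return stats


# Illustrative sample: a subset of the CATEGORIES values from sampledata.csv.
sample = [
    "computers|laptops|accessories",
    "computers|servers|accessories",
    "computers|servers|accessories",
    "toys|really|super_fun",
]

counts = count_categories(sample)

# Print buckets from most to least frequent.
for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(name, count)
```

The ordering inside the try/except block is exactly what bothers me: the increment is written before the initial value is.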