Hi, Cross-posted at http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing, but thought I'd try here too, as there have been no responses there so far.
A bit new to Python and very new to parallel processing in Python. I have a script that processes a data file and generates a dict of dicts. However, since I need to run this task on hundreds to thousands of these files and ultimately collate the data, parallel processing seemed to make a lot of sense. The trouble is that I can't figure out how to build the data structure. Minimal script, without all the helper functions:

#!/usr/bin/python
import sys
import os
import re
import subprocess
import multiprocessing

from joblib import Parallel, delayed
from collections import defaultdict
from pprint import pprint


def proc_vcf(vcf, results):
    # Note: str.rstrip() strips a *set* of characters rather than a suffix,
    # so vcf.rstrip('.vcf') can eat trailing letters of the sample name;
    # splitext() is the safe way to drop the extension.
    sample_name = os.path.splitext(os.path.basename(vcf))[0]
    results.setdefault(sample_name, {})

    # Run helper functions run_cmd() and parse_variant_data() to generate
    # a list of entries. Expect a dict of dicts of lists.
    all_vars = run_cmd('vcfExtractor', vcf)
    results[sample_name]['all_vars'] = parse_variant_data(all_vars, 'all')

    # Run the same helpers to generate a different list of data based on
    # a different set of criteria.
    mois = run_cmd('moi_report', vcf)
    results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
    return results


def main():
    input_files = sys.argv[1:]

    # collected_data = defaultdict(lambda: defaultdict(dict))
    collected_data = {}

    # Parallel processing version
    # num_cores = multiprocessing.cpu_count()
    # Parallel(n_jobs=num_cores)(
    #     delayed(proc_vcf)(vcf, collected_data) for vcf in input_files)

    # Serial version (this one works fine)
    for vcf in input_files:
        proc_vcf(vcf, collected_data)

    pprint(dict(collected_data))


if __name__ == "__main__":
    main()

Hard to provide source data as it's very large, but basically the dataset will generate a dict of dicts of lists containing two sets of data for each input, keyed by sample and data type:

{
    'sample1': {
        'all_vars': ['data_val1', 'data_val2', 'etc'],
        'mois': ['data_val_x', 'data_val_y', 'data_val_z'],
    },
    'sample2': {
        'all_vars': [ . . . ],
    },
}

If I run it without trying to multiprocess, not a problem. I just can't figure out how to parallelize this and create the same data structure. I've tried using defaultdict to create a defaultdict in main() to pass along, as well as a few other iterations, but I can't seem to get it right (I'm getting key errors, pickle errors, etc.).

Can anyone help me with the proper way to do this? I think I'm not making / initializing / working with the data structure correctly, but maybe my whole approach is ill conceived?
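In case it clarifies what I'm after, here is the closest pattern I can piece together from the joblib docs: have each worker build and *return* its own one-sample dict, then merge the returned dicts in the parent, instead of mutating a shared collected_data across processes (my understanding is that joblib workers are separate processes and can't see the parent's dict). This is only a sketch under that assumption: proc_vcf is reworked here to drop the results argument, and run_cmd() / parse_variant_data() are my same helpers, omitted as above. Is this the right approach?

import sys
import os
import multiprocessing
from pprint import pprint
from joblib import Parallel, delayed


def proc_vcf(vcf):
    # Build and return this sample's piece of the result instead of
    # writing into a shared dict; each call runs in its own process.
    # run_cmd() and parse_variant_data() are the helpers omitted above.
    sample_name = os.path.splitext(os.path.basename(vcf))[0]
    return {
        sample_name: {
            'all_vars': parse_variant_data(run_cmd('vcfExtractor', vcf), 'all'),
            'mois': parse_variant_data(run_cmd('moi_report', vcf), 'moi'),
        }
    }


def main():
    input_files = sys.argv[1:]
    num_cores = multiprocessing.cpu_count()

    # Parallel() returns a list of the workers' return values, i.e. one
    # one-sample dict per input file.
    per_sample = Parallel(n_jobs=num_cores)(
        delayed(proc_vcf)(vcf) for vcf in input_files)

    # Merge the per-sample dicts back together in the parent process.
    collected_data = {}
    for d in per_sample:
        collected_data.update(d)
    pprint(collected_data)


if __name__ == "__main__":
    main()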