Hi, Cross-posted at http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing, but thought I'd try here too, as there have been no responses there so far.
A bit new to Python and very new to parallel processing in Python. I have a script that processes a data file and generates a dict of dicts. However, since I need to run this task on hundreds to thousands of these files and ultimately collate the data, parallel processing seemed to make a lot of sense. The trouble is that I can't figure out how to build the data structure. Minimal script, without all the helper functions:

#!/usr/bin/python
import sys
import os
import re
import subprocess
import multiprocessing

from joblib import Parallel, delayed
from collections import defaultdict
from pprint import pprint


def proc_vcf(vcf, results):
    # Note: str.rstrip() strips a *set* of characters rather than a suffix,
    # so vcf.rstrip('.vcf') can eat trailing letters of the sample name;
    # splitext() is the safe way to drop the extension.
    sample_name = os.path.splitext(os.path.basename(vcf))[0]
    results.setdefault(sample_name, {})

    # Run helper functions run_cmd() and parse_variant_data() to generate
    # a list of entries. Expect a dict of dicts of lists.
    all_vars = run_cmd('vcfExtractor', vcf)
    results[sample_name]['all_vars'] = parse_variant_data(all_vars, 'all')

    # Run the same helpers to generate a different list of data based on
    # a different set of criteria.
    mois = run_cmd('moi_report', vcf)
    results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
    return results


def main():
    input_files = sys.argv[1:]

    # collected_data = defaultdict(lambda: defaultdict(dict))
    collected_data = {}

    # Parallel processing version
    # num_cores = multiprocessing.cpu_count()
    # Parallel(n_jobs=num_cores)(
    #     delayed(proc_vcf)(vcf, collected_data) for vcf in input_files)

    # Serial version (this one works fine)
    for vcf in input_files:
        proc_vcf(vcf, collected_data)

    pprint(dict(collected_data))


if __name__ == "__main__":
    main()

Hard to provide source data as it's very large, but basically the dataset will generate a dict of dicts of lists containing two sets of data for each input, keyed by sample and data type:

{
    'sample1': {
        'all_vars': ['data_val1', 'data_val2', 'etc'],
        'mois': ['data_val_x', 'data_val_y', 'data_val_z'],
    },
    'sample2': {
        'all_vars': [ . . . ],
    },
}

If I run it without trying to multiprocess, not a problem. I just can't figure out how to parallelize this and create the same data structure. I've tried using defaultdict to create a defaultdict in main() to pass along, as well as a few other iterations, but I can't seem to get it right (I'm getting key errors, pickle errors, etc.).

Can anyone help me with the proper way to do this? I think I'm not making / initializing / working with the data structure correctly, but maybe my whole approach is ill conceived?
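In case it clarifies what I'm after, here is the closest pattern I can piece together from the joblib docs: have each worker build and *return* its own one-sample dict, then merge the returned dicts in the parent, instead of mutating a shared collected_data across processes (my understanding is that joblib workers are separate processes and can't see the parent's dict). This is only a sketch under that assumption: proc_vcf is reworked here to drop the results argument, and run_cmd() / parse_variant_data() are my same helpers, omitted as above. Is this the right approach?

import sys
import os
import multiprocessing
from pprint import pprint
from joblib import Parallel, delayed


def proc_vcf(vcf):
    # Build and return this sample's piece of the result instead of
    # writing into a shared dict; each call runs in its own process.
    # run_cmd() and parse_variant_data() are the helpers omitted above.
    sample_name = os.path.splitext(os.path.basename(vcf))[0]
    return {
        sample_name: {
            'all_vars': parse_variant_data(run_cmd('vcfExtractor', vcf), 'all'),
            'mois': parse_variant_data(run_cmd('moi_report', vcf), 'moi'),
        }
    }


def main():
    input_files = sys.argv[1:]
    num_cores = multiprocessing.cpu_count()

    # Parallel() returns a list of the workers' return values, i.e. one
    # one-sample dict per input file.
    per_sample = Parallel(n_jobs=num_cores)(
        delayed(proc_vcf)(vcf) for vcf in input_files)

    # Merge the per-sample dicts back together in the parent process.
    collected_data = {}
    for d in per_sample:
        collected_data.update(d)
    pprint(collected_data)


if __name__ == "__main__":
    main()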