On Wed, Apr 20, 2016 at 10:50 AM Sims, David (NIH/NCI) [C] <david.si...@nih.gov> wrote:
> Hi,
>
> Cross-posted at
> http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing,
> but thought I'd try here too, as there have been no responses there so far.
>
> A bit new to Python and very new to parallel processing in Python. I have
> a script that will process a data file and generate a dict of dicts.
> However, as I need to run this task on hundreds to thousands of these files
> and ultimately collate the data, I thought parallel processing made a lot
> of sense. However, I can't seem to figure out how to create the data
> structure. Minimal script without all the helper functions:
>
> #!/usr/bin/python
> import sys
> import os
> import re
> import subprocess
> import multiprocessing
> from joblib import Parallel, delayed
> from collections import defaultdict
> from pprint import pprint
>
> def proc_vcf(vcf, results):
>     sample_name = vcf.rstrip('.vcf')
>     results.setdefault(sample_name, {})
>
>     # Run helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a list of entries. Expect a dict of dicts of lists.
>     all_vars = run_cmd('vcfExtractor', vcf)
>     results[sample_name]['all_vars'] = parse_variant_data(all_vars, 'all')
>
>     # Run helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a different list of data based on a different set of criteria.
>     mois = run_cmd('moi_report', vcf)
>     results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
>     return results
>
> def main():
>     input_files = sys.argv[1:]
>
>     # collected_data = defaultdict(lambda: defaultdict(dict))
>     collected_data = {}
>
>     # Parallel processing version
>     # num_cores = multiprocessing.cpu_count()
>     # Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf, collected_data) for vcf in input_files)
>
>     # Serial version
>     # for vcf in input_files:
>     #     proc_vcf(vcf, collected_data)
>
>     pprint(dict(collected_data))
>     return
>
> if __name__ == "__main__":
>     main()
>
> Hard to provide source data as it's very large, but basically the dataset
> will generate a dict of dicts of lists that contains two sets of data for
> each input, keyed by sample and data type:
>
> { 'sample1' : {
>       'all_vars' : [
>           'data_val1',
>           'data_val2',
>           'etc'],
>       'mois' : [
>           'data_val_x',
>           'data_val_y',
>           'data_val_z']
>       },
>   'sample2' : {
>       'all_vars' : [
>           .
>           .
>           .
>           ]
>       }
> }
>
> If I run it without trying to multiprocess, no problem. I can't figure
> out how to parallelize this and create the same data structure. I've tried
> using defaultdict to create a defaultdict in main() to pass along, as well
> as a few other iterations, but I can't seem to get it right (getting key
> errors, pickle errors, etc.). Can anyone help me with the proper way to do
> this? I think I'm not making / initializing / working with the data
> structure correctly, but maybe my whole approach is ill-conceived?

Processes do not share memory, so your collected_data is copied only once, at
the time you pass it to each subprocess; any updates the workers make never
reach the dict in the parent. There's an undocumented ThreadPool that works
the same as the process Pool
(https://docs.python.org/3.5/library/multiprocessing.html#using-a-pool-of-workers).
ThreadPool will share memory across your subthreads. In the example I linked
to, just replace ``from multiprocessing import Pool`` with ``from
multiprocessing.pool import ThreadPool``.

How compute-intensive is your task? If it's mostly disk-read-intensive rather
than compute-intensive, then threads are all you need.
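For illustration, here is a minimal sketch of that ThreadPool swap applied to
the shape of the script above. The ``load_sample()`` function is a hypothetical
stand-in for the poster's proc_vcf()/run_cmd()/parse_variant_data() helpers,
which aren't shown in the post; the only point demonstrated is that the worker
threads all mutate the same collected_data dict, with each thread writing its
own sample key.

    from multiprocessing.pool import ThreadPool   # thread-backed drop-in for Pool
    from pprint import pprint
    import sys

    def load_sample(vcf, results):
        # Stand-in for proc_vcf(): the real version would call run_cmd() and
        # parse_variant_data(). Note that str.rstrip('.vcf') strips a *set* of
        # characters, not a suffix, so an explicit suffix check is used here.
        sample_name = vcf[:-len('.vcf')] if vcf.endswith('.vcf') else vcf
        results[sample_name] = {
            'all_vars': ['data_val1', 'data_val2'],    # placeholder data
            'mois':     ['data_val_x', 'data_val_y'],  # placeholder data
        }

    def main():
        input_files = sys.argv[1:]
        collected_data = {}

        pool = ThreadPool()   # defaults to one worker thread per CPU
        # Threads share the parent's memory, so each call writes straight
        # into collected_data; every file gets its own key, so no two
        # threads ever touch the same entry.
        pool.map(lambda vcf: load_sample(vcf, collected_data), input_files)
        pool.close()
        pool.join()

        pprint(collected_data)

    if __name__ == '__main__':
        main()

If the per-file work turns out to be CPU-bound rather than I/O-bound, an
alternative that also works with a process Pool is to have each worker return
a (sample_name, data) pair and build collected_data in the parent from the
list that pool.map() returns, so nothing has to be shared between workers at
all.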