Okay cool. I’ll keep those things in mind as well. Thanks. Geoff
On Jun 12, 2019, at 1:22 PM, Wes McKinney <wesmck...@gmail.com> wrote:

hi Geoff,

It's great to hear about your benchmarking results. If you'd like to submit a pull request to the project with a blog post showcasing your results, please be our guest (the blog content is all under the site/ directory; let us know if you have any issues). The only limitation on the Apache Arrow blog is that we must maintain independence and not endorse (or otherwise make subjective comments about) any third-party companies, products, or services.

best,
Wes

On Wed, Jun 12, 2019 at 12:14 PM Lentner, Geoffrey R <glent...@purdue.edu> wrote:

Hi everyone,

I work as a data scientist in the research computing group at Purdue University. Mostly I help research faculty make use of Purdue's supercomputing clusters: scientific software development, consulting on data analysis and data management, holding workshops, etc. I have started suggesting Apache Arrow in a lot of my conversations.

I have had great success using the Plasma store to hold "reference data" that many workers access without any duplication (persistent worker, many tasks) or out-of-core computing (simple workers, lazy loading). This benefits many scenarios (e.g., rendering frames in a POV-based visualization: many frames, single dataset).

Towards the end of June I'll be running a benchmark on one of our clusters (Brown - https://www.rcac.purdue.edu/compute/brown) to test how IPyParallel scales past the 500-node mark (20,000 workers). One of the tasks is to load a dataset from disk and then do something trivial, like computing summary statistics. This stresses both the file system (Lustre) and the network (IB fabric). The comparison is to have every node run its own Plasma store, pre-load the dataset, and have the workers access it via client connections to the store.
Preliminary tests (up to 48 nodes) showed that while loading and parsing a 2 GB dataset took ~20 seconds, computing the statistics on a data structure accessed via Plasma took < 1 s, validating that the compute time is merely that of the summary statistics on an already-loaded dataset. This held clear up to 24 cores per node (i.e., 24 client connections with attempted simultaneous access).

If there is interest, I would be thrilled to write up the results of this test as a blog post for the website, detailing the workflow, where this approach can be applied in different research domains, and performance comparisons.

Cheers,
Geoff

--
Geoffrey Lentner <glent...@purdue.edu>
Data Scientist. ITaP Research Computing. Purdue University.
@PurdueRCAC @GeoffreyLentner
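The load-once, read-many pattern described in the thread can be sketched in a few lines. Plasma itself requires a running `plasma_store` server process, so this self-contained stand-in uses Python's `multiprocessing.shared_memory` from the standard library instead; it illustrates the same idea (one process stages the dataset in shared memory, workers attach by name and compute statistics without copying or re-parsing), not the actual Plasma API. All data and names here are hypothetical, and for brevity the "worker" runs in the same process rather than a separate one.

```python
import numpy as np
from multiprocessing import shared_memory

# Hypothetical "reference dataset" -- in the benchmark this was a
# 2 GB dataset loaded from the Lustre file system.
data = np.arange(1_000_000, dtype=np.float64)

# Stage the data once in a shared-memory segment (Plasma's role on each
# node): pay the load/parse cost a single time.
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
staging = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
staging[:] = data

# A worker attaches to the segment by name (analogous to a Plasma client
# connection) and computes summary statistics -- no copy, no re-parse.
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(data.shape, dtype=data.dtype, buffer=attached.buf)
stats = (view.mean(), view.std())

# Drop the array views before closing, then release the segment.
del staging, view
attached.close()
shm.close()
shm.unlink()
```

With many worker processes per node attaching to the same segment, each one pays only the cost of the statistics themselves, which is consistent with the sub-second compute times reported above.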