Okay cool. I’ll keep those things in mind as well. Thanks. Geoff
On Jun 12, 2019, at 1:22 PM, Wes McKinney <wesmck...@gmail.com> wrote:

hi Geoff,

It's great to hear about your benchmarking results. If you'd like to submit a pull request to the project with a blog post showcasing your results, please be our guest (the blog content is all under the site/ directory; let us know if you have any issues). The only limitation on the Apache Arrow blog is that we must maintain independence and not endorse (or otherwise make subjective comments about) any third-party companies, products, or services.

best,
Wes

On Wed, Jun 12, 2019 at 12:14 PM Lentner, Geoffrey R <glent...@purdue.edu> wrote:

Hi everyone,

I work as a data scientist in the research computing group at Purdue University. Mostly I help research faculty make use of Purdue's supercomputing clusters: scientific software development, consulting on data analysis and data management, holding workshops, etc. I have started suggesting Apache Arrow in a lot of my conversations.

I have had great success using the Plasma store to hold "reference data" that many workers access without any duplication (persistent worker, many tasks) or out-of-core computing (simple workers, lazy loading). This benefits many scenarios (e.g., rendering frames in a POV-based visualization: many frames, single dataset).

Towards the end of June I'll be running a benchmark on one of our clusters (Brown - https://www.rcac.purdue.edu/compute/brown) to test how IPyParallel scales past the 500-node mark (20,000 workers). One of the tasks is to load a dataset from disk and then do something trivial, like computing summary statistics. This stresses both the file system (Lustre) and the network (IB fabric). The comparison is to have every node run its own Plasma store, pre-load the dataset, and have the workers access it via client connections to the store.
Preliminary tests (up to 48 nodes) showed that while loading and parsing a 2 GB dataset took ~20 seconds, computing the statistics on a data structure accessed via Plasma took < 1 s, validating that the compute time is merely that of the summary statistics on an already-loaded dataset. This held clear up to 24 cores per node (i.e., 24 client connections with attempted simultaneous access).

If there is interest, I would be thrilled to write up the results of this test as a blog post for the website, detailing the workflow, where this approach can be applied in different research domains, and performance comparisons.

Cheers,
Geoff

--
Geoffrey Lentner <glent...@purdue.edu>
Data Scientist. ITaP Research Computing. Purdue University.
@PurdueRCAC @GeoffreyLentner
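The load-once, read-many pattern described in the thread can be sketched in a few lines. Plasma itself requires a running `plasma_store` server process, so this self-contained stand-in uses Python's `multiprocessing.shared_memory` from the standard library instead; it illustrates the same idea (one process stages the dataset in shared memory, workers attach by name and compute statistics without copying or re-parsing), not the actual Plasma API. All data and names here are hypothetical, and for brevity the "worker" runs in the same process rather than a separate one.

```python
import numpy as np
from multiprocessing import shared_memory

# Hypothetical "reference dataset" -- in the benchmark this was a
# 2 GB dataset loaded from the Lustre file system.
data = np.arange(1_000_000, dtype=np.float64)

# Stage the data once in a shared-memory segment (Plasma's role on each
# node): pay the load/parse cost a single time.
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
staging = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
staging[:] = data

# A worker attaches to the segment by name (analogous to a Plasma client
# connection) and computes summary statistics -- no copy, no re-parse.
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(data.shape, dtype=data.dtype, buffer=attached.buf)
stats = (view.mean(), view.std())

# Drop the array views before closing, then release the segment.
del staging, view
attached.close()
shm.close()
shm.unlink()
```

With many worker processes per node attaching to the same segment, each one pays only the cost of the statistics themselves, which is consistent with the sub-second compute times reported above.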