Hello all

Sorry to "spam" the list, but we'd really like to get as wide a range of input as possible on features for the next release of PMIx (see below). I haven't attached
Ralph

==========

PMIx 1.0 has now been released. If you haven’t looked at it, I invite you to please do so. I’ve attached the API definitions, and you’ll find more (slightly outdated, I’m afraid) here:

https://github.com/open-mpi/pmix/wiki

As a reminder, the intent behind PMIx is to transparently provide backward compatibility for PMI-1 and PMI-2, while extending the APIs to support advanced capabilities and providing exascale performance. Support by SLURM, ORCM, and other RMs will be coming later this year. I am working right now on completing the embedded support for OpenMPI, and hope to release that in the next week or two - at that time, any job executed via mpirun will have full support for PMIx functions.

I’d like to invite your input for the upcoming v2.0 APIs. Our initial plan is to release 2.0 in time for SC15, with the expectation that we may not have all the features implemented yet - whether we add them during the 2.0 series, or delay some to 3.0 remains TBD.

The initial thought is to focus 2.0 in the following areas - please note that we would deeply appreciate the involvement of each relevant community, so please feel free to forward this note and/or reach out to relevant representatives:

1. Performance improvements

   * dynamic spawn/reap of listening threads to achieve target performance of completing 1000 client connections in < 1 sec

   * shared memory use to reduce memory footprint (Elena has already sent out some thoughts on this)

2. Fault response support

   We currently provide application notification of faults (existing and impending) that includes information on the impacted processes. However, the response is currently limited to calling PMIx_Abort - i.e., the app can take internal action, but the only request it can make of the RM is to abort. We do allow for abort of specific procs as opposed to the entire job, but we’d like to support a broader set of options. For example, the app might request a coordinated checkpoint, ask for replacement nodes to be allocated, or request immediate restart at a reduced size.

3. File system support

   We would like to begin supporting file positioning directives - e.g., hot/warm/cold data movement, persistence requests to maintain files and/or shared memory regions across job steps, and burst buffer management.

4. Network/fabric support

   The existing notification capability can be used to notify of network issues. However, there has been interest expressed in further interactions that would allow an application to specify quality of service and security requirements, request information on network topology, etc.

5. Power directives

   On very large scale systems, it is expected that some form of power management will be required or desired. Most of that happens at allocation request time, but there may be some possible directives an app could want to pass during execution. We’re open to suggestion.

6. Workflow support

   We have the "spawn" support in PMIx 1.0, but that was designed expressly for support of MPI applications. Other programming models may require different or additional support. PMIx is intended to support a wide range of models, and we'd welcome input on how workflows can be better supported.

Any other topics of interest are always welcome!

Ralph