New open-access research published in the journal Parallel Computing demonstrates a novel approach to engineering analytics for deployment in both streaming and batch contexts.
Increasing numbers of users are extracting real value from their data using tools like IBM InfoSphere Streams for near-real-time analysis and Apache Spark across their historical data. Until now, there hasn't been an approach that permits the use of these tools from a single shared codebase, with deployment considerations reserved until deployment time. Furthermore, it has been even harder to permit this unified analysis while maintaining cell-level traces of the security heritage for each datum an analytic produces.

Some highlights of the paper include:

- A domain-specific language (CRUCIBLE) and runtime models for on- and off-line data analytics.
- Detailed analysis of CRUCIBLE’s runtime performance in state-of-the-art environments.
- Development and detailed analysis of a set of runtime models for new environments.
- Performance comparison with native implementations and discussion of optimisation steps.
- Formulation of a primitive in the DSL that permits an analytic to be run over multiple data sources.

The paper, "Towards Unified Secure On- and Off-line Analytics at Scale", is available free of charge from Elsevier:

http://www.sciencedirect.com/science/article/pii/S0167819114000842

I am one of the lead authors of the work, and would be more than happy to discuss any aspects which catch your attention!

Peter

--
Peter Coetzee
Performance Computing and Visualisation
PhD Candidate
Department of Computer Science
University of Warwick
