Thanks Fabian and Andrew for the responses. Fabian - Yes, that is what I was afraid of. Flink seems perfect for batch processing a pipeline. In OpenRefine, we work with finite datasets and just want an easier way to have distributed data storage for when our users want to work with very large finite datasets.
Andrew - Apache Zeppelin performs some of the same magic that OpenRefine does, but it is more focused on exploratory analysis. It leverages some of the same technology that we are also looking at in more detail to see where and how it fits with OpenRefine. I especially like the idea of Apache YARN's NodeManager, and also Apache Spark's data access through the DataFrame API and SQL against data sources like Avro and Parquet. That is where both Jacky and I see perhaps the most alignment with OpenRefine: giving our users an alternative storage / compute option for handling datasets bigger than what currently fits in memory with OpenRefine.

Any other thoughts, ideas, or pros/cons from anyone about anything I mentioned?

-Thad
+ThadGuidry <https://www.google.com/+ThadGuidry>