Thanks Fabian and Andrew for the responses.

Fabian - Yes, that is what I was afraid of. Flink seems perfect for batch
processing a pipeline. In OpenRefine, we work with finite datasets and
just want an easier way to get distributed data storage when our users
want to work with very large finite datasets.

Andrew - Apache Zeppelin performs some of the same magic that OpenRefine
does, but is more focused on exploratory analysis. It leverages some of the
same technology that we are also looking at in more detail, to see where
and how it fits with OpenRefine.

I especially like the idea of Apache YARN's NodeManager, and also Apache
Spark's data access through the DataFrame API and SQL against data sources
like Avro and Parquet. That is where both Jacky and I see perhaps the most
alignment with OpenRefine: giving our users an alternative storage /
compute option for handling datasets bigger than what currently fits in
memory with OpenRefine.

Any other thoughts, ideas, or pros/cons from anyone about anything I
mentioned?

-Thad
+ThadGuidry <https://www.google.com/+ThadGuidry>
