Re: OpenRefine is thinking about using Apache Flink

Fabian Hueske Wed, 14 Jun 2017 00:48:07 -0700

Hi Thad,

I'm not familiar with the internals of OpenRefine, but I would assume that
users can apply ad-hoc / exploratory transformations on data which is
loaded in memory (please correct me if my assumption is wrong).


Flink keeps data in memory to process data in motion efficiently (either
streaming data or pipelined batch data). However, this always happens in
the context of a predefined job that is being executed. It is not possible
to load data into memory without knowing in advance what to do with it. If
my understanding of OpenRefine is correct, Flink does not seem to be a
good fit for your requirements.
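
To make the distinction concrete, here is a rough sketch in plain Python.
It is not Flink or OpenRefine code; the names run_job and Workspace are
hypothetical and only illustrate the two execution styles, assuming rows
are simple dictionaries.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]
Step = Callable[[Iterator[Row]], Iterator[Row]]


def run_job(source: Iterable[Row], steps: List[Step]) -> List[Row]:
    """Job style: the full list of steps is fixed before any row is read."""
    stream = iter(source)
    for step in steps:
        stream = step(stream)  # chain operators lazily; nothing runs yet
    return list(stream)        # rows are pulled through the pipeline once


class Workspace:
    """Interactive style: rows stay resident and edits are chosen later."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows: List[Row] = list(rows)
        self._history: List[List[Row]] = []

    def apply(self, fn: Callable[[Row], Row]) -> None:
        self._history.append(self.rows)         # remember state for undo
        self.rows = [fn(row) for row in self.rows]

    def undo(self) -> None:
        if self._history:
            self.rows = self._history.pop()     # restore previous state


rows = [{"name": " ada "}, {"name": "bob"}]

# Job style: both steps must be declared up front.
steps: List[Step] = [
    lambda s: (dict(r, name=r["name"].strip()) for r in s),
    lambda s: (r for r in s if r["name"] != "bob"),
]
job_result = run_job(rows, steps)

# Interactive style: load first, decide on each edit afterwards.
ws = Workspace(rows)
ws.apply(lambda r: dict(r, name=r["name"].upper()))
upper_first = ws.rows[0]["name"]
ws.undo()
```

In the first style the data only lives in memory while it flows through
the declared steps; in the second it has to stay resident between
arbitrary, unplanned edits, which is the part that does not map well onto
a predefined job.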

Please let me know if you have further questions.

Best, Fabian

2017-06-07 5:10 GMT+02:00 Thad Guidry <thadgui...@gmail.com>:

> Hello Community !
>
> I'm a contributor to OpenRefine. You might have known about us previously
> as Google Refine. :) We are thinking of offering an alternative
> compute/storage engine in addition to our existing one, developed by
> Stefano Mazzocchi of Apache Cocoon fame. :) We need some insight from this
> community.
> The data is loaded into memory, and the user's workspace where the data
> lives is saved to disk occasionally or upon project closing.
> https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/ProjectManager.java
>
> We have the concept of Undo and Redo within OpenRefine as well, and this
> seems perhaps to be equivalent to Flink's savepoints.
> Besides Flink, we are thinking that Apache Spark might also work, but we
> are unsure, and are looking at what best aligns with our current
> in-memory data store model.
>
> We also have a cross() function which is similar to CoGroup in Flink.
>
> Thoughts ?
> -Thad
> +ThadGuidry <https://www.google.com/+ThadGuidry>
>
