Re: OpenRefine is thinking about using Apache Flink

Andrew Psaltis Wed, 14 Jun 2017 05:54:18 -0700

Hi Thad,
I am not sure if this would work for OpenRefine, but could you follow the
model that is used for Apache Zeppelin? Granted, OpenRefine does not have
the notion of an interpreter, and Zeppelin does not hold all of the data in
memory. However, you may be able to combine that type of idea with some of
the ideas of a job server such as Livy and pull it off. At first blush it
seems like you are going to have to dynamically generate code for the tasks
a user wants to do. However, you may be able to cover all the data
wrangling needs a user has and allow them to use a dataset that cannot fit
in memory or on local disk.
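
A rough sketch of the "generate the job from what the user did" idea, in
plain Python. Everything here is hypothetical: `Operation`, `Project`, and
their methods are illustrative names, not OpenRefine, Zeppelin, Livy, or
Flink APIs. The user's steps are recorded as a list, the list is turned into
a plan on demand, and rows are streamed through it rather than held in
memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


@dataclass
class Operation:
    """One user-requested transformation on a single column (hypothetical)."""
    name: str
    column: str
    fn: Callable[[str], str]


@dataclass
class Project:
    """Records operations instead of applying them eagerly (hypothetical)."""
    history: List[Operation] = field(default_factory=list)
    position: int = 0  # how many recorded operations are currently active

    def apply(self, op: Operation) -> None:
        # A new operation discards any steps that were undone before it.
        del self.history[self.position:]
        self.history.append(op)
        self.position += 1

    def undo(self) -> None:
        if self.position > 0:
            self.position -= 1

    def redo(self) -> None:
        if self.position < len(self.history):
            self.position += 1

    def plan(self) -> List[str]:
        # Description of the job that would be handed to a job server.
        return [f"{op.name}({op.column})" for op in self.history[:self.position]]

    def run(self, rows: Iterable[Row]) -> Iterator[Row]:
        # Stand-in for remote execution: rows are processed one at a time,
        # so the full dataset never has to fit in memory.
        active = self.history[:self.position]
        for row in rows:
            out = dict(row)
            for op in active:
                out[op.column] = op.fn(out[op.column])
            yield out
```

In this sketch, undo and redo only move a pointer over the recorded steps,
and the job is rebuilt from whatever steps are active when it is submitted.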


Thanks,
Andrew

On Wed, Jun 14, 2017 at 9:47 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Thad,
>
> I'm not familiar with the internals of OpenRefine, but I would assume that
> users can apply ad-hoc / exploratory transformations on data which is
> loaded in memory (please correct me if my assumption is wrong).
>
> Flink stores data in memory for efficient processing of data in motion
> (either streaming data or pipelined batch data). However, this always
> happens in the context of a predefined job that is executed. It is not
> possible to load data in memory without knowing what to do next. If my
> understanding of OpenRefine is correct, Flink does not seem to be a good
> fit for your requirements.
>
> Please let me know if you have further questions.
>
> Best, Fabian
>
> 2017-06-07 5:10 GMT+02:00 Thad Guidry <thadgui...@gmail.com>:
>
> > Hello Community !
> >
> > I'm a contributor to OpenRefine. You might have known about us previously
> > as Google Refine. :) We are thinking of offering an alternative
> > compute/storage engine in addition to our existing one, developed by
> > Stefano Mazzocchi of Apache Cocoon fame. :) We need some insight from
> > this community.
> > The data is loaded into memory, and the user's workspace where the data
> > lives is saved to disk occasionally or upon project closing.
> > https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/ProjectManager.java
> >
> > We have the concept of Undo and Redo within OpenRefine as well, and this
> > seems perhaps to be equivalent to Flink's savepoints.
> > Besides Flink, we are thinking that Apache Spark might also work?
> > But we are unsure, and are looking at what best aligns with the current
> > in-memory data store model that we have.
> >
> > We also have a cross() function which is similar to CoGroup in Flink.
> >
> > Thoughts ?
> > -Thad
> > +ThadGuidry <https://www.google.com/+ThadGuidry>
> >
>



-- 
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
