Re: OpenRefine is thinking about using Apache Flink

2017-06-19 Thread Fabian Hueske
Flink is not really well suited for interactive / ad-hoc processing. What could work is to use some local tool to identify the transformation rules and then apply them with Flink to a large data set. But that's probably not what you are looking for, right? Best, Fabian 2017-06-15 3:11 GMT+02:00 qi cui
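
The split Fabian describes (find the transformation rules interactively on a small local sample, then replay the same rules over the full data set) might look roughly like the sketch below. It is plain Python and does not use Flink; the rule helpers `trim` and `uppercase`, the `city` column, and `apply_rules` are hypothetical names chosen for illustration, not OpenRefine or Flink APIs. In a real setup the replay step would presumably be a batch job reading the recorded rules.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]
Rule = Callable[[Row], Row]


def trim(column: str) -> Rule:
    """Hypothetical rule: strip surrounding whitespace from one column."""
    def rule(row: Row) -> Row:
        return {**row, column: row[column].strip()}
    return rule


def uppercase(column: str) -> Rule:
    """Hypothetical rule: upper-case one column."""
    def rule(row: Row) -> Row:
        return {**row, column: row[column].upper()}
    return rule


def apply_rules(rows: Iterable[Row], rules: List[Rule]) -> Iterator[Row]:
    """Apply every rule, in order, to each row; rows are processed lazily."""
    for row in rows:
        for rule in rules:
            row = rule(row)
        yield row


# Step 1: explore interactively on a small local sample and record the rules.
rules = [trim("city"), uppercase("city")]
sample = [{"city": "  berlin "}]
preview = list(apply_rules(sample, rules))

# Step 2: replay the recorded rules over the full data set (here a generator
# standing in for a large input that is never held in memory all at once).
full_data = ({"city": name} for name in [" paris", "rome  "])
cleaned = list(apply_rules(full_data, rules))
```

Because the rules are an ordered list of pure functions, the same list can be previewed on a sample and later handed to whatever engine processes the large data set.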

Re: OpenRefine is thinking about using Apache Flink

2017-06-16 Thread qi cui
Hi Andrew, It would be great if you could come up with something to show the idea. There are lots of wiki pages on GitHub you can refer to (including the server-side architecture and client-side architecture). The unique feature of OpenRefine is its ability to let the user interact with t

Re: OpenRefine is thinking about using Apache Flink

2017-06-14 Thread Thad Guidry
Thanks, Andrew! That would be fantastic! Even if you're not successful with the trivial use case, just having a look at our source code and providing your comments or thoughts in our code on a forked branch as you explore and investigate would be tremendously useful to us! -Thad +ThadGuidry

Re: OpenRefine is thinking about using Apache Flink

2017-06-14 Thread Andrew Psaltis
Thad, Based on your description that OpenRefine uses techniques similar to Zeppelin's, I *think* the reading and writing will work. I am fuzzy on the Undo/Redo. I will try over the next couple of days and see if I can make something like this work (at least a trivial use case). Personally I th

Re: OpenRefine is thinking about using Apache Flink

2017-06-14 Thread Thad Guidry
Andrew, So your idea is that Flink could be used as a storage abstraction layer for OpenRefine? Where OpenRefine would use TableSources for reading and TableSinks for writing? And would that still work with our concept of Undo/Redo in OpenRefine to use Flink's Savepoints in concert with TableSou

Re: OpenRefine is thinking about using Apache Flink

2017-06-14 Thread Andrew Psaltis
Thad, For something comparable to the Spark DataFrame / SQL, you may be able to build Avro and/or Parquet TableSources [1] and TableSinks [2] for Flink. The CSVTableSource is here [3]. Then you should be able to have a comparable experience. [1] https://ci.apache.org/projects/flink/flin

Re: OpenRefine is thinking about using Apache Flink

2017-06-14 Thread Thad Guidry
Thanks, Fabian and Andrew, for the responses. Fabian: yes, that is what I was afraid of. Flink seems perfect for batch processing a pipeline. In OpenRefine, we work with finite datasets and just want an easier way to have distributed data storage for when our users want to work with very large fin

Re: OpenRefine is thinking about using Apache Flink

2017-06-14 Thread Andrew Psaltis
Hi Thad, I am not sure if this would work for OpenRefine, but could you follow the model that is used for Apache Zeppelin? Granted, OpenRefine does not have the notion of an interpreter, and Zeppelin is not holding all of the data in memory. However, you may be able to take that type of idea and

Re: OpenRefine is thinking about using Apache Flink

2017-06-14 Thread Fabian Hueske
Hi Thad, I'm not familiar with the internals of OpenRefine, but I would assume that users can apply ad-hoc / exploratory transformations on data which is loaded in memory (please correct me if my assumption is wrong). Flink stores data in memory for efficient processing of data in motion (either

OpenRefine is thinking about using Apache Flink

2017-06-07 Thread Thad Guidry
Hello Community! I'm a contributor to OpenRefine. You might have known about us previously as Google Refine. :) We are thinking of offering an alternative compute/storage engine in addition to our existing one, developed by Stefano Mazzocchi of Apache Cocoon fame. :) We need some insight fr