Hi Juan,

Of course! My prototype is here:
https://github.com/OpenRefine/OpenRefine/tree/spark-prototype
I suspect it can be quite hard for you to jump into the code at this
stage of the project, but here are some concise pointers:

The or-spark module contains the Spark-based implementation of our
datamodel. The tasks themselves are generated by the application code
(in the "main" module).

You can try the prototype as a user (clone the repo, check out the
branch and hit ./refine). If you import a small CSV file via the
Clipboard pane, you can then run a few operations on it and observe the
tasks in Spark's web UI.

I would be happy to give you any additional pointers (perhaps
off-list?) if you want to have a close look.

One general question I have for the list is: do you have a good way to
inspect and optimize the serialization of tasks?

Thank you so much for all your help so far!

Antonin

On 04/07/2020 19:19, Juan Martín Guillén wrote:
> Would you be able to send the code you are running?
> That would be great if you included some sample data.
> Is that possible?
>
> On Saturday, 4 July 2020 at 13:09:23 ART, Antonin Delpeuch (lists)
> <li...@antonin.delpeuch.eu> wrote:
>
> Hi Stephen and Juan,
>
> Thanks both for your replies - you are right, I used the wrong
> terminology! The local mode is what fits our needs best (and what I
> have been benchmarking so far).
>
> That being said, the problems I mention still apply in this context.
> There is still a serialization overhead (which can be observed from
> the web UI), and it is really noticeable as a user.
>
> For instance, to display the paginated grid in the tool's UI, I need
> to run a simple job (filterByRange), and Spark's own overheads account
> for about half of the overall execution time.
>
> Intuitively, when running in local mode there should not be any need
> to serialize tasks to pass them between threads, so that is what I am
> trying to eliminate.
>
> Regards,
> Antonin
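(To make the filterByRange job I mention above concrete, here is
roughly its shape - a simplified sketch with illustrative names, not
the actual prototype code, assuming the grid is exposed as a Java pair
RDD keyed by row index and that the pair RDD exposes filterByRange the
way the Scala OrderedRDDFunctions API does:

    // Simplified sketch (not the actual prototype code): fetch one page of
    // rows for the grid view from a pair RDD keyed by row index.
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class GridPage {
        // Returns rows with indices in [start, start + limit), assuming the
        // RDD is sorted by key so filterByRange can prune partitions.
        static <V> List<Tuple2<Long, V>> getPage(JavaPairRDD<Long, V> grid,
                                                 long start, long limit) {
            return grid
                .filterByRange(start, start + limit - 1) // bounds are inclusive
                .collect();                              // one page of rows only
        }
    }

The result is a single page of rows, so the data volume is tiny; almost
all of the time goes into scheduling and task serialization.)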
> On 04/07/2020 17:49, Juan Martín Guillén wrote:
>> Hi Antonin.
>>
>> It seems you are confusing Standalone with Local mode. They are 2
>> different modes.
>>
>> From the Spark in Action book: "In local mode, there is only one
>> executor in the same client JVM as the driver, but this executor can
>> spawn several threads to run tasks. In local mode, Spark uses your
>> client process as the single executor in the cluster, and the number
>> of threads specified determines how many tasks can be executed in
>> parallel."
>>
>> I am pretty sure this is the mode your use case is more suited to.
>>
>> What you are referring to, I think, is running a Standalone cluster
>> locally, something that does not make much sense resource-wise and
>> may be considered only for testing purposes.
>>
>> Running Spark in Local mode is totally fine and supported for
>> non-cluster (local) environments.
>>
>> Here are the options you have for connecting your Spark application:
>> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
>>
>> Regards,
>> Juan Martín.
>>
>> On Saturday, 4 July 2020 at 12:17:01 ART, Antonin Delpeuch (lists)
>> <li...@antonin.delpeuch.eu> wrote:
>>
>> Hi,
>>
>> I am working on revamping the architecture of OpenRefine, an ETL
>> tool, to execute workflows on datasets which do not fit in RAM.
>>
>> Spark's RDD API is a great fit for the tool's operations, and
>> provides everything we need: partitioning and lazy evaluation.
>>
>> However, OpenRefine is a lightweight tool that runs locally, on the
>> users' machine, and we want to preserve this use case.
>>
>> Running Spark in standalone mode works, but I have read in a couple
>> of places that the standalone mode is only intended for development
>> and testing. This is confirmed by my experience with it so far:
>> - the overhead added by task serialization and scheduling is
>> significant even in standalone mode. This makes sense for testing,
>> since you want to test serialization as well, but to run Spark in
>> production locally, we would need to bypass serialization, which is
>> not possible as far as I know;
>> - some bugs that manifest themselves only in local mode are not
>> getting a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300)
>> so it seems dangerous to base a production system on standalone Spark.
>>
>> So, we cannot use Spark as the default runner in the tool. Do you
>> know of any alternative which would be designed for local use? A
>> library which would provide something similar to the RDD API, but
>> for parallelization with threads in the same JVM, not machines in a
>> cluster?
>>
>> If there is no such thing, it should not be too hard to write our own
>> homegrown implementation, which would basically be Java streams with
>> partitioning. I have looked at Apache Beam's direct runner, but it is
>> also designed for testing, so it does not fit the bill for the same
>> reasons.
>>
>> We plan to offer a Spark-based runner in any case - but I do not
>> think it can be used as the default runner.
>>
>> Cheers,
>> Antonin
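(And to illustrate what I mean by "Java streams with partitioning" in
my original message above, here is a rough, eager sketch with made-up
names - nothing more than the general idea:

    // Rough sketch (illustrative only): rows held in in-memory partitions and
    // transformed with plain Java streams, in parallel threads of the same
    // JVM, so no task serialization is involved. A real implementation would
    // need lazy evaluation and on-disk partitions for data that does not fit
    // in RAM.
    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    class PartitionedList<T> {
        private final List<List<T>> partitions;

        PartitionedList(List<List<T>> partitions) {
            this.partitions = partitions;
        }

        // Apply f to every element, processing partitions in parallel.
        <U> PartitionedList<U> map(Function<T, U> f) {
            List<List<U>> mapped = partitions.parallelStream()
                    .map(p -> p.stream().map(f).collect(Collectors.toList()))
                    .collect(Collectors.toList());
            return new PartitionedList<>(mapped);
        }

        // Count the elements across all partitions.
        long count() {
            return partitions.parallelStream().mapToLong(List::size).sum();
        }
    }

Everything stays in one JVM and is parallelized with threads, which is
exactly the overhead-free behaviour I am after.)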