Hi Juan,

Of course! My prototype is here:
https://github.com/OpenRefine/OpenRefine/tree/spark-prototype
I suspect it can be quite hard for you to jump into the code at this
stage of the project, but here are some concise pointers:

The or-spark module contains the Spark-based implementation of our
datamodel. The tasks themselves are generated by the application code
(in the "main" module).

You can try the prototype as a user (clone the repo, check out the
branch and hit ./refine). If you import a small CSV file via the
Clipboard pane, you can then run a few operations on it and observe the
tasks in Spark's web UI.

I would be happy to give you any additional pointers (perhaps
off-list?) if you want to have a close look.

One general question I have for the list is: do you have a good way to
inspect and optimize the serialization of tasks?

Thank you so much for all your help so far!

Antonin

On 04/07/2020 19:19, Juan Martín Guillén wrote:
> Would you be able to send the code you are running?
> That would be great if you included some sample data.
> Is that possible?
>
> On Saturday, 4 July 2020 at 13:09:23 ART, Antonin Delpeuch (lists)
> <li...@antonin.delpeuch.eu> wrote:
>
> Hi Stephen and Juan,
>
> Thanks both for your replies - you are right, I used the wrong
> terminology! The local mode is what fits our needs best (and what I
> have been benchmarking so far).
>
> That being said, the problems I mention still apply in this context.
> There is still a serialization overhead (which can be observed from
> the web UI), and it is really noticeable as a user.
>
> For instance, to display the paginated grid in the tool's UI, I need
> to run a simple job (filterByRange), and Spark's own overheads account
> for about half of the overall execution time.
>
> Intuitively, when running in local mode there should not be any need
> to serialize tasks to pass them between threads, so that is what I am
> trying to eliminate.
>
> Regards,
> Antonin
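(To make the filterByRange job I mention above concrete, here is
roughly its shape - a simplified sketch with illustrative names, not
the actual prototype code, assuming the grid is exposed as a Java pair
RDD keyed by row index and that the pair RDD exposes filterByRange the
way the Scala OrderedRDDFunctions API does:

    // Simplified sketch (not the actual prototype code): fetch one page of
    // rows for the grid view from a pair RDD keyed by row index.
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class GridPage {
        // Returns rows with indices in [start, start + limit), assuming the
        // RDD is sorted by key so filterByRange can prune partitions.
        static <V> List<Tuple2<Long, V>> getPage(JavaPairRDD<Long, V> grid,
                                                 long start, long limit) {
            return grid
                .filterByRange(start, start + limit - 1) // bounds are inclusive
                .collect();                              // one page of rows only
        }
    }

The result is a single page of rows, so the data volume is tiny; almost
all of the time goes into scheduling and task serialization.)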
> On 04/07/2020 17:49, Juan Martín Guillén wrote:
>> Hi Antonin.
>>
>> It seems you are confusing Standalone with Local mode. They are 2
>> different modes.
>>
>> From the Spark in Action book: "In local mode, there is only one
>> executor in the same client JVM as the driver, but this executor can
>> spawn several threads to run tasks. In local mode, Spark uses your
>> client process as the single executor in the cluster, and the number
>> of threads specified determines how many tasks can be executed in
>> parallel."
>>
>> I am pretty sure this is the mode your use case is more suited to.
>>
>> What you are referring to, I think, is running a Standalone cluster
>> locally, something that does not make much sense resource-wise and
>> may be considered only for testing purposes.
>>
>> Running Spark in Local mode is totally fine and supported for
>> non-cluster (local) environments.
>>
>> Here are the options you have for connecting your Spark application:
>> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
>>
>> Regards,
>> Juan Martín.
>>
>> On Saturday, 4 July 2020 at 12:17:01 ART, Antonin Delpeuch (lists)
>> <li...@antonin.delpeuch.eu> wrote:
>>
>> Hi,
>>
>> I am working on revamping the architecture of OpenRefine, an ETL
>> tool, to execute workflows on datasets which do not fit in RAM.
>>
>> Spark's RDD API is a great fit for the tool's operations, and
>> provides everything we need: partitioning and lazy evaluation.
>>
>> However, OpenRefine is a lightweight tool that runs locally, on the
>> users' machine, and we want to preserve this use case.
>>
>> Running Spark in standalone mode works, but I have read in a couple
>> of places that the standalone mode is only intended for development
>> and testing. This is confirmed by my experience with it so far:
>> - the overhead added by task serialization and scheduling is
>> significant even in standalone mode. This makes sense for testing,
>> since you want to test serialization as well, but to run Spark in
>> production locally, we would need to bypass serialization, which is
>> not possible as far as I know;
>> - some bugs that manifest themselves only in local mode are not
>> getting a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300)
>> so it seems dangerous to base a production system on standalone Spark.
>>
>> So, we cannot use Spark as the default runner in the tool. Do you
>> know of any alternative which would be designed for local use? A
>> library which would provide something similar to the RDD API, but
>> for parallelization with threads in the same JVM, not machines in a
>> cluster?
>>
>> If there is no such thing, it should not be too hard to write our own
>> homegrown implementation, which would basically be Java streams with
>> partitioning. I have looked at Apache Beam's direct runner, but it is
>> also designed for testing, so it does not fit the bill for the same
>> reasons.
>>
>> We plan to offer a Spark-based runner in any case - but I do not
>> think it can be used as the default runner.
>>
>> Cheers,
>> Antonin
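(And to illustrate what I mean by "Java streams with partitioning" in
my original message above, here is a rough, eager sketch with made-up
names - nothing more than the general idea:

    // Rough sketch (illustrative only): rows held in in-memory partitions and
    // transformed with plain Java streams, in parallel threads of the same
    // JVM, so no task serialization is involved. A real implementation would
    // need lazy evaluation and on-disk partitions for data that does not fit
    // in RAM.
    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    class PartitionedList<T> {
        private final List<List<T>> partitions;

        PartitionedList(List<List<T>> partitions) {
            this.partitions = partitions;
        }

        // Apply f to every element, processing partitions in parallel.
        <U> PartitionedList<U> map(Function<T, U> f) {
            List<List<U>> mapped = partitions.parallelStream()
                    .map(p -> p.stream().map(f).collect(Collectors.toList()))
                    .collect(Collectors.toList());
            return new PartitionedList<>(mapped);
        }

        // Count the elements across all partitions.
        long count() {
            return partitions.parallelStream().mapToLong(List::size).sum();
        }
    }

Everything stays in one JVM and is parallelized with threads, which is
exactly the overhead-free behaviour I am after.)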