Thanks for the reply JC, I really appreciate it. I can't force our users to use antiquated stuff like scripts, let alone command-line tools, but I'll simply use SparkLauncher, and your comment about the main class doing Pipeline.run() on the Master is something I can work with... somewhat. The execution results, metrics and all that are handled on the Master, I guess. Over time I'll figure out a way to report the metrics and results from the master back to the client. I've done similar things with Map/Reduce in the past.
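
For what it's worth, here is a minimal sketch of what I have in mind on the client side. The jar path, main class name and master URL are just placeholders for whatever Kettle ends up generating, not real names:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Sketch: launch a pre-built pipeline jar from the client with SparkLauncher
// instead of the spark-submit script. All paths/names below are placeholders.
public class LaunchBeamOnSpark {
  public static void main(String[] args) throws Exception {
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/kettle-beam-pipeline.jar") // fat jar with the pipeline
        .setMainClass("org.example.MyPipelineMain")          // class that calls Pipeline.run()
        .setMaster("spark://spark-master:7077")
        .setDeployMode("cluster")                            // run the driver inside the cluster
        .setAppName("kettle-beam-pipeline")
        .startApplication();

    // The handle only reports coarse application state (SUBMITTED, RUNNING,
    // FINISHED, ...), not Beam metrics; those stay with the driver process
    // that called Pipeline.run().
    handle.addListener(new SparkAppHandle.Listener() {
      @Override public void stateChanged(SparkAppHandle h) {
        System.out.println("Spark app state: " + h.getState());
      }
      @Override public void infoChanged(SparkAppHandle h) { }
    });

    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);
    }
  }
}

So if I understand it correctly, the SparkAppHandle gives me job state but the metrics reporting will still have to come from the driver side somehow.
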
Looking around I see that the same conditions apply for Flink. Is this because Spark and Flink lack the APIs to talk to a client about the state of workloads, unlike DataFlow and the Direct Runner?

Thanks!

Matt
---
Matt Casters <mcast...@pentaho.org> <mattcast...@gmail.com>
Senior Solution Architect, Kettle Project Founder


On Thu, 17 Jan 2019 at 15:30, Juan Carlos Garcia <jcgarc...@gmail.com> wrote:

> Hi Matt, during the time we were using Spark with Beam, the solution was
> always to pack the jar and use the spark-submit command pointing to your
> main class, which will do `pipeline.run`.
>
> The spark-submit command has a flag to decide how to run it
> (--deploy-mode): whether to launch the job on the driver machine or on one
> of the machines in the cluster.
>
>
> JC
>
>
> On Thu, Jan 17, 2019 at 10:00 AM Matt Casters <mattcast...@gmail.com>
> wrote:
>
>> Dear Beam friends,
>>
>> Now that I've got cool data integration (Kettle-Beam) scenarios running
>> on DataFlow with sample data sets in Google (Files, Pub/Sub, BigQuery,
>> Streaming, Windowing, ...) I thought it was time to also give Apache Spark
>> some attention.
>>
>> The thing I have some trouble with is figuring out what the relationship
>> is between the runner (SparkRunner), Pipeline.run() and spark-submit (or
>> SparkLauncher).
>>
>> The samples I'm seeing mostly involve packaging up a jar file and then
>> doing a spark-submit. That makes it unclear whether Pipeline.run()
>> should be used at all and how Metrics should be obtained from a Spark job
>> during execution or after completion.
>>
>> I really like the way the GCP DataFlow implementation automatically
>> deploys jar file binaries, and from what I can
>> determine org.apache.spark.launcher.SparkLauncher offers this functionality,
>> so perhaps I'm doing something wrong, or I'm reading the docs wrong, or
>> the wrong docs.
>> The thing is, if you try running your pipelines against a Spark master,
>> feedback is really minimal, putting you in a trial & error situation pretty
>> quickly.
>>
>> So thanks again in advance for any help!
>>
>> Cheers,
>>
>> Matt
>> ---
>> Matt Casters <mcast...@pentaho.org> <mattcast...@gmail.com>
>> Senior Solution Architect, Kettle Project Founder
>>
>
> --
>
> JC
>
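
PS: to make sure I read your suggestion correctly, JC, the main class that spark-submit (or SparkLauncher) points at would look roughly like the sketch below. The class name is a placeholder and the actual pipeline construction is elided:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Sketch of the driver main class that the submitted jar points at.
public class MyPipelineMain {
  public static void main(String[] args) {
    // Pass --runner=SparkRunner (and any other Beam options) as app arguments.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the actual pipeline here ...

    PipelineResult result = pipeline.run();   // runs on the Spark driver
    result.waitUntilFinish();

    // The metrics live with this driver process; query them here and ship
    // them back to the client by whatever channel fits (file, queue, HTTP).
    MetricQueryResults metrics =
        result.metrics().queryMetrics(MetricsFilter.builder().build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      System.out.println(counter.getName() + " = " + counter.getAttempted());
    }
  }
}

The spark-submit route would then be something like "spark-submit --deploy-mode cluster --class org.example.MyPipelineMain kettle-beam-pipeline.jar --runner=SparkRunner", while SparkLauncher would just do the same thing programmatically.
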