Hi Matt,

With Flink you will be able to launch your pipeline just by invoking the main
method of your main class; however, it will run as a standalone process and
you will not have the advantage of distributed computation.
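
As a quick sketch (the class name and the trivial transform are placeholders,
and it assumes the beam-runners-flink dependency is on the classpath), such a
main class could look like this:

  import org.apache.beam.runners.flink.FlinkPipelineOptions;
  import org.apache.beam.runners.flink.FlinkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;

  public class MyFlinkPipeline {
    public static void main(String[] args) {
      // Without a --flinkMaster argument pointing at a real cluster, the
      // runner uses an embedded Flink instance inside this JVM, which is
      // why the execution stays a single standalone process.
      FlinkPipelineOptions options =
          PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
      options.setRunner(FlinkRunner.class);

      Pipeline pipeline = Pipeline.create(options);
      pipeline.apply(Create.of("hello", "beam"));  // placeholder transform
      pipeline.run().waitUntilFinish();
    }
  }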

On Fri, 18 Jan 2019 at 09:37, Matt Casters <mattcast...@gmail.com>
wrote:

> Thanks for the reply JC, I really appreciate it.
>
> I really can't force our users to use antiquated stuff like scripts, let
> alone command line things, but I'll simply use SparkLauncher and your
> comment about the main class doing Pipeline.run() on the Master is
> something I can work with... somewhat.
> The execution results, metrics and all that are handled by the Master I
> guess.  Over time I'll figure out a way to report the metrics and results
> from the master back to the client.  I've done similar things with
> Map/Reduce in the past.
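>
> As a very rough sketch of that idea (the jar path, main class and master
> URL below are made up, and the listener is just one way of watching the
> application from the client side):
>
>   import org.apache.spark.launcher.SparkAppHandle;
>   import org.apache.spark.launcher.SparkLauncher;
>
>   public class LaunchFromClient {
>     public static void main(String[] args) throws Exception {
>       SparkAppHandle handle = new SparkLauncher()
>           .setAppResource("/path/to/my-beam-pipeline.jar")  // packaged pipeline jar
>           .setMainClass("com.example.MyBeamPipeline")       // the class doing Pipeline.run()
>           .setMaster("spark://spark-master:7077")
>           .setDeployMode("cluster")
>           .addAppArgs("--runner=SparkRunner")
>           .startApplication(new SparkAppHandle.Listener() {
>             @Override
>             public void stateChanged(SparkAppHandle h) {
>               // Coarse lifecycle states only: SUBMITTED, RUNNING, FINISHED, ...
>               System.out.println("Spark app state: " + h.getState());
>             }
>             @Override
>             public void infoChanged(SparkAppHandle h) { }
>           });
>
>       // The handle exposes application state, not Beam metrics; those would
>       // still have to be reported from the master/driver back to the client.
>       while (!handle.getState().isFinal()) {
>         Thread.sleep(1000);
>       }
>     }
>   }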
>
> Looking around I see that the same conditions apply for Flink.  Is this
> because Spark and Flink lack the APIs to talk to a client about the state
> of workloads unlike DataFlow and the Direct Runner?
>
> Thanks!
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
>
>
> On Thu, 17 Jan 2019 at 15:30, Juan Carlos Garcia <
> jcgarc...@gmail.com> wrote:
>
>> Hi Matt, during the time we were using Spark with Beam, the solution was
>> always to package the jar and use the spark-submit command pointing to your
>> main class, which will do `pipeline.run()`.
>>
>> The spark-submit command has a flag (--deploy-mode) that decides whether the
>> driver is launched on the submitting machine (client mode) or on one of the
>> machines in the cluster (cluster mode).
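>>
>> For example (host name, jar path and main class are placeholders; the
>> trailing --runner argument assumes the main class hands its args to the
>> Beam pipeline options parser):
>>
>>   spark-submit \
>>     --class com.example.MyBeamPipeline \
>>     --master spark://spark-master:7077 \
>>     --deploy-mode cluster \
>>     /path/to/my-beam-pipeline.jar \
>>     --runner=SparkRunner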
>>
>>
>> JC
>>
>>
>> On Thu, Jan 17, 2019 at 10:00 AM Matt Casters <mattcast...@gmail.com>
>> wrote:
>>
>>> Dear Beam friends,
>>>
>>> Now that I've got cool data integration (Kettle-beam) scenarios running
>>> on DataFlow with sample data sets in Google (Files, Pub/Sub, BigQuery,
>>> Streaming, Windowing, ...) I thought it was time to also give Apache Spark
>>> some attention.
>>>
>>> The thing I have some trouble with is figuring out what the relationship
>>> is between the runner (SparkRunner), Pipeline.run() and spark-submit (or
>>> SparkLauncher).
>>>
>>> The samples I'm seeing mostly involve packaging up a jar file and then
>>> doing a spark-submit.  That obviously makes it unclear if Pipeline.run()
>>> should be used at all and how Metrics should be obtained from a Spark job
>>> during execution or after completion.
>>>
>>> I really like the way the GCP DataFlow implementation automatically
>>> deploys jar file binaries, and from what I can determine
>>> org.apache.spark.launcher.SparkLauncher offers this functionality, so
>>> perhaps I'm either doing something wrong, or I'm reading the docs wrong,
>>> or the wrong docs.
>>> The thing is, if you try running your pipelines against a Spark master,
>>> feedback is really minimal, putting you in a trial & error situation
>>> pretty quickly.
>>>
>>> So thanks again in advance for any help!
>>>
>>> Cheers,
>>>
>>> Matt
>>> ---
>>> Matt Casters <mattcast...@gmail.com>
>>> Senior Solution Architect, Kettle Project Founder
>>>
>>>
>>
>> --
>>
>> JC
>>
>>
