Hi Moritz,

Yes, yes I am.
I've gotten to the point where I have a job service runner set up and am trying to get the Spark and artifact storage volume configured properly. I can only find vague references, but it seems there needs to be a shared drive accessible to both the Spark cluster workers and the job service; I've been unsuccessful at configuring that so far, and I'm not sure exactly where that config lives. In my case that would be a Kubernetes ReadWriteMany volume, but it looks like it should also support generic NFS or cloud bucket storage and doesn't have to be managed by Kubernetes.

Best,
Jon

On Thu, Jul 20, 2023 at 1:14 AM Moritz Mack <mm...@talend.com> wrote:
> Hi Jon,
>
> I just want to check in here briefly, are you still looking for support on this?
>
> Sadly yes, this totally lacks documentation and isn't straightforward to set up.
>
> /Moritz
>
> On 21.06.23, 23:47, "Jon Molle via user" <user@beam.apache.org> wrote:
>
> Hi Pavel,
>
> Thanks for your response! I took a look at running Beam on Kinesis (analytics), as it is the AWS-recommended way to run Beam jobs. It seems like it doesn't work with the portable runner model. Our project is a daemon running in a Kubernetes cluster that has Beam code running as part of certain tasks, so I'm not exactly sure how that would work with Kinesis, as I don't see a way to grab the master URL (and I'm not entirely sure the Flink image being run by Kinesis would work for Beam). I'd really like to avoid using any of the non-portable runners if possible.
>
> That's part of why I am looking at Spark (although Flink looks fairly similar): EKS supports autoscaling and other features Dataflow does.
> I don't want to make a huge divergence between the GCP and AWS behaviour if possible. It seems possible, but the docs for the other runners are a bit ambiguous about exactly how much of submitting jobs is handled by the runner.
>
> On Wed, Jun 21, 2023 at 12:28 PM Pavel Solomin <p.o.solo...@gmail.com> wrote:
> Hello!
>
> > to also run on AWS
> > A spark cluster on EKS seems the closest analog
>
> There's another way of running Beam apps in AWS: https://aws.amazon.com/kinesis/data-analytics/ which is basically "serverless" Flink. It says Kinesis, but you can run any Flink / Beam job there; you don't have to use Kinesis streams. I've used KDA in multiple projects so far and it works OK. FlinkRunner also seems to have more docs as far as I can see.
>
> Here's a pom.xml example: https://github.com/aws-samples/amazon-kinesis-data-analytics-examples/blob/master/Beam/pom.xml
>
> Best Regards,
> Pavel Solomin
> Tel: +351 962 950 692 | Skype: pavel_solomin | Linkedin
>
> On Wed, 21 Jun 2023 at 16:31, Jon Molle via user <user@beam.apache.org> wrote:
> Hi,
>
> I've been looking at the Spark Portable Runner docs, specifically Java when possible, and I'm a little confused about the organization.
> The docs seem to say that the JobService both submits the code to the linked Spark cluster (described by the master URL) and requires you to run a spark-submit command afterwards on whatever artifacts it builds.
>
> Unfortunately I'm not that familiar with Spark in general, so I'm probably misunderstanding more here, but the job server images either totally lack documentation or just repeat the Spark runner page in the main docs.
>
> For context, I'm trying to port some code that we're currently running on a Dataflow runner (on GCP) to also run on AWS. A Spark cluster on EKS (either self-managed or potentially through EMR, though likely not, based on what I'm reading into the docs and some brief testing) seems the closest analog.
>
> The new Tour does the same thing, in addition to only really having examples for Python and a few more typos. I haven't found any existing questions like this elsewhere, so I assume I'm just missing something that should be obvious.
>
> Thanks for your time.
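
P.S. In case it helps to make the shared-drive question concrete, here is a rough sketch of the kind of setup I mean. The hostnames and paths (spark-master, /mnt/beam-artifacts) are just placeholders, not from any official doc, and I haven't verified this end to end; the key idea is that the job server's artifact staging directory has to be a path the Spark workers can also read, e.g. an NFS-backed ReadWriteMany volume mounted at the same path in every container:

```shell
# Run the Beam Spark job server with its artifact staging directory on a
# volume that is also mounted (at the same path) inside every Spark worker.
# On Kubernetes this would be a ReadWriteMany PVC (e.g. NFS-backed); with
# plain Docker, a shared bind mount is enough for local testing.
docker run --rm \
  -p 8099:8099 -p 8098:8098 -p 8097:8097 \
  -v /mnt/beam-artifacts:/mnt/beam-artifacts \
  apache/beam_spark_job_server:latest \
    --spark-master-url=spark://spark-master:7077 \
    --artifacts-dir=/mnt/beam-artifacts

# The pipeline is then submitted against the job server, e.g. with
# options along the lines of:
#   --runner=PortableRunner --job_endpoint=localhost:8099
# plus an --environment_type appropriate to the cluster setup.
```

If the workers can't see that directory, staging appears to succeed but the executors fail when they try to load the artifacts, which matches the behaviour I've been seeing.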