It was my understanding that the client first uploads the artifacts to the jobserver, and the SDK harness then pulls those artifacts from the jobserver over a gRPC port.
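As a purely illustrative sketch of that transfer (this is not Beam's actual API — `retrieve_chunks` and the chunk size are made up here), the harness streams each staged file from the artifact service as a sequence of byte chunks and reassembles it locally, which is why the error above reports a failure to "retrieve chunk":

```python
# Illustrative only: mimic the chunked artifact retrieval the SDK harness
# performs over gRPC, without the gRPC plumbing. A staged artifact arrives
# as a stream of fixed-size byte chunks that are concatenated back together.
def retrieve_chunks(data: bytes, chunk_size: int = 1024):
    """Yield `data` in fixed-size chunks, like a streaming gRPC response."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

artifact = b"contents of pickled_main_session"
restored = b"".join(retrieve_chunks(artifact, chunk_size=8))
assert restored == artifact  # reassembled artifact matches the staged bytes
```

If any chunk request fails (e.g. because the file only exists on the jobserver's local disk), the whole artifact retrieval fails, which matches the repeated "failed to retrieve chunk" message in the logs below.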
I see the artifacts on the jobserver while the job is attempting to run:

root@flink-beam-jobserver-9fccb99b8-6mhtq:/tmp/beam-artifact-staging/3024e5d862fef831e830945b2d3e4e9511e0423bfb9c48de75aa2b3b67decce4

Do the jobserver and the taskmanager need to share the artifact staging volume?

On Tue, Sep 22, 2020 at 4:04 PM Kyle Weaver <kcwea...@google.com> wrote:

> > rpc error: code = Unknown desc = ; failed to retrieve chunk for
> > /tmp/staged/pickled_main_session
>
> Are you sure that's due to a networking issue, and not a problem with the
> filesystem / volume mounting?
>
> On Tue, Sep 22, 2020 at 3:55 PM Sam Bourne <samb...@gmail.com> wrote:
>
>>> I would not be surprised if there was something weird going on with
>>> Docker in Docker. The defaults mostly work fine when an external SDK
>>> harness is used [1].
>>>
>>> Can you provide more information on the exception you got? (I'm
>>> particularly interested in the line number.)
>>
>> The actual error is a bit tricky to find, but if you monitor the docker
>> logs from within the taskmanager pod you can see it failing when the SDK
>> harness boot.go attempts to pull the artifacts from the artifact
>> endpoint [1].
>>
>> [1] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L139
>>
>> 2020/09/22 22:07:51 Initializing python harness: /opt/apache/beam/boot --id=1-1 --provision_endpoint=localhost:45775
>> 2020/09/22 22:07:59 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/pickled_main_session
>> caused by:
>> rpc error: code = Unknown desc = ; failed to retrieve chunk for /tmp/staged/pickled_main_session
>>
>> I can hit the jobserver fine from my taskmanager pod, as well as from
>> within an SDK container I spin up manually (with --network host):
>>
>> root@flink-taskmanager-56c6bdb6dd-49md7:/opt/apache/beam# ping flink-beam-jobserver
>> PING flink-beam-jobserver.default.svc.cluster.local (10.103.16.129) 56(84) bytes of data.
>>
>> I don't see how this would work if the endpoint hostname is localhost.
>> I'll explore how this is working in the flink-on-k8s-operator.
>>
>> Thanks for taking a look!
>> Sam
>>
>> On Tue, Sep 22, 2020 at 2:48 PM Kyle Weaver <kcwea...@google.com> wrote:
>>
>>> > The issue is that the jobserver does not provide the proper endpoints
>>> > to the SDK harness when it submits the job to flink.
>>>
>>> I would not be surprised if there was something weird going on with
>>> Docker in Docker. The defaults mostly work fine when an external SDK
>>> harness is used [1].
>>>
>>> Can you provide more information on the exception you got? (I'm
>>> particularly interested in the line number.)
>>>
>>> > The issue is that the jobserver does not provide the proper endpoints
>>> > to the SDK harness when it submits the job to flink.
>>>
>>> More information about this failure mode would be helpful as well.
>>>
>>> [1] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_job_server.yaml
>>>
>>> On Tue, Sep 22, 2020 at 10:36 AM Sam Bourne <samb...@gmail.com> wrote:
>>>
>>>> Hello beam community!
>>>>
>>>> I'm looking for some help solving an issue running a beam job on flink
>>>> using --environment_type DOCKER.
>>>>
>>>> I have a flink cluster running in kubernetes configured so the
>>>> taskworkers can run docker [1]. I'm attempting to deploy a flink jobserver
>>>> in the cluster. The issue is that the jobserver does not provide the proper
>>>> endpoints to the SDK harness when it submits the job to flink. It typically
>>>> provides something like localhost:34567, using the hostname the grpc
>>>> server was bound to. There is a jobserver flag --job-host that will
>>>> bind the grpc server to the provided hostname, but I cannot seem to get it
>>>> to bind to the k8s jobservice Service name [2]. I've tried different
>>>> flavors of FQDNs but haven't had any luck.
>>>>
>>>> [main] INFO org.apache.beam.runners.jobsubmission.JobServerDriver - ArtifactStagingService started on flink-beam-jobserver:8098
>>>> [main] WARN org.apache.beam.runners.jobsubmission.JobServerDriver - Exception during job server creation
>>>> java.io.IOException: Failed to bind
>>>> ...
>>>>
>>>> Does anyone have some experience with this that could help provide some
>>>> guidance?
>>>>
>>>> Cheers,
>>>> Sam
>>>>
>>>> [1] https://github.com/sambvfx/beam-flink-k8s
>>>> [2] https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml#L29
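[Editor's note] If the answer to the shared-volume question at the top of this thread turns out to be yes, one way to make /tmp/beam-artifact-staging visible to both pods is to mount the same claim in each. This is a hypothetical manifest fragment, not taken from the linked repo — the claim name, container name, and access mode are assumptions:

```yaml
# Hypothetical sketch: mount one shared PersistentVolumeClaim at
# /tmp/beam-artifact-staging in BOTH the jobserver and taskmanager pod
# specs, so artifacts staged by the jobserver are readable by the harness.
volumes:
  - name: beam-artifact-staging
    persistentVolumeClaim:
      claimName: beam-artifact-staging   # assumed PVC; must be ReadWriteMany
                                         # if the pods land on different nodes
containers:
  - name: beam-jobserver                 # repeat the same volumeMount in the
    volumeMounts:                        # taskmanager pod spec
      - name: beam-artifact-staging
        mountPath: /tmp/beam-artifact-staging
```

The design choice here is to sidestep the gRPC chunk retrieval entirely for the staging directory: if both pods see the same filesystem path, a chunk that "exists" for the jobserver also exists for the SDK harness.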