It was my understanding that the client first uploads the artifacts to the
jobserver, and that the SDK harness then pulls these artifacts from the
jobserver over a gRPC port.

I see the artifacts on the jobserver while the job is attempting to run:

root@flink-beam-jobserver-9fccb99b8-6mhtq:/tmp/beam-artifact-staging/3024e5d862fef831e830945b2d3e4e9511e0423bfb9c48de75aa2b3b67decce4
Do the jobserver and the taskmanager need to share the artifact staging
volume?
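
If so, I'm picturing something like the following in both the jobserver
and taskmanager deployments (the volume and claim names here are
hypothetical, not from my actual manifests):

        volumeMounts:
          - name: artifact-staging
            mountPath: /tmp/beam-artifact-staging
        volumes:
          - name: artifact-staging
            persistentVolumeClaim:
              claimName: artifact-staging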

On Tue, Sep 22, 2020 at 4:04 PM Kyle Weaver <kcwea...@google.com> wrote:

> > rpc error: code = Unknown desc = ; failed to retrieve chunk for
> /tmp/staged/pickled_main_session
>
> Are you sure that's due to a networking issue, and not a problem with the
> filesystem / volume mounting?
>
> On Tue, Sep 22, 2020 at 3:55 PM Sam Bourne <samb...@gmail.com> wrote:
>
>>> I would not be surprised if there was something weird going on with
>>> Docker in Docker. The defaults mostly work fine when an external SDK
>>> harness is used [1].
>>>
>>> Can you provide more information on the exception you got? (I'm
>>> particularly interested in the line number).
>>>
>> The actual error is a bit tricky to find, but if you monitor the docker
>> logs from within the taskmanager pod you can see it failing when the SDK
>> harness's boot.go attempts to pull the artifacts from the artifact
>> endpoint [1]
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L139
>>
>> 2020/09/22 22:07:51 Initializing python harness: /opt/apache/beam/boot 
>> --id=1-1 --provision_endpoint=localhost:45775
>> 2020/09/22 22:07:59 Failed to retrieve staged files: failed to retrieve 
>> /tmp/staged in 3 attempts: failed to retrieve chunk for 
>> /tmp/staged/pickled_main_session
>>     caused by:
>> rpc error: code = Unknown desc = ; failed to retrieve chunk for 
>> /tmp/staged/pickled_main_session
>>
>> I can hit the jobserver fine from my taskmanager pod, as well as from
>> within an SDK container I spin up manually (with --network host):
>>
>> root@flink-taskmanager-56c6bdb6dd-49md7:/opt/apache/beam# ping 
>> flink-beam-jobserver
>> PING flink-beam-jobserver.default.svc.cluster.local (10.103.16.129) 56(84) 
>> bytes of data.
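>>
>> For reference, I spun up the manual SDK container roughly like this
>> (image tag and provision endpoint here are illustrative, not my exact
>> command):
>>
>> docker run --rm --network host apache/beam_python3.7_sdk:2.24.0 \
>>     --id=1-1 --provision_endpoint=localhost:45775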
>>
>> I don’t see how this would work if the endpoint hostname is localhost.
>> I’ll explore how this is working in the flink-on-k8s-operator.
>>
>> Thanks for taking a look!
>> Sam
>>
>> On Tue, Sep 22, 2020 at 2:48 PM Kyle Weaver <kcwea...@google.com> wrote:
>>
>>> > The issue is that the jobserver does not provide the proper endpoints
>>> to the SDK harness when it submits the job to flink.
>>>
>>> I would not be surprised if there was something weird going on with
>>> Docker in Docker. The defaults mostly work fine when an external SDK
>>> harness is used [1].
>>>
>>> Can you provide more information on the exception you got? (I'm
>>> particularly interested in the line number).
>>>
>>> > The issue is that the jobserver does not provide the proper endpoints
>>> to the SDK harness when it submits the job to flink.
>>>
>>> More information about this failure mode would be helpful as well.
>>>
>>> [1]
>>> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_job_server.yaml
>>>
>>>
>>> On Tue, Sep 22, 2020 at 10:36 AM Sam Bourne <samb...@gmail.com> wrote:
>>>
>>>> Hello beam community!
>>>>
>>>> I’m looking for some help solving an issue running a beam job on flink
>>>> using --environment_type DOCKER.
>>>>
>>>> I have a flink cluster running in kubernetes configured so the
>>>> taskmanagers can run docker [1]. I’m attempting to deploy a flink jobserver
>>>> in the cluster. The issue is that the jobserver does not provide the proper
>>>> endpoints to the SDK harness when it submits the job to flink. It typically
>>>> provides something like localhost:34567 using the hostname the gRPC
>>>> server was bound to. There is a jobserver flag --job-host that will
>>>> bind the grpc server to this provided hostname, but I cannot seem to get it
>>>> to bind to the k8s jobservice Service name [2]. I’ve tried different
>>>> flavors of FQDNs but haven’t had any luck.
>>>>
>>>> [main] INFO org.apache.beam.runners.jobsubmission.JobServerDriver - 
>>>> ArtifactStagingService started on flink-beam-jobserver:8098
>>>> [main] WARN org.apache.beam.runners.jobsubmission.JobServerDriver - 
>>>> Exception during job server creation
>>>> java.io.IOException: Failed to bind
>>>> ...
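>>>>
>>>> For reference, I’m launching the jobserver with flags roughly like
>>>> this (the values here are illustrative, not my exact invocation):
>>>>
>>>> --flink-master=flink-jobmanager:8081 --job-host=flink-beam-jobserver \
>>>> --job-port=8099 --artifact-port=8098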
>>>>
>>>> Does anyone have experience with this who could provide some
>>>> guidance?
>>>>
>>>> Cheers,
>>>> Sam
>>>>
>>>> [1] https://github.com/sambvfx/beam-flink-k8s
>>>> [2]
>>>> https://github.com/sambvfx/beam-flink-k8s/blob/master/k8s/beam-flink-jobserver.yaml#L29
>>>>
>>>
