Hi Thomas,

Thanks for your explanation; I understand your original intention. I will
give this issue serious consideration, and once I have an initial solution
I will start a follow-up discussion on the Beam ML.

Thanks for your vote. :)

Best,
Jincheng


Thomas Weise <t...@apache.org> wrote on Fri, Oct 25, 2019 at 7:32 AM:

> Hi Jincheng,
>
> Yes, this topic can be further discussed on the Beam ML. The only reason
> I brought it up here is that it would be desirable, from the Beam Flink
> runner's perspective, for the artifact staging mechanism you are working
> on to be reusable.
>
> Stage 1 in Beam is also up to the runner: artifact staging is a service
> discovered from the job server, and the fact that the Flink job server
> currently uses DFS is not set in stone. My interest was more about
> assumptions regarding the artifact structure, which may or may not allow
> for a reusable implementation.
>
> +1 for the proposal otherwise
>
> Thomas
>
>
> On Mon, Oct 21, 2019 at 8:40 PM jincheng sun <sunjincheng...@gmail.com>
> wrote:
>
> > Hi Thomas,
> >
> > Thanks for sharing your thoughts. I think improving on the limitations
> > of Beam artifact staging is a good topic (for Beam).
> >
> > My understanding is as follows:
> >
> > For Beam (data):
> >     Stage 1: BeamClient ------> JobService (data is uploaded to DFS).
> >     Stage 2: JobService (FlinkClient) ------> Flink job (operators
> > download the data from DFS).
> >     Stage 3: Operator ------> Harness (artifact staging service).
> >
> > For Flink (data):
> >     Stage 1: FlinkClient (local data is uploaded to the BlobServer via
> > the distributed cache) ------> Operator (data is downloaded from the
> > BlobServer). There is no need to depend on DFS.
> >     Stage 2: Operator ------> Harness (for docker we use the artifact
> > staging service).
> >
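As a rough illustration of the Flink Stage 1 flow described above (the client uploads a local file to a blob store, and operators later download it by key), here is a minimal plain-Python sketch. The `BlobStore` class, its method names, and the content-hash keying are illustrative assumptions, not Flink's actual BlobServer API:

```python
import hashlib
import os
import shutil

class BlobStore:
    """Toy stand-in for a blob server reachable by both client and operators."""

    def __init__(self, root):
        self.root = root

    def put(self, local_path):
        # Client side (Stage 1): upload the file, keyed by its content hash.
        with open(local_path, "rb") as f:
            key = hashlib.sha256(f.read()).hexdigest()
        shutil.copy(local_path, os.path.join(self.root, key))
        return key

    def get(self, key, target_dir):
        # Operator side: download the blob into a local working directory.
        target = os.path.join(target_dir, key)
        shutil.copy(os.path.join(self.root, key), target)
        return target
```

The point of the sketch is that only the client and the blob store need access to the original file; operators fetch it on demand by key, so no shared DFS is required.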
> > So I think Beam has to depend on DFS in Stage 1, and Stage 2 could use
> > the distributed cache if we remove Beam's DFS dependency in Stage 1 (of
> > course we need more detail here). We can bring that up in a separate
> > thread on the Beam dev@ ML. The current discussion focuses on the UDF
> > environment and dependency management for Python in Flink 1.10, so I
> > recommend voting in the current thread for Flink 1.10 and discussing
> > the Beam artifact staging improvements separately on Beam dev@.
> >
> > What do you think?
> >
> > Best,
> > Jincheng
> >
> > Thomas Weise <t...@apache.org> wrote on Mon, Oct 21, 2019 at 10:25 PM:
> >
> > > Beam artifact staging currently relies on a shared file system, and
> > > there are limitations, for example when running locally with Docker
> > > and a local FS. It sounds like a distributed cache based
> > > implementation might be a good (better?) option for artifact staging,
> > > even for the Beam Flink runner?
> > >
> > > If so, can the implementation you propose be compatible with the Beam
> > > artifact staging service, so that it can be plugged into the Beam
> > > Flink runner?
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Mon, Oct 21, 2019 at 2:34 AM jincheng sun <sunjincheng...@gmail.com>
> > > wrote:
> > >
> > > > Hi Max,
> > > >
> > > > Sorry for the late reply. Regarding the issues you mentioned above,
> > > > I'm glad to share my thoughts:
> > > >
> > > > > For process-based execution we use Flink's cache distribution
> > > > > instead of Beam's artifact staging.
> > > >
> > > > In the current design, we use Flink's cache distribution to upload
> > > > users' files from the client to the cluster in both docker mode and
> > > > process mode. That is, Flink's cache distribution and Beam's
> > > > artifact staging service work together in docker mode.
> > > >
> > > >
> > > > > Do we want to implement two different ways of staging artifacts?
> > > > > It seems sensible to use the same artifact staging functionality
> > > > > also for the process-based execution.
> > > >
> > > > I agree that the implementation would be simpler if we used the same
> > > > artifact staging functionality for process-based execution as well.
> > > > However, it is not optimal for performance, as it would introduce an
> > > > additional network transfer: in process mode the TaskManager and the
> > > > Python worker share the same environment, so the Python worker can
> > > > access the user files in Flink's distributed cache directly. We do
> > > > not need the staging service in this case.
> > > >
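To make the performance argument above concrete, here is a toy comparison; the function names and the "extra copy" standing in for the staging service's network transfer are illustrative assumptions, not actual Flink or Beam code:

```python
import os
import shutil

def process_mode_access(cache_dir, name):
    # Process mode: the Python worker shares the TaskManager's file
    # system, so it reads the distributed-cache file in place.
    with open(os.path.join(cache_dir, name)) as f:
        return f.read()

def docker_mode_access(cache_dir, name, sandbox_dir):
    # Docker mode: the file must cross the container boundary, modeled
    # here as one extra copy that stands in for the artifact staging
    # service's network transfer.
    staged = os.path.join(sandbox_dir, name)
    shutil.copy(os.path.join(cache_dir, name), staged)
    with open(staged) as f:
        return f.read()
```

Both paths yield the same bytes; the docker-mode path simply pays for one more hop, which is the transfer the process-mode design avoids.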
> > > > > Apart from being simpler, this would also allow the process-based
> > > > > execution to run in other environments than the Flink TaskManager
> > > > > environment.
> > > >
> > > > IMHO, this case is more like docker mode, and we can share or reuse
> > > > the code of Beam's docker mode. Furthermore, in this case the Python
> > > > worker is launched by the operator, so it is always in the same
> > > > environment as the operator.
> > > >
> > > > Thanks again for your feedback; it is valuable for finding the best
> > > > final architecture.
> > > >
> > > > Feel free to correct me if anything above is incorrect.
> > > >
> > > > Best,
> > > > Jincheng
> > > >
> > > > > Maximilian Michels <m...@apache.org> wrote on Wed, Oct 16, 2019 at 4:23 PM:
> > > >
> > > > > I'm also late to the party here :) When I saw the first draft, I
> > > > > was wondering how exactly the design doc would tie in with Beam.
> > > > > Thanks for the update.
> > > > >
> > > > > A couple of comments in this regard:
> > > > >
> > > > > > Flink has provided a distributed cache mechanism and allows
> > > > > > users to upload their files using the "registerCachedFile"
> > > > > > method of ExecutionEnvironment/StreamExecutionEnvironment. The
> > > > > > Python files users specify through "add_python_file",
> > > > > > "set_python_requirements" and "add_python_archive" are also
> > > > > > uploaded through this method eventually.
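A minimal stand-in for the "registerCachedFile" bookkeeping described in the paragraph above might look as follows; `FakeExecutionEnvironment` and its method names are hypothetical, loosely mirroring Flink's `ExecutionEnvironment.registerCachedFile(filePath, name)` rather than reproducing the real API:

```python
class FakeExecutionEnvironment:
    """Toy model: records files to ship with the job via the distributed cache."""

    def __init__(self):
        self._cached_files = {}

    def register_cached_file(self, file_path, name):
        # Remember a (name -> path) entry; at submission time each entry
        # would be uploaded so that workers can look the file up by name.
        self._cached_files[name] = file_path

    def add_python_file(self, file_path):
        # Higher-level APIs like add_python_file would funnel into the
        # same registration mechanism, as the quoted paragraph describes.
        base_name = file_path.rsplit("/", 1)[-1]
        self.register_cached_file(file_path, "python_file_" + base_name)

    def cached_files(self):
        return dict(self._cached_files)
```

The takeaway is that every user-facing dependency API reduces to one registry that the runtime distributes, which is why a single transport (distributed cache or staging service) can serve all of them.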
> > > > >
> > > > > For process-based execution we use Flink's cache distribution
> > > > > instead of Beam's artifact staging.
> > > > >
> > > > > > Apache Beam Portability Framework already supports artifact
> > > > > > staging that works out of the box with the Docker environment.
> > > > > > We can use the artifact staging service defined in Apache Beam
> > > > > > to transfer the dependencies from the operator to the Python SDK
> > > > > > harness running in the docker container.
> > > > >
> > > > > Do we want to implement two different ways of staging artifacts?
> > > > > It seems sensible to use the same artifact staging functionality
> > > > > also for the process-based execution. Apart from being simpler,
> > > > > this would also allow the process-based execution to run in other
> > > > > environments than the Flink TaskManager environment.
> > > > >
> > > > > Thanks,
> > > > > Max
> > > > >
> > > > > On 15.10.19 11:13, Wei Zhong wrote:
> > > > > > Hi Thomas,
> > > > > >
> > > > > > Thanks a lot for your suggestion!
> > > > > >
> > > > > > As you can see from the "Goals" section, this FLIP focuses on
> > > > > > dependency management in process mode. However, the APIs and
> > > > > > design proposed in this FLIP also apply to docker mode, so it
> > > > > > makes sense to me to also describe how this design integrates
> > > > > > with the artifact staging service of Apache Beam in docker mode.
> > > > > > I have updated the design doc and am looking forward to your
> > > > > > feedback.
> > > > > >
> > > > > > Thanks,
> > > > > > Wei
> > > > > >
> > > > > >> On Oct 15, 2019, at 01:54, Thomas Weise <t...@apache.org> wrote:
> > > > > >>
> > > > > >> Sorry for joining the discussion late.
> > > > > >>
> > > > > >> The Beam environment already supports artifact staging; it
> > > > > >> works out of the box with the Docker environment. I think it
> > > > > >> would be helpful to explain in the FLIP how this proposal
> > > > > >> relates to what Beam offers and how it would be integrated.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Thomas
> > > > > >>
> > > > > >>
> > > > > >> On Mon, Oct 14, 2019 at 8:09 AM Jeff Zhang <zjf...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> +1
> > > > > >>>
> > > > > >>> Hequn Cheng <chenghe...@gmail.com> wrote on Mon, Oct 14, 2019 at 10:55 PM:
> > > > > >>>
> > > > > >>>> +1
> > > > > >>>>
> > > > > >>>> Good job, Wei!
> > > > > >>>>
> > > > > >>>> Best, Hequn
> > > > > >>>>
> > > > > >>>> On Mon, Oct 14, 2019 at 2:54 PM Dian Fu <dian0511...@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Hi Wei,
> > > > > >>>>>
> > > > > >>>>> +1 (non-binding). Thanks for driving this.
> > > > > >>>>>
> > > > > >>>>> Thanks,
> > > > > >>>>> Dian
> > > > > >>>>>
> > > > > >>>>>> On Oct 14, 2019, at 1:40 PM, jincheng sun <sunjincheng...@gmail.com>
> > > > > >>>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>> +1
> > > > > >>>>>>
> > > > > >>>>>> Wei Zhong <weizhong0...@gmail.com> wrote on Sat, Oct 12, 2019 at 8:41 PM:
> > > > > >>>>>>
> > > > > >>>>>>> Hi all,
> > > > > >>>>>>>
> > > > > >>>>>>> I would like to start the vote for FLIP-78 [1], which was
> > > > > >>>>>>> discussed and reached consensus in the discussion
> > > > > >>>>>>> thread [2].
> > > > > >>>>>>>
> > > > > >>>>>>> The vote will be open for at least 72 hours. I'll try to
> > > > > >>>>>>> close it by 2019-10-16 18:00 UTC, unless there is an
> > > > > >>>>>>> objection or not enough votes.
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks,
> > > > > >>>>>>> Wei
> > > > > >>>>>>>
> > > > > >>>>>>> [1]
> > > > > >>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-78%3A+Flink+Python+UDF+Environment+and+Dependency+Management
> > > > > >>>>>>> [2]
> > > > > >>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Python-UDF-Environment-and-Dependency-Management-td33514.html
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Best Regards
> > > > > >>>
> > > > > >>> Jeff Zhang
> > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
>
