Hi Thomas, Thanks for your explanation. I understand your original intention. I will seriously consider this issue. After I have the initial solution, I will bring up a further discussion in Beam ML.
Thanks for your voting. :) Best, Jincheng Thomas Weise <t...@apache.org> 于2019年10月25日周五 上午7:32写道: > Hi Jincheng, > > Yes, this topic can be further discussed on the Beam ML. The only reason I > brought it up here is that it would be desirable from Beam Flink runner > perspective for the artifact staging mechanism that you work on to be > reusable. > > Stage 1 in Beam is also up to the runner, artifact staging is a service > discovered from the job server and that the Flink job server currently uses > DFS is not set in stone. My interest was more regarding assumptions > regarding the artifact structure, which may or may not allow for reusable > implementation. > > +1 for the proposal otherwise > > Thomas > > > On Mon, Oct 21, 2019 at 8:40 PM jincheng sun <sunjincheng...@gmail.com> > wrote: > > > Hi Thomas, > > > > Thanks for sharing your thoughts. I think improve and solve the > limitations > > of the Beam artifact staging is good topic(For beam). > > > > As I understand it as follows: > > > > For Beam(data): > > Stage1: BeamClient ------> JobService (data will be upload to DFS). > > Stage2: JobService(FlinkClient) ------> FlinkJob (operator download > > the data from DFS) > > Stage3: Operator ------> Harness(artifact staging service) > > > > For Flink(data): > > Stage1: FlinkClient(data(local) upload to BlobServer using distribute > > cache) ------> Operator (data will be download from BlobServer). Do not > > have to depend on DFS. > > Stage2: Operator ------> Harness(for docker we using artifact staging > > service) > > > > So, I think Beam have to depend on DFS in Stage1. and Stage2 can using > > distribute cache if we remove the dependency of DFS for Beam in > Stage1.(Of > > course we need more detail here), we can bring up the discussion in a > > separate Beam dev@ ML, the current discussion focuses on Flink 1.10 > > version > > of UDF Environment and Dependency Management for python, so I recommend > > voting in the current ML for Flink 1.10, Beam artifact staging > improvements > > are discussed in a separate Beam dev@. > > > > What do you think? > > > > Best, > > Jincheng > > > > Thomas Weise <t...@apache.org> 于2019年10月21日周一 下午10:25写道: > > > > > Beam artifact staging currently relies on shared file system and there > > are > > > limitations, for example when running locally with Docker and local FS. > > It > > > sounds like a distributed cache based implementation might be a good > > > (better?) option for artifact staging even for the Beam Flink runner? > > > > > > If so, can the implementation you propose be compatible with the Beam > > > artifact staging service so that it can be plugged into the Beam Flink > > > runner? > > > > > > Thanks, > > > Thomas > > > > > > > > > On Mon, Oct 21, 2019 at 2:34 AM jincheng sun <sunjincheng...@gmail.com > > > > > wrote: > > > > > > > Hi Max, > > > > > > > > Sorry for the late reply. Regarding the issue you mentioned above, > I'm > > > glad > > > > to share my thoughts: > > > > > > > > > For process-based execution we use Flink's cache distribution > instead > > > of > > > > Beam's artifact staging. > > > > > > > > In current design, we use Flink's cache distribution to upload users' > > > files > > > > from client to cluster in both docker mode and process mode. That is, > > > > Flink's cache distribution and Beam's artifact staging service work > > > > together in docker mode. > > > > > > > > > > > > > Do we want to implement two different ways of staging artifacts? It > > > seems > > > > sensible to use the same artifact staging functionality also for the > > > > process-based execution. > > > > > > > > I agree that the implementation will be simple if we use the same > > > artifact > > > > staging functionality also for the process-based execution. However, > > it's > > > > not the best for performance as it will introduce an additional > network > > > > transmission, as in process mode TaskManager and python worker share > > the > > > > same environment, in which case the user files in Flink Distribute > > Cache > > > > can be accessed by python worker directly. We do not need the staging > > > > service in this case. > > > > > > > > > Apart from being simpler, this would also allow the process-based > > > > execution to run in other environments than the Flink TaskManager > > > > environment. > > > > > > > > IMHO, this case is more like docker mode, and we can share or reuse > the > > > > code of Beam docker mode. Furthermore, in this case python worker is > > > > launched by the operator, so it is always in the same environment as > > the > > > > operator. > > > > > > > > Thanks again for your feedback, and it is valuable for find out the > > final > > > > best architecture. > > > > > > > > Feel free to correct me if there is anything incorrect. > > > > > > > > Best, > > > > Jincheng > > > > > > > > Maximilian Michels <m...@apache.org> 于2019年10月16日周三 下午4:23写道: > > > > > > > > > I'm also late to the party here :) When I saw the first draft, I > was > > > > > thinking how exactly the design doc would tie in with Beam. Thanks > > for > > > > > the update. > > > > > > > > > > A couple of comments with this regard: > > > > > > > > > > > Flink has provided a distributed cache mechanism and allows users > > to > > > > > upload their files using "registerCachedFile" method in > > > > > ExecutionEnvironment/StreamExecutionEnvironment. The python files > > users > > > > > specified through "add_python_file", "set_python_requirements" and > > > > > "add_python_archive" are also uploaded through this method > > eventually. > > > > > > > > > > For process-based execution we use Flink's cache distribution > instead > > > of > > > > > Beam's artifact staging. > > > > > > > > > > > Apache Beam Portability Framework already supports artifact > staging > > > > that > > > > > works out of the box with the Docker environment. We can use the > > > artifact > > > > > staging service defined in Apache Beam to transfer the dependencies > > > from > > > > > the operator to Python SDK harness running in the docker container. > > > > > > > > > > Do we want to implement two different ways of staging artifacts? It > > > > > seems sensible to use the same artifact staging functionality also > > for > > > > > the process-based execution. Apart from being simpler, this would > > also > > > > > allow the process-based execution to run in other environments than > > the > > > > > Flink TaskManager environment. > > > > > > > > > > Thanks, > > > > > Max > > > > > > > > > > On 15.10.19 11:13, Wei Zhong wrote: > > > > > > Hi Thomas, > > > > > > > > > > > > Thanks a lot for your suggestion! > > > > > > > > > > > > As you can see from the section "Goals" that this FLIP focuses on > > the > > > > > dependency management in process mode. However, the APIs and design > > > > > proposed in this FLIP also applies for the docker mode. So it makes > > > sense > > > > > to me to also describe how this design is integated to the artifact > > > > staging > > > > > service of Apache Beam in docker mode. I have updated the design > doc > > > and > > > > > looking forward to your feedback. > > > > > > > > > > > > Thanks, > > > > > > Wei > > > > > > > > > > > >> 在 2019年10月15日,01:54,Thomas Weise <t...@apache.org> 写道: > > > > > >> > > > > > >> Sorry for joining the discussion late. > > > > > >> > > > > > >> The Beam environment already supports artifact staging, it works > > out > > > > of > > > > > the > > > > > >> box with the Docker environment. I think it would be helpful to > > > > explain > > > > > in > > > > > >> the FLIP how this proposal relates to what Beam offers / how it > > > would > > > > be > > > > > >> integrated. > > > > > >> > > > > > >> Thanks, > > > > > >> Thomas > > > > > >> > > > > > >> > > > > > >> On Mon, Oct 14, 2019 at 8:09 AM Jeff Zhang <zjf...@gmail.com> > > > wrote: > > > > > >> > > > > > >>> +1 > > > > > >>> > > > > > >>> Hequn Cheng <chenghe...@gmail.com> 于2019年10月14日周一 下午10:55写道: > > > > > >>> > > > > > >>>> +1 > > > > > >>>> > > > > > >>>> Good job, Wei! > > > > > >>>> > > > > > >>>> Best, Hequn > > > > > >>>> > > > > > >>>> On Mon, Oct 14, 2019 at 2:54 PM Dian Fu < > dian0511...@gmail.com> > > > > > wrote: > > > > > >>>> > > > > > >>>>> Hi Wei, > > > > > >>>>> > > > > > >>>>> +1 (non-binding). Thanks for driving this. > > > > > >>>>> > > > > > >>>>> Thanks, > > > > > >>>>> Dian > > > > > >>>>> > > > > > >>>>>> 在 2019年10月14日,下午1:40,jincheng sun <sunjincheng...@gmail.com > > > > > 写道: > > > > > >>>>>> > > > > > >>>>>> +1 > > > > > >>>>>> > > > > > >>>>>> Wei Zhong <weizhong0...@gmail.com> 于2019年10月12日周六 下午8:41写道: > > > > > >>>>>> > > > > > >>>>>>> Hi all, > > > > > >>>>>>> > > > > > >>>>>>> I would like to start the vote for FLIP-78[1] which is > > > discussed > > > > > and > > > > > >>>>>>> reached consensus in the discussion thread[2]. > > > > > >>>>>>> > > > > > >>>>>>> The vote will be open for at least 72 hours. I'll try to > > close > > > it > > > > > by > > > > > >>>>>>> 2019-10-16 18:00 UTC, unless there is an objection or not > > > enough > > > > > >>>> votes. > > > > > >>>>>>> > > > > > >>>>>>> Thanks, > > > > > >>>>>>> Wei > > > > > >>>>>>> > > > > > >>>>>>> [1] > > > > > >>>>>>> > > > > > >>>>> > > > > > >>>> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-78%3A+Flink+Python+UDF+Environment+and+Dependency+Management > > > > > >>>>>>> < > > > > > >>>>>>> > > > > > >>>>> > > > > > >>>> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-78:+Flink+Python+UDF+Environment+and+Dependency+Management > > > > > >>>>>>>> > > > > > >>>>>>> [2] > > > > > >>>>>>> > > > > > >>>>> > > > > > >>>> > > > > > >>> > > > > > > > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Python-UDF-Environment-and-Dependency-Management-td33514.html > > > > > >>>>>>> < > > > > > >>>>>>> > > > > > >>>>> > > > > > >>>> > > > > > >>> > > > > > > > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Python-UDF-Environment-and-Dependency-Management-td33514.html > > > > > >>>>>>>> > > > > > >>>>>>> > > > > > >>>>>>> > > > > > >>>>>>> > > > > > >>>>> > > > > > >>>>> > > > > > >>>> > > > > > >>> > > > > > >>> > > > > > >>> -- > > > > > >>> Best Regards > > > > > >>> > > > > > >>> Jeff Zhang > > > > > >>> > > > > > > > > > > > > > > > > > > > > >