You are right. It contains the python package `pyflink` and some dependencies such as py4j and cloudpickle, but it does not contain all relevant dependencies (e.g. `google.protobuf`, as the error log shows, which I also reproduced on my own machine).
Best,
Biao Geng

Levan Huyen <lvhu...@gmail.com> wrote on Thu, 20 Oct 2022 at 19:53:

> Thanks Biao.
>
> May I ask one more question: does the binary package on the Apache site (e.g.
> https://archive.apache.org/dist/flink/flink-1.15.2) contain the python
> package `pyflink` and its dependencies? I guess the answer is no.
>
> Thanks and regards,
> Levan Huyen
>
> On Thu, 20 Oct 2022 at 18:13, Biao Geng <biaoge...@gmail.com> wrote:
>
>> Hi Levan,
>> Great to hear that your issue is resolved!
>> For the follow-up question, I am not quite familiar with AWS EMR's
>> configuration for Flink, but judging from the error you attached, it looks
>> like pyflink may not ship some Google dependencies in the Flink binary
>> zip file, and as a result it will try to find them in your Python
>> environment. cc @hxbks...@gmail.com
>> For now, to manage complex Python dependencies, the typical way to use
>> pyflink on multi-node clusters in production is to create your own venv and
>> use it in your `flink run` command or in the Python code. You can refer to
>> this doc
>> <https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/python/faq/#preparing-python-virtual-environment>
>> for details.
>>
>> Best,
>> Biao Geng
>>
>> Levan Huyen <lvhu...@gmail.com> wrote on Thu, 20 Oct 2022 at 14:11:
>>
>>> Hi Biao,
>>>
>>> Thanks for your help. That solved my issue. It turned out that in setup 1
>>> (on EMR), I had apache-flink installed, but the package (and its
>>> dependencies) were not in the directory `/usr/lib/python3.7/site-packages`
>>> (corresponding to the python binary in `/usr/bin/python3`). For some
>>> reason, the packages were in the current user's location (`~/.local/...`),
>>> which Flink did not look at.
>>>
>>> BTW, is there any way to use the pyflink shipped with the Flink binary
>>> zip file that I downloaded from Apache's site? On EMR, such a package is
>>> included, and I feel it's awkward to have to install another version using
>>> `pip install`.
>>> It will also be confusing where to add the dependency jars.
>>>
>>> Thanks and regards,
>>> Levan Huyen
>>>
>>>
>>> On Thu, 20 Oct 2022 at 02:25, Biao Geng <biaoge...@gmail.com> wrote:
>>>
>>>> Hi Levan,
>>>>
>>>> For your setups 1 & 2, it looks like the Python environment is not
>>>> ready. Have you tried `python -m pip install apache-flink` for the
>>>> first two setups?
>>>> For your setup 3, as you are trying to use the `flink run ...` command,
>>>> it will try to connect to a launched Flink cluster, but I guess you did
>>>> not launch one. You can run `start-cluster.sh` first to launch a
>>>> standalone Flink cluster and then try the `flink run ...` command.
>>>> For your setup 4, the reason it works well is that it uses the default
>>>> mini cluster to run the pyflink job. So even though you haven't started
>>>> a standalone cluster, it can still work.
>>>>
>>>> Best,
>>>> Biao Geng
>>>>
>>>> Levan Huyen <lvhu...@gmail.com> wrote on Wed, 19 Oct 2022 at 17:07:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm new to PyFlink, and I couldn't run a basic example that ships
>>>>> with Flink.
>>>>> This is the command I tried:
>>>>>
>>>>> ./bin/flink run -py examples/python/datastream/word_count.py
>>>>>
>>>>> Below are the results I got with different setups:
>>>>>
>>>>> 1. On AWS EMR 6.8.0 (Flink 1.15.1):
>>>>> *Error: No module named 'google'*
>>>>> I tried with the Flink shipped with EMR, and with the binaries
>>>>> v1.15.1/v1.15.2 downloaded from the Flink site. I got that same error
>>>>> message in all cases.
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
>>>>>     "__main__", mod_spec)
>>>>>   File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
>>>>>     exec(code, run_globals)
>>>>>   File "/tmp/pyflink/622f19cc-6616-45ba-b7cb-20a1462c4c3f/a29eb7a4-fff1-4d98-9b38-56b25f97b70a/word_count.py", line 134, in <module>
>>>>>     word_count(known_args.input, known_args.output)
>>>>>   File "/tmp/pyflink/622f19cc-6616-45ba-b7cb-20a1462c4c3f/a29eb7a4-fff1-4d98-9b38-56b25f97b70a/word_count.py", line 89, in word_count
>>>>>     ds = ds.flat_map(split) \
>>>>>   File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/datastream/data_stream.py", line 333, in flat_map
>>>>>   File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/datastream/data_stream.py", line 557, in process
>>>>>   File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/fn_execution/flink_fn_execution_pb2.py", line 23, in <module>
>>>>> ModuleNotFoundError: No module named 'google'
>>>>>
>>>>> org.apache.flink.client.program.ProgramAbortException:
>>>>> java.lang.RuntimeException: Python process exits with code: 1
>>>>>
>>>>> 2. On my Mac, without a virtual environment (so `-pyclientexec
>>>>> python3` is included in the run command): got the same error as with
>>>>> EMR, but there's a stdout line from `print()` in the Python script:
>>>>>
>>>>>   File "/Users/lvh/dev/tmp/flink-1.15.2/opt/python/pyflink.zip/pyflink/datastream/data_stream.py", line 557, in process
>>>>>   File "<frozen zipimport>", line 259, in load_module
>>>>>   File "/Users/lvh/dev/tmp/flink-1.15.2/opt/python/pyflink.zip/pyflink/fn_execution/flink_fn_execution_pb2.py", line 23, in <module>
>>>>> ModuleNotFoundError: No module named 'google'
>>>>>
>>>>> Executing word_count example with default input data set.
>>>>> Use --input to specify file input.
>>>>>
>>>>> org.apache.flink.client.program.ProgramAbortException:
>>>>> java.lang.RuntimeException: Python process exits with code: 1
>>>>>
>>>>> 3. On my Mac, with a virtual environment and the Python package
>>>>> `apache-flink` installed: Flink tried to connect to localhost:8081
>>>>> (I don't know why), and failed with 'connection refused'.
>>>>>
>>>>> Caused by: org.apache.flink.util.concurrent.FutureUtils$RetryException:
>>>>> Could not complete the operation. Number of retries has been exhausted.
>>>>>   at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:395)
>>>>>   ... 21 more
>>>>> Caused by: java.util.concurrent.CompletionException:
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>>>>> Connection refused: localhost/127.0.0.1:8081
>>>>>
>>>>> 4. If I run that same example job using Python: `python word_count.py`,
>>>>> then it runs well.
>>>>>
>>>>> I tried with both v1.15.2 and v1.15.1, and Python 3.7 & 3.8, and got
>>>>> the same result.
>>>>>
>>>>> Could someone please help?
>>>>>
>>>>> Thanks.
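
A quick way to confirm the site-packages mismatch described in the thread (packages landing in `~/.local/...` while the interpreter Flink launches never looks there) is to ask the interpreter itself what it sees. This is a generic diagnostic sketch, not something from the thread; the guarded `google.protobuf` check mirrors the failing import in the traceback above:

```python
import importlib.util
import site
import sys

# Which interpreter is actually running? `flink run` uses the Python it
# resolves (or the one given via -pyclientexec), which may differ from
# the one pip installed packages into.
print("interpreter:", sys.executable)

# User site-packages, e.g. ~/.local/lib/python3.7/site-packages; pip
# installs here with --user, or when system site-packages is read-only.
print("user site-packages:", site.getusersitepackages())

# Reproduce the failing import without crashing: find_spec raises
# ModuleNotFoundError when the parent package `google` is absent.
try:
    ok = importlib.util.find_spec("google.protobuf") is not None
except ModuleNotFoundError:
    ok = False
print("google.protobuf importable:", ok)
```

If this prints `False` when run with the exact interpreter Flink uses, installing with that interpreter (`/usr/bin/python3 -m pip install apache-flink` rather than a bare `pip`) matches the fix that resolved setup 1 above.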
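
The virtual-environment approach Biao points to in the release-1.15 FAQ can be sketched roughly as below. The venv name `pyflink-venv` and the assumption that a Flink 1.15.2 distribution is unpacked in the current directory are illustrative, not from the thread:

```shell
# Create an isolated venv and install the pyflink release matching the
# Flink distribution into it.
python3 -m venv pyflink-venv
. pyflink-venv/bin/activate
pip install "apache-flink==1.15.2"

# Point the client-side Python at the venv interpreter so pyflink and
# its dependencies (py4j, cloudpickle, protobuf, ...) are found there
# instead of in the system Python.
./bin/flink run \
    -pyclientexec "$(pwd)/pyflink-venv/bin/python" \
    -py examples/python/datastream/word_count.py
```

As noted earlier in the thread, `flink run` still expects a running cluster, so `start-cluster.sh` (or an existing session cluster) is needed before the last command.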