Re: Execution environments for testing: local vs collection vs mini cluster

Biao Liu Fri, 26 Jul 2019 01:32:42 -0700

Hi Juan,

Sorry for the late reply.


1. The environments for data stream and data set jobs are not the same. An
obvious difference is that the data stream environments always carry a
"Stream" prefix. For example, StreamExecutionEnvironment is for data stream
jobs, while ExecutionEnvironment and CollectionEnvironment are for data set
jobs.

You can use StreamExecutionEnvironment.createLocalEnvironment to run or
test a data stream job, and ExecutionEnvironment.createLocalEnvironment or
CollectionEnvironment to run or test a data set job.
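
As a rough illustration of the data stream case, here is a minimal sketch
that runs a tiny job on a local environment and gathers the output in a
static list, following the collecting-sink pattern from the testing document
[1]. The names LocalStreamSketch, CollectSink, RESULTS and runDoubling are
made up for this example, and it assumes the Flink streaming Java dependency
is on the classpath.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class, not part of Flink.
public class LocalStreamSketch {

    // Static because Flink serializes the sink before running it, so
    // instance fields would not be shared with the test code.
    static final List<Integer> RESULTS =
            Collections.synchronizedList(new ArrayList<>());

    // Hypothetical sink that stores every record it receives.
    static class CollectSink implements SinkFunction<Integer> {
        @Override
        public void invoke(Integer value, Context context) {
            RESULTS.add(value);
        }
    }

    static List<Integer> runDoubling() throws Exception {
        RESULTS.clear();

        // Local environment: the job runs inside the current JVM.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(1);

        env.fromElements(1, 2, 3)
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * 2;
                    }
                })
                .addSink(new CollectSink());

        // Blocks until the bounded job has finished.
        env.execute("local-stream-sketch");

        // Sort a copy so the caller does not depend on arrival order.
        List<Integer> sorted = new ArrayList<>(RESULTS);
        Collections.sort(sorted);
        return sorted;
    }
}
```

A test would call LocalStreamSketch.runDoubling() and compare the returned
list with the expected values.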

You can also use StreamExecutionEnvironment.getExecutionEnvironment or
ExecutionEnvironment.getExecutionEnvironment, because they choose a local
environment automatically when the job runs standalone (in an IDE, or when
you execute the main method directly).
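
A small sketch of that fallback, assuming the code runs standalone with the
Flink streaming Java dependency on the classpath and no test harness or
client has registered a context environment beforehand. ContextEnvSketch and
resolvesToLocal are names invented for this example.

```java
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical example class, not part of Flink.
public class ContextEnvSketch {

    // Returns true when getExecutionEnvironment() fell back to a local
    // environment, which is what is expected for a standalone run.
    static boolean resolvesToLocal() {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        return env instanceof LocalStreamEnvironment;
    }
}
```

If something has set a different environment as the context first (Juan
mentions MiniClusterWithClientResource doing this), the check would
presumably return false, so treat it as an illustration rather than a
guarantee.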

2. Regarding MiniCluster, IMO it is a bit internal. The MiniCluster runs as
the backend behind the local environment. I think the mini cluster plays a
subtly different role in Flink than it does in Hadoop.

3. I will try to answer your questions below.

> Which test execution environment is recommended for each test use case?
It depends on which mode you are testing, data stream or data set.

> For example I don't see why I would use CollectionEnvironment when I have
the local environment available and running on several threads. What is a
good use case for CollectionEnvironment?
The official documentation says "CollectionEnvironment is a low-overhead
approach for executing Flink programs". As I don't have much experience with
data set, I just checked the relevant code. The CollectionEnvironment does
not seem to start a mini cluster, so I believe it executes the job in a more
lightweight way. BTW, there is no equivalent environment for data stream.
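
To make the comparison concrete, here is a minimal sketch that runs the same
data set program once on a collection environment and once on a local
environment. It assumes the DataSet API of the Flink 1.x line and the Flink
Java dependency on the classpath. DataSetEnvSketch, doubleAll, onCollections
and onLocal are names invented for this example.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class, not part of Flink.
public class DataSetEnvSketch {

    // The job itself does not care which environment it was given.
    static List<Integer> doubleAll(ExecutionEnvironment env) throws Exception {
        List<Integer> out = env.fromElements(1, 2, 3)
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * 2;
                    }
                })
                .collect(); // triggers execution and returns the results

        // Sort a copy so the result does not depend on partition order.
        List<Integer> sorted = new ArrayList<>(out);
        Collections.sort(sorted);
        return sorted;
    }

    // Executes on Java collections inside the calling JVM.
    static List<Integer> onCollections() throws Exception {
        return doubleAll(ExecutionEnvironment.createCollectionsEnvironment());
    }

    // Executes on a multi-threaded local environment.
    static List<Integer> onLocal() throws Exception {
        return doubleAll(ExecutionEnvironment.createLocalEnvironment());
    }
}
```

Both methods should produce the same list, so a test can pick the collection
variant when startup cost matters and the local variant when it wants
behaviour closer to a real cluster.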

> Are all these 3 environments supported equally, or are some of them
expected to be deprecated?
They are not the same, as mentioned above.
If a class is deprecated, it is marked with the @Deprecated annotation.
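
If you want to check this programmatically rather than by reading the
source, a reflection lookup is enough, since java.lang.Deprecated is
retained at runtime. The sketch below uses only the JDK. DeprecationCheck,
OldApi and CurrentApi are placeholder names for this example, not Flink
classes.

```java
// Hypothetical example class, not part of Flink.
public class DeprecationCheck {

    // Placeholder standing in for a deprecated class.
    @Deprecated
    static class OldApi {}

    // Placeholder standing in for a class that is still supported.
    static class CurrentApi {}

    // True when the class itself carries the @Deprecated annotation.
    static boolean isDeprecated(Class<?> type) {
        return type.isAnnotationPresent(Deprecated.class);
    }
}
```

With Flink on the classpath you could pass an environment class instead of
the placeholders. Note that this only inspects the class itself, not its
individual methods.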

> Are there any additional execution environments that could be useful for
testing on a single host?
I would suggest following the official documents [1][2] which you have
discovered, even though there might be other ways that seem equivalent. If
you depend on an internal implementation, it might change over time without
any notice.


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/testing.html#integration-testing
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/local_execution.html


On Tue, Jul 23, 2019 at 11:30 PM Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com> wrote:

> Hi Biao,
>
> Thanks for your answer.
>
> 1. Integration tests for my project.
> 2. Both data stream and data sets
>
>
>
> On Mon, Jul 22, 2019 at 11:44 PM Biao Liu <mmyy1...@gmail.com> wrote:
>
>> Hi Juan,
>>
>> I'm not sure what you really want. Before giving some suggestions, could
>> you answer the questions below first?
>>
>> 1. Do you want to write a unit test (or integration test) case for your
>> project or for Flink? Or just want to run your job locally?
>> 2. Which mode do you want to test? DataStream or DataSet?
>>
>>
>>
>> On Tue, Jul 23, 2019 at 1:12 PM Juan Rodríguez Hortalá
>> <juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/local_execution.html
>>> and
>>> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/minicluster/MiniCluster.html
>>> I see there are 3 ways to create an execution environment for testing:
>>>
>>>    - StreamExecutionEnvironment.createLocalEnvironment and
>>>    ExecutionEnvironment.createLocalEnvironment create an execution
>>>    environment running on a single JVM using different threads.
>>>    - CollectionEnvironment runs on a single JVM on a single thread.
>>>    - I haven't found much documentation on the Mini Cluster, but it
>>>    sounds similar to the Hadoop MiniCluster
>>>    <https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CLIMiniCluster.html>.
>>>    If that is the case, then it would run on many local JVMs, each of them
>>>    running multiple threads.
>>>
>>> Am I correct about the Mini Cluster? Is there any additional
>>> documentation about it? I discovered it looking at the source code of
>>> AbstractTestBase, that is mentioned on
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/testing.html#integration-testing.
>>> Also, it looks like launching the mini cluster registers it somewhere, so
>>> subsequent calls to `StreamExecutionEnvironment.getExecutionEnvironment`
>>> return an environment that uses the mini cluster. Is that performed by
>>> `executionEnvironment.setAsContext()` in
>>> https://github.com/apache/flink/blob/master/flink-test-utils-parent/flink-test-utils/src/main/java/org/apache/flink/test/util/MiniClusterWithClientResource.java#L56
>>> ? Is that execution environment registration process documented anywhere?
>>>
>>> Which test execution environment is recommended for each test use case?
>>> For example I don't see why I would use CollectionEnvironment when I have
>>> the local environment available and running on several threads. What is a
>>> good use case for CollectionEnvironment?
>>>
>>> Are all these 3 environments supported equally, or are some of them
>>> expected to be deprecated?
>>>
>>> Are there any additional execution environments that could be useful for
>>> testing on a single host?
>>>
>>> Thanks,
>>>
>>> Juan
>>>
>>>
>>>
