Re: Parallelism question

Maximilian Michels Tue, 14 Apr 2015 04:23:43 -0700

Hi Giacomo,

If you use a FileOutputFormat as a DataSink (e.g. by calling
writeAsText("/path") on a DataSet), the output directory will contain 5 files
named 1, 2, 3, 4, and 5, each containing the output of the corresponding
parallel task. The order of the data in each file follows the order of the
corresponding partition of the distributed DataSet. You can locally sort a
partition by a key using the sortPartition(..) method. It is only available
in 0.9.0-milestone-1 and 0.9-SNAPSHOT.
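
To make the difference between a local per-partition sort and a global order concrete, here is a minimal plain-Java sketch. It does not use the Flink API: the class name LocalSortSketch, the round-robin distribution, and the sample records are all assumptions made up for illustration, and the way a real job distributes records will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration only; this is NOT the Flink API.
class LocalSortSketch {

    // Distribute records round-robin over `parallelism` partitions.
    // (Assumed strategy for illustration; a real job may distribute differently.)
    static List<List<Integer>> partition(List<Integer> records, int parallelism) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            partitions.get(i % parallelism).add(records.get(i));
        }
        return partitions;
    }

    // Sort every partition on its own, analogous to a local per-partition sort.
    // No records move between partitions, so there is no global order afterwards.
    static void sortEachPartition(List<List<Integer>> partitions) {
        for (List<Integer> p : partitions) {
            Collections.sort(p);
        }
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(9, 3, 7, 1, 8, 2, 6, 4, 5, 0);
        List<List<Integer>> partitions = partition(records, 5);
        sortEachPartition(partitions);
        // One "file" per parallel task, numbered 1 to 5 as in the output directory.
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("file " + (i + 1) + ": " + partitions.get(i));
        }
    }
}
```

Each of the 5 "files" is sorted internally, but reading them one after another does not yield a globally sorted result, because each partition is sorted independently.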


Best,
Max



On Tue, Apr 14, 2015 at 12:12 PM, Giacomo Licari <giacomo.lic...@gmail.com>
wrote:

> Hi Max,
> thank you for your reply.
>
> Does the DataSink contain the data in order, I mean, does it contain
> output1, output2 ... output5 in order? Or are they mixed?
>
> Thanks a lot,
> Giacomo
>
> On Tue, Apr 14, 2015 at 11:58 AM, Maximilian Michels <m...@apache.org>
> wrote:
>
>> Hi Giacomo,
>>
>> If I understand you correctly, you want your Flink job to execute with a
>> parallelism of 5. Just call setDegreeOfParallelism(5) on your
>> ExecutionEnvironment. That way, all operations, when possible, will be
>> performed using 5 parallel instances. This is also true for the DataSink
>> which will produce 5 files containing the output data from the parallel
>> instances.
>>
>> Best,
>> Max
>>
>>
>> On Tue, Apr 14, 2015 at 10:38 AM, Giacomo Licari <
>> giacomo.lic...@gmail.com> wrote:
>>
>>> Hi guys,
>>> I have a question about how parallelism works.
>>>
>>> If I have a large dataset and I would like to divide it into 5 blocks, can
>>> I pass each block of data to a fixed parallel process (for example, if I
>>> set up 5 processes)?
>>>
>>> And if the result data from each process does not arrive at the output in
>>> order, can I order it? For example:
>>>
>>> data from process 1
>>> data from process 2
>>> and so on
>>>
>>> Thank you guys!
>>>
>>
>>
>
