Thank you, yes, this makes sense. The broadcast data in my case would be a
large array of 3D coordinates.
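Just to confirm my understanding of the flow, here is the shape of what I'm after, sketched with plain Java streams rather than the Flink API (the names and the centering step are just placeholders):

```java
import java.util.Arrays;

public class BroadcastSketch {
    public static void main(String[] args) {
        // Stand-in for the large DataSet: 3D coordinates (x, y, z per point).
        double[][] coords = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };

        // Branch 1: reduce the data to one small value (the average).
        double avg = Arrays.stream(coords)
                .flatMapToDouble(Arrays::stream)
                .average()
                .orElse(0.0);

        // Branch 2: a second map over the SAME data that needs that average.
        // In Flink, this small value would arrive via a BroadcastSet.
        double[][] centered = Arrays.stream(coords)
                .map(p -> new double[] { p[0] - avg, p[1] - avg, p[2] - avg })
                .toArray(double[][]::new);

        System.out.println(avg);                        // prints 5.0
        System.out.println(Arrays.deepToString(centered));
    }
}
```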

On a side note, how can I retrieve the output of a reduce function? I can see
methods to write it to a given output (sink), but is it possible to get the
reduced result back into the driver program - for example, a double value
representing the average in the previous example.
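To put it in code, this is the kind of thing I'm after - sketched in plain Java rather than Flink, since I'm not sure of the exact API call (perhaps something like DataSet.collect()?):

```java
import java.util.stream.IntStream;

public class ReduceResultSketch {
    public static void main(String[] args) {
        // Stand-in for the DataSet of short values.
        short[] values = { 2, 4, 6, 8 };

        // "Reduce" the values to a single scalar and get it back in the
        // driver program, instead of writing it to an output/sink.
        double average = IntStream.range(0, values.length)
                .map(i -> values[i])
                .average()
                .orElse(0.0);

        System.out.println(average); // prints 5.0
    }
}
```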


On Tue, Feb 16, 2016 at 3:47 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> You can use so-called BroadcastSets to send any sufficiently small DataSet
> (such as a computed average) to any other function and use it there.
> However, in your case you'll end up with a data flow that branches (at the
> source) and merges again (when the average is sent to the second map).
> Such patterns can cause deadlocks and can therefore not be pipelined, which
> means that the data before the branch is written to disk and read again.
> In your case it might be even better to read the data twice instead of
> reading, writing, and reading it.
>
> Fabian
>
> 2016-02-16 21:15 GMT+01:00 Saliya Ekanayake <esal...@gmail.com>:
>
>> I looked at the samples and I think what you meant is clear, but I didn't
>> find a solution for my need. In my case, I want to use the result from
>> first map operation before I can apply the second map on the *same* data
>> set. For simplicity, let's say I've a bunch of short values represented as
>> my data set. Then I need to find their average, so I use a map and reduce.
>> Then I want to map these short values with another function, but it needs
>> that average computed in the beginning to work correctly.
>>
>> Is this possible without doing multiple reads of the input data to create
>> the same dataset?
>>
>> Thank you,
>> saliya
>>
>> On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske <fhue...@gmail.com>
>> wrote:
>>
>>> Yes, if you implement both maps in a single job, data is read once.
>>>
>>> 2016-02-16 15:53 GMT+01:00 Saliya Ekanayake <esal...@gmail.com>:
>>>
>>>> Fabian,
>>>>
>>>> I've a quick follow-up question on what you suggested. When streaming
>>>> the same data through different maps, were you implying that everything
>>>> goes as a single job in Flink, so the data read happens only once?
>>>>
>>>> Thanks,
>>>> Saliya
>>>>
>>>> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske <fhue...@gmail.com>
>>>> wrote:
>>>>
>>>>> It is not possible to "pin" data sets in memory, yet.
>>>>> However, you can stream the same data set through two different
>>>>> mappers at the same time.
>>>>>
>>>>> For instance you can have a job like:
>>>>>
>>>>>                  /---> Map 1 --> Sink1
>>>>> Source --<
>>>>>                  \---> Map 2 --> Sink2
>>>>>
>>>>> and execute it at once.
>>>>> For that, you define your data flow and call execute() once after all
>>>>> sinks have been created.
>>>>>
>>>>> Best, Fabian
>>>>>
>>>>> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake <esal...@gmail.com>:
>>>>>
>>>>>> Fabian,
>>>>>>
>>>>>> count() was just an example. What I would like to do is say run two
>>>>>> map operations on the dataset (ds). Each map will have its own
>>>>>> reduction, so is there a way to avoid creating two jobs for such a
>>>>>> scenario?
>>>>>>
>>>>>> The reason is, reading these binary matrices is expensive. In our
>>>>>> current MPI implementation, I am using memory maps for faster loading
>>>>>> and reuse.
>>>>>>
>>>>>> Thank you,
>>>>>> Saliya
>>>>>>
>>>>>> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske <fhue...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> it looks like you are executing two distinct Flink jobs.
>>>>>>> DataSet.count() triggers the execution of a new job. If you have an
>>>>>>> execute() call in your program, this will lead to two Flink jobs being
>>>>>>> executed.
>>>>>>> It is not possible to share state among these jobs.
>>>>>>>
>>>>>>> Maybe you should add a custom count implementation (using a
>>>>>>> ReduceFunction) which is executed in the same program as the other
>>>>>>> ReduceFunction.
>>>>>>>
>>>>>>> Best, Fabian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake <esal...@gmail.com>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I see that an InputFormat's open() and nextRecord() methods get
>>>>>>>> called for each terminal operation on a given dataset using that 
>>>>>>>> particular
>>>>>>>> InputFormat. Is it possible to avoid this - possibly using some caching
>>>>>>>> technique in Flink?
>>>>>>>>
>>>>>>>> For example, I have some code like the one below, and I see that for
>>>>>>>> the last two statements (reduce() and count()) the above methods in
>>>>>>>> the input format get called. Btw., this is a custom input format I
>>>>>>>> wrote to represent a binary matrix stored as Short values.
>>>>>>>>
>>>>>>>> ShortMatrixInputFormat smif = new ShortMatrixInputFormat();
>>>>>>>>
>>>>>>>> DataSet<Short[]> ds = env.createInput(smif,
>>>>>>>>     BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);
>>>>>>>>
>>>>>>>> MapOperator<Short[], DoubleStatistics> op = ds.map(...);
>>>>>>>>
>>>>>>>> op.reduce(...);
>>>>>>>> op.count();
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Saliya
>>>>>>>> --
>>>>>>>> Saliya Ekanayake
>>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>>> Indiana University, Bloomington
>>>>>>>> Cell 812-391-4914
>>>>>>>> http://saliya.org
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org
