Re: Udf Performance and Object Creation

Flavio Pompermaier Fri, 14 Aug 2015 09:10:30 -0700

Hi Stephan, thanks for the reply!
Now it's clearer. If I understood correctly, map and mapPartition are the
same iff I have only one slot per task manager, right?
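
To make the comparison concrete, here is a minimal sketch in plain Java of how one function instance is driven in each case. RecordMapper, PartitionMapper and ParallelTaskDemo are hypothetical stand-ins written for this illustration (they are not Flink classes), and the loop only imitates what a single parallel task does with its share of the data. As described in the quoted reply below, the function object is created once per parallel task in both cases; what differs is how often the method is invoked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical stand-in for a per-record map function (not the Flink interface).
interface RecordMapper<T, O> {
    O map(T value);
}

// Hypothetical stand-in for a per-partition function; Consumer plays the role of a collector.
interface PartitionMapper<T, O> {
    void mapPartition(Iterable<T> values, Consumer<O> out);
}

class UpperCaseMapper implements RecordMapper<String, String> {
    int calls = 0; // number of times map() was invoked on this instance

    @Override
    public String map(String value) {
        calls++;
        return value.toUpperCase(Locale.ROOT);
    }
}

class UpperCasePartitionMapper implements PartitionMapper<String, String> {
    int calls = 0; // number of times mapPartition() was invoked on this instance

    @Override
    public void mapPartition(Iterable<String> values, Consumer<String> out) {
        calls++;
        for (String value : values) {
            out.accept(value.toUpperCase(Locale.ROOT));
        }
    }
}

class ParallelTaskDemo {
    // Imitates one parallel task: the same instance sees every record, one call each.
    static List<String> runMap(RecordMapper<String, String> fn, List<String> partition) {
        List<String> out = new ArrayList<>();
        for (String record : partition) {
            out.add(fn.map(record));
        }
        return out;
    }

    // Imitates one parallel task: the instance is called once with the whole partition.
    static List<String> runMapPartition(PartitionMapper<String, String> fn, List<String> partition) {
        List<String> out = new ArrayList<>();
        fn.mapPartition(partition, out::add);
        return out;
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("a", "b", "c");

        UpperCaseMapper mapper = new UpperCaseMapper();
        System.out.println(runMap(mapper, partition) + " calls=" + mapper.calls);

        UpperCasePartitionMapper partitionMapper = new UpperCasePartitionMapper();
        System.out.println(runMapPartition(partitionMapper, partition) + " calls=" + partitionMapper.calls);
    }
}
```

Both runs produce the same output records; the per-record variant is invoked three times, the per-partition variant once.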


I was convinced I had posted those questions in this thread as the 3rd or 4th
message. Didn't I?
On 14 Aug 2015 17:57, "Stephan Ewen" <se...@apache.org> wrote:

> Hi!
>
> (1) A mapper is created once per parallel task. So if you create a program
> that runs a map() transformation with a parallelism of n, you will have n
> mapper instances in the cluster. Some may be on the same TaskManager, if
> the TaskManager has multiple slots.
>
> (2) I would really like that. But it means Java has to deal with both
> managed and unmanaged memory at the same time, which is quite a heavy
> addition. C# has some form of support for that.
>
> BTW: Where did you originally post these questions? I have not seen them
> before...
>
> On Fri, Aug 14, 2015 at 5:43 PM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> Any insight about these 2 questions..?
>> On 12 Aug 2015 17:38, "Flavio Pompermaier" <pomperma...@okkam.it> wrote:
>>
>>> This is something I've never understood in depth: isn't a mapper created
>>> for each record? If it's created only once per task manager, then it's not so
>>> different from mapPartition. What am I missing here?
>>>
>>> And then a more philosophical question: all big data frameworks need
>>> to manage memory very efficiently somehow (Flink even has to reserve
>>> a fraction of the entire memory in order to have control over it). Wouldn't
>>> it be simpler if Java finally released some APIs (even marked as unsafe,
>>> it doesn't change that much) to allow full control of the
>>> memory? It would make a lot of sense for all big data platforms (at least
>>> for non-UDF code...).
>>>
>>> Best,
>>> Flavio
>>> On 12 Aug 2015 12:44, "Timo Walther" <twal...@apache.org> wrote:
>>>
>>>> Hello Michael,
>>>>
>>>> whenever you write a Java program, you should avoid object creation if
>>>> you want it to be efficient, because every created object needs to be
>>>> garbage collected later (which slows your program down).
>>>> You can have small Pojos; just try to avoid calling "new" in your
>>>> functions:
>>>>
>>>> Instead of:
>>>>
>>>> class Mapper implements MapFunction<String,Pojo> {
>>>>   public Pojo map(String s) {
>>>>     Pojo p = new Pojo();  // a new object for every record
>>>>     p.f = s;
>>>>     return p;
>>>>   }
>>>> }
>>>>
>>>> do:
>>>>
>>>> class Mapper implements MapFunction<String,Pojo> {
>>>>   private Pojo p = new Pojo();  // created once per Mapper instance
>>>>   public Pojo map(String s) {
>>>>     p.f = s;
>>>>     return p;
>>>>   }
>>>> }
>>>>
>>>> Then an object is only created once per Mapper and not per record.
>>>>
>>>> Hope this helps.
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>>
>>>>
>>>> On 12.08.2015 11:53, Michael Huelfenhaus wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> I have a question about programming user-defined functions: is it
>>>>> still the case, as in the old Stratosphere times, that object creation
>>>>> should be avoided at all costs? I ask because some of the examples now
>>>>> create Tuples and other objects before returning them.
>>>>>
>>>>> I am going to have a streaming plan with at least 6 steps, and I am
>>>>> going to use Pojos. Performance-wise, is it a big improvement to define
>>>>> one big Pojo that can be used by all the steps, or is it better to have
>>>>> smaller ones that send less data but create more objects?
>>>>>
>>>>> Thanks
>>>>> Michael
>>>>>
>>>>
>>>>
>
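
As a runnable illustration of the two Mapper variants quoted above, the following sketch counts how many Pojo instances each one constructs. It is only a sketch under stated assumptions: SimpleMapFunction is a hypothetical stand-in so the code compiles without Flink on the classpath, the `created` counter and the class names are invented for this example, and the final lines show a general Java aliasing pitfall of reusing one object (every buffered reference points at the same instance), not a statement about how Flink handles records internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a map function interface (not the Flink class).
interface SimpleMapFunction<T, O> {
    O map(T value);
}

class Pojo {
    static int created = 0; // counts constructor calls, for illustration only
    String f;

    Pojo() {
        created++;
    }
}

// Allocates a fresh Pojo for every record.
class AllocatingMapper implements SimpleMapFunction<String, Pojo> {
    @Override
    public Pojo map(String s) {
        Pojo p = new Pojo();
        p.f = s;
        return p;
    }
}

// Reuses a single Pojo for the lifetime of the mapper instance.
class ReusingMapper implements SimpleMapFunction<String, Pojo> {
    private final Pojo p = new Pojo();

    @Override
    public Pojo map(String s) {
        p.f = s;
        return p;
    }
}

class ObjectReuseDemo {
    // Creates one mapper, feeds it all records, and returns how many Pojos were constructed.
    static int countAllocations(Supplier<SimpleMapFunction<String, Pojo>> factory,
                                List<String> records) {
        int before = Pojo.created;
        SimpleMapFunction<String, Pojo> mapper = factory.get();
        for (String record : records) {
            mapper.map(record);
        }
        return Pojo.created - before;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c");

        System.out.println(countAllocations(AllocatingMapper::new, records)); // prints 3
        System.out.println(countAllocations(ReusingMapper::new, records));    // prints 1

        // Pitfall: if the caller keeps the returned references, they all alias one object.
        ReusingMapper mapper = new ReusingMapper();
        List<Pojo> buffered = new ArrayList<>();
        for (String record : records) {
            buffered.add(mapper.map(record));
        }
        System.out.println(buffered.get(0).f); // prints c, not a
    }
}
```

The reuse variant trades allocations for shared mutable state, so it is only safe when nothing downstream holds on to a previously returned object.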
