Thanks Fabian,

I have seen Side Outputs and OutputTags but haven't fully understood the
mechanics yet. In my case, I don't need to keep the bad records... and I
think I will end up with flatMap() after all; it just becomes an internal
documentation issue to provide the relevant information...
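For the archives, the pattern I'm ending up with looks roughly like the
sketch below. This is a Flink-free mock: the Collector interface here only
mimics Flink's org.apache.flink.util.Collector, and all names are mine.

```java
import java.util.ArrayList;
import java.util.List;

public class DiscardingFlatMap {
    // Minimal stand-in for Flink's Collector; flatMap may call collect()
    // zero times, which is how a bad record gets dropped.
    interface Collector<T> { void collect(T record); }

    // Parses a line into an integer; emits nothing for unparseable input,
    // so the bad record never enters the stream.
    static void flatMap(String value, Collector<Integer> out) {
        try {
            out.collect(Integer.parseInt(value.trim()));
        } catch (NumberFormatException e) {
            // bad record: no collect() call, silently discarded
        }
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>();
        for (String line : new String[] {"1", "oops", "3"}) {
            flatMap(line, results::add);
        }
        System.out.println(results); // prints [1, 3]
    }
}
```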

Thanks for your response.
Niclas

On Mon, Feb 19, 2018 at 8:46 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Niclas,
>
> I'd either add a Filter to directly discard bad records. That should make
> the behavior explicit.
> If you need to do complex transformations that you don't want to do twice,
> the FlatMap approach would be the most efficient.
> If you'd like to keep the bad records, you can implement a ProcessFunction
> and add a side output [1] that collects bad records.
>
> Hope this helps,
> Fabian
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/side_output.html
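(Inline note for the archive: my reading of [1] is that the side-output
pattern amounts to something like the sketch below. This is a Flink-free
mock, not Flink's API: the badRecords list merely plays the role of an
OutputTag side output, and all names are mine.)

```java
import java.util.ArrayList;
import java.util.List;

// Flink-free mock of the side-output idea: good records go to the main
// output, bad records are routed to a separate sink instead of being lost.
public class SideOutputSketch {
    final List<Integer> mainOutput = new ArrayList<>();
    final List<String> badRecords = new ArrayList<>(); // stands in for an OutputTag

    // Mimics a ProcessFunction's processElement: emit to the main output,
    // or divert the raw record to the "side output" on failure.
    void processElement(String value) {
        try {
            mainOutput.add(Integer.parseInt(value.trim()));
        } catch (NumberFormatException e) {
            badRecords.add(value); // kept for inspection, not discarded
        }
    }

    public static void main(String[] args) {
        SideOutputSketch fn = new SideOutputSketch();
        for (String line : new String[] {"10", "n/a", "20"}) {
            fn.processElement(line);
        }
        System.out.println(fn.mainOutput); // prints [10, 20]
        System.out.println(fn.badRecords); // prints [n/a]
    }
}
```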
>
> 2018-02-19 10:29 GMT+01:00 Niclas Hedhman <nic...@apache.org>:
>
>> Hi again,
>>
>> Something I don't find (easily) in the documentation is the recommended
>> way to discard data from the stream.
>>
>> On one hand, I could always use flatMap(), even though it is "per
>> message", since that allows me to emit zero or one objects.
>>
>> DataStream<MyType> stream =
>>     env.addSource( source )
>>        .flatMap( new MyFunction() )
>>
>>
>> But that seems a bit misleading, as a casual observer will get the idea
>> that MyFunction 'branches' out, when it doesn't.
>>
>> The other "obvious" choice is to return null and follow with a filter...
>>
>> DataStream<MyType> stream =
>>     env.addSource( source )
>>        .map( new MyFunction() )
>>        .filter( Objects::nonNull )
>>
>> BUT, that doesn't work with a Java 8 method reference like the one above,
>> so I have to create my own filter to get the type information across to
>> Flink correctly;
>>
>> DataStream<MyType> stream =
>>     env.addSource( source )
>>        .map( new MyFunction() )
>>        .filter( new DiscardNullFilter<>() )
>>
>>
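(The DiscardNullFilter above is nothing magic. In Flink it would implement
FilterFunction<T>; this dependency-free sketch uses a plain Predicate<T>
instead, with the same one-line body. The class name is from my snippet,
the rest is assumption.)

```java
import java.util.function.Predicate;

// Dependency-free sketch of DiscardNullFilter; in Flink it would implement
// FilterFunction<T> rather than Predicate<T>, with the identical body.
public class DiscardNullFilter<T> implements Predicate<T> {
    @Override
    public boolean test(T value) {
        return value != null; // keep only non-null records
    }
}
```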
>> And in my opinion, that ends up looking ugly, as the streams/pipelines
>> (not used to the terminology yet) quickly accumulate many transformations
>> and branches, and having a null check after each one puts the burden of
>> knowledge in the wrong spot ("Can this function return null?").
>>
>> Throwing an exception shuts down the entire stream, which seems overly
>> aggressive for many data-related discards.
>>
>> Any other choices?
>>
>> Cheers
>> --
>> Niclas Hedhman, Software Developer
>> http://zest.apache.org - New Energy for Java
>>
>
>


-- 
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java
