I don't think you need to create any TypeInformation at all; it is always available from the DataSet itself.
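For instance, the TypeInformation of an existing DataSet can be obtained with `getType()` instead of being built by hand. A minimal, untested sketch against the DataSet API of that era (the placeholder data source is hypothetical):

```java
// Hedged sketch: reuse the DataSet's own TypeInformation.
// "..." stands for whatever source produces your tuples.
DataSet<Tuple2<String, byte[]>> myTuples = ...;
TypeInformation<Tuple2<String, byte[]>> tInfo = myTuples.getType();
```

That TypeInformation (or a serializer created from it) can then be handed to the TypeSerializerInputFormat when reading the data back.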
DataSet<Tuple2<String, Integer>> myTuples = ...;
myTuples.output(new TypeSerializerOutputFormat<Tuple2<String, Integer>>());

On Fri, Apr 24, 2015 at 10:20 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> The BLOCK_SIZE_PARAMETER_KEY is used to split a file into processable
> blocks. Since this is a binary file format, the InputFormat does not know
> where a new record starts. When writing such a file, each block starts with
> a new record and is filled until no more records fit completely. The
> remaining space until the next block border is padded.
>
> As long as the BLOCK_SIZE_PARAMETER_KEY is larger than the maximum record
> size and the Input- and OutputFormats use the same setting, the parameter
> has only performance implications. Smaller settings waste more space but
> allow for higher read parallelism (too much parallelism causes scheduling
> overhead). I'd simply set it to 64 MB and experiment with smaller and
> larger settings if performance is a major concern here.
>
> You don't need to create a sample tuple to create a TypeInformation for it:
>
> private TupleTypeInfo<Tuple2<String, byte[]>> tInfo =
>     new TupleTypeInfo<Tuple2<String, byte[]>>(
>         BasicTypeInfo.STRING_TYPE_INFO,
>         PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
>
> 2015-04-24 9:54 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> I managed to read and write Avro files, but I still have two doubts:
>>
>> Which size do I have to use for BLOCK_SIZE_PARAMETER_KEY?
>> Do I really have to create a sample tuple to extract the TypeInformation
>> needed to instantiate the TypeSerializerInputFormat?
>>
>> On Thu, Apr 23, 2015 at 7:04 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> I've searched within Flink for a working example of
>>> TypeSerializerOutputFormat usage but I didn't find anything usable.
>>> Could you show me a simple snippet of code?
>>> Do I have to configure BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY?
>>> Which size do I have to use?
>>> Will Flink write a single file or a set of Avro files in a directory?
>>> Is it possible to read all files in a directory at once?
>>>
>>> On Thu, Apr 23, 2015 at 12:16 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Have you tried the TypeSerializerOutputFormat?
>>>> It serializes data using Flink's own serializers and writes it to
>>>> binary files.
>>>> The data can be read back using the TypeSerializerInputFormat.
>>>>
>>>> Cheers, Fabian
>>>>
>>>> 2015-04-23 11:14 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>
>>>>> Hi to all,
>>>>>
>>>>> in my use case I'd like to persist batches of Tuple2<String, byte[]>
>>>>> within a directory.
>>>>> What is the most efficient way to achieve that in Flink?
>>>>> I was thinking of using Avro but I can't find an example of how to do
>>>>> that.
>>>>> Once generated, how can I (re)create a DataSet<Tuple2<String, byte[]>>
>>>>> from it?
>>>>>
>>>>> Best,
>>>>> Flavio
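Pulling the pieces of the thread together, a round trip might look roughly like the sketch below. This is untested and written against the DataSet API of that period, so constructor signatures (in particular whether TypeSerializerInputFormat takes a TypeInformation or a TypeSerializer) should be checked against your Flink version; the output path and the way the block size is passed via a Configuration are assumptions, not something confirmed in the thread.

```java
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Must exceed the largest record and be identical for writing and reading;
// 64 MB is the starting point suggested above.
final long blockSize = 64L * 1024 * 1024;

TupleTypeInfo<Tuple2<String, byte[]>> tInfo =
    new TupleTypeInfo<Tuple2<String, byte[]>>(
        BasicTypeInfo.STRING_TYPE_INFO,
        PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);

// --- write ---
DataSet<Tuple2<String, byte[]>> tuples = ...; // your data

TypeSerializerOutputFormat<Tuple2<String, byte[]>> out =
    new TypeSerializerOutputFormat<Tuple2<String, byte[]>>();
out.setSerializer(tInfo.createSerializer(env.getConfig()));
out.setOutputFilePath(new Path("hdfs:///tmp/tuples")); // hypothetical path

Configuration outConf = new Configuration();
outConf.setLong(BinaryOutputFormat.BLOCK_SIZE_PARAMETER_KEY, blockSize);
out.configure(outConf);

tuples.output(out); // with parallelism > 1, writes a set of files into the directory

// --- read back ---
TypeSerializerInputFormat<Tuple2<String, byte[]>> in =
    new TypeSerializerInputFormat<Tuple2<String, byte[]>>(tInfo);
in.setFilePath("hdfs:///tmp/tuples"); // pointing at the directory reads all files in it

Configuration inConf = new Configuration();
inConf.setLong(BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY, blockSize);
in.configure(inConf);

DataSet<Tuple2<String, byte[]>> restored = env.createInput(in, tInfo);
```

This also answers the directory questions: with parallelism greater than one the OutputFormat produces one file per parallel writer under the given path, and pointing the InputFormat at that directory reads them all back.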