The data set myTuples has its type via "myTuples.getType()".
The TypeSerializerOutputFormat implements a special interface
(InputTypeConfigurable) that picks up that type automatically.


If you want to use the type serializer output format, you can always do it
like this:

DataSet<Tuple2<String, Integer>> myTuples = ...;
myTuples.write(new TypeSerializerOutputFormat<Tuple2<String, Integer>>(),
    filePath);
env.execute();


For the Input Side:

TypeInformation<Tuple2<String, Integer>> type = myTuples.getType();

TypeSerializerInputFormat<Tuple2<String, Integer>> inFormat =
    new TypeSerializerInputFormat<Tuple2<String, Integer>>(type);
DataSet<Tuple2<String, Integer>> input = env.readFile(inFormat, filePath);

On Fri, Apr 24, 2015 at 10:33 AM, Stephan Ewen <se...@apache.org> wrote:

> I think you need not create any TypeInformation anyway. It is always
> present in the data set.
>
> DataSet<Tuple2<String, Integer>> myTuples = ...;
>
> myTuples.output(new TypeSerializerOutputFormat<Tuple2<String, Integer>>());
>
> On Fri, Apr 24, 2015 at 10:20 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> The BLOCK_SIZE_PARAMETER_KEY is used to split a file into processable
>> blocks. Since this is a binary file format, the InputFormat does not know
>> where a new record starts. When writing such a file, each block starts
>> with a new record and is filled until no more record fits in completely.
>> The remaining space up to the next block border is padded.
>>
>> As long as the BLOCK_SIZE_PARAMETER_KEY is larger than the maximum record
>> size and the Input- and OutputFormats use the same setting, the parameter
>> has only performance implications. Smaller settings waste more space but
>> allow for higher read parallelism (although too much parallelism causes
>> scheduling overhead). I'd simply set it to 64 MB and experiment with
>> smaller and larger settings if performance is a major concern here.
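>>
>> As a sketch, setting the block size could look like this (assuming the
>> key is passed to the sink via withParameters(); myTuples, filePath, and
>> the 64 MB value are placeholders):
>>
>> // org.apache.flink.configuration.Configuration
>> Configuration conf = new Configuration();
>> // Writer and reader must agree on the block size.
>> conf.setLong(BinaryOutputFormat.BLOCK_SIZE_PARAMETER_KEY, 64L * 1024 * 1024);
>> conf.setLong(BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY, 64L * 1024 * 1024);
>>
>> myTuples.write(new TypeSerializerOutputFormat<Tuple2<String, byte[]>>(), filePath)
>>         .withParameters(conf);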
>>
>> You don't need to create a sample tuple to create a TypeInformation for
>> it.
>>
>> private TupleTypeInfo<Tuple2<String, byte[]>> tInfo =
>>     new TupleTypeInfo<Tuple2<String, byte[]>>(
>>         BasicTypeInfo.STRING_TYPE_INFO,
>>         PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
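>>
>> That TypeInformation can then go straight into the input format; a minimal
>> sketch (filePath and env are placeholders, and the constructor taking a
>> TypeInformation is assumed, as in the snippet at the top of the thread):
>>
>> TypeSerializerInputFormat<Tuple2<String, byte[]>> inFormat =
>>     new TypeSerializerInputFormat<Tuple2<String, byte[]>>(tInfo);
>> DataSet<Tuple2<String, byte[]>> input = env.readFile(inFormat, filePath);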
>>
>> 2015-04-24 9:54 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> I managed to read and write Avro files, but I still have two doubts:
>>>
>>> Which size do I have to use for BLOCK_SIZE_PARAMETER_KEY?
>>> Do I really have to create a sample tuple to extract the TypeInformation
>>> needed to instantiate the TypeSerializerInputFormat?
>>>
>>> On Thu, Apr 23, 2015 at 7:04 PM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
>>>> I've searched within Flink for a working example of
>>>> TypeSerializerOutputFormat usage but didn't find anything usable.
>>>> Could you show me a simple snippet of code?
>>>> Do I have to configure BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY?
>>>> Which size do I have to use? Will Flink write a single file or a set of
>>>> Avro files in a directory?
>>>> Is it possible to read all files in a directory at once?
>>>>
>>>> On Thu, Apr 23, 2015 at 12:16 PM, Fabian Hueske <fhue...@gmail.com>
>>>> wrote:
>>>>
>>>>> Have you tried the TypeSerializerOutputFormat?
>>>>> This will serialize data using Flink's own serializers and write it to
>>>>> binary files.
>>>>> The data can be read back using the TypeSerializerInputFormat.
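>>>>>
>>>>> A minimal round trip might look like this (just a sketch; "path" and
>>>>> "env" are placeholders):
>>>>>
>>>>> DataSet<Tuple2<String, byte[]>> data = ...;
>>>>> data.write(new TypeSerializerOutputFormat<Tuple2<String, byte[]>>(), path);
>>>>> env.execute();
>>>>>
>>>>> TypeSerializerInputFormat<Tuple2<String, byte[]>> in =
>>>>>     new TypeSerializerInputFormat<Tuple2<String, byte[]>>(data.getType());
>>>>> DataSet<Tuple2<String, byte[]>> back = env.readFile(in, path);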
>>>>>
>>>>> Cheers, Fabian
>>>>>
>>>>> 2015-04-23 11:14 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>
>>>>>> Hi to all,
>>>>>>
>>>>>> in my use case I'd like to persist batches of Tuple2<String, byte[]>
>>>>>> within a directory.
>>>>>> Which is the most efficient way to achieve that in Flink?
>>>>>> I was thinking of using Avro, but I can't find an example of how to do
>>>>>> that.
>>>>>> Once generated, how can I (re)create a DataSet<Tuple2<String, byte[]>>
>>>>>> from it?
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>
>>>>
>>
>
