The BLOCK_SIZE_PARAMETER_KEY is used to split a file into processable blocks. Since this is a binary file format, the InputFormat cannot tell where a new record starts. When writing such a file, each block starts with a new record and is filled until no more records fit in completely. The remaining space up to the next block boundary is padded.
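To make the padding trade-off concrete, here is a rough back-of-the-envelope sketch in plain Java. It ignores the small per-block metadata that BinaryInputFormat also writes into each block, so the real numbers differ slightly; the record size is a made-up figure for illustration:

```java
public class BlockPacking {

    // Records are written whole into fixed-size blocks; the leftover
    // tail of each block is padding (wasted space).
    static long recordsPerBlock(long blockSize, long recordSize) {
        return blockSize / recordSize;
    }

    static long paddingPerBlock(long blockSize, long recordSize) {
        return blockSize % recordSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, the suggested default
        long recordSize = 5_000;            // hypothetical max record size in bytes

        // With these numbers: 13421 records fit per block,
        // and 3864 bytes per block are lost to padding.
        System.out.println(recordsPerBlock(blockSize, recordSize));
        System.out.println(paddingPerBlock(blockSize, recordSize));
    }
}
```

As you can see, with records much smaller than the block size the padding overhead is negligible; it only becomes significant when the block size approaches the record size.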
As long as the BLOCK_SIZE_PARAMETER_KEY is larger than the maximum record size and the Input- and OutputFormats use the same setting, the parameter has only performance implications. Smaller settings waste more space but allow for higher read parallelism (though too much parallelism causes scheduling overhead). I'd simply set it to 64 MB and experiment with smaller and larger settings if performance is a major concern here.

You don't need to create a sample tuple to create a TypeInformation for it:

    TupleTypeInfo<Tuple2<String, byte[]>> tInfo =
        new TupleTypeInfo<Tuple2<String, byte[]>>(
            BasicTypeInfo.STRING_TYPE_INFO,
            PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);

2015-04-24 9:54 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> I managed to read and write avro files and still I have two doubts:
>
> Which size do I have to use for BLOCK_SIZE_PARAMETER_KEY?
> Do I really have to create a sample tuple to extract the TypeInformation
> to instantiate the TypeSerializerInputFormat?
>
> On Thu, Apr 23, 2015 at 7:04 PM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> I've searched within flink for a working example of
>> TypeSerializerOutputFormat usage but I didn't find anything usable.
>> Could you show me a simple snippet of code?
>> Do I have to configure BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY? Which
>> size do I have to use? Will flink write a single file or a set of avro
>> files in a directory?
>> Is it possible to read all files in a directory at once?
>>
>> On Thu, Apr 23, 2015 at 12:16 PM, Fabian Hueske <fhue...@gmail.com>
>> wrote:
>>
>>> Have you tried the TypeSerializerOutputFormat?
>>> This will serialize data using Flink's own serializers and write it to
>>> binary files.
>>> The data can be read back using the TypeSerializerInputFormat.
>>>
>>> Cheers,
>>> Fabian
>>>
>>> 2015-04-23 11:14 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>
>>>> Hi to all,
>>>>
>>>> in my use case I'd like to persist within a directory batches of
>>>> Tuple2<String, byte[]>.
>>>> Which is the most efficient way to achieve that in Flink?
>>>> I was thinking to use Avro but I can't find an example of how to do
>>>> that.
>>>> Once generated, how can I (re)generate a DataSet<Tuple2<String, byte[]>>
>>>> from it?
>>>>
>>>> Best,
>>>> Flavio