For the input side: the data set myTuples has its type available via "myTuples.getType()". The TypeSerializerOutputFormat implements a special interface (InputTypeConfigurable) that picks up that type automatically.
If you want to use the type serializer input, you can always do it like this:

DataSet<Tuple2<String, Integer>> myTuples = ...;

myTuples.write(new TypeSerializerOutputFormat<Tuple2<String, Integer>>(), filePath);
env.execute();

TypeInformation<Tuple2<String, Integer>> type = myTuples.getType();

TypeSerializerInputFormat<Tuple2<String, Integer>> inputFormat =
    new TypeSerializerInputFormat<Tuple2<String, Integer>>(type);
DataSet<Tuple2<String, Integer>> input = env.readFile(inputFormat, filePath);

On Fri, Apr 24, 2015 at 10:33 AM, Stephan Ewen <se...@apache.org> wrote:

> I think you need not create any TypeInformation anyway. It is always
> present in the data set.
>
> DataSet<Tuple2<String, Integer>> myTuples = ...;
>
> myTuples.output(new TypeSerializerOutputFormat<Tuple2<String, Integer>>());
>
> On Fri, Apr 24, 2015 at 10:20 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> The BLOCK_SIZE_PARAMETER_KEY is used to split a file into processable
>> blocks. Since this is a binary file format, the InputFormat does not know
>> where a new record starts. When writing such a file, each block starts
>> with a new record and is filled until no more records fit completely into
>> it. The remaining space up to the next block border is padded.
>>
>> As long as the BLOCK_SIZE_PARAMETER_KEY is larger than the maximum
>> record size and the Input- and OutputFormats use the same setting, the
>> parameter has only performance implications. Smaller settings waste more
>> space but allow for higher read parallelism (too much parallelism causes
>> scheduling overhead). I'd simply set it to 64 MB and experiment with
>> smaller and larger settings if performance is a major concern here.
>>
>> You don't need to create a sample tuple to create a TypeInformation
>> for it:
>>
>> private TupleTypeInfo<Tuple2<String, byte[]>> tInfo =
>>     new TupleTypeInfo<Tuple2<String, byte[]>>(
>>         BasicTypeInfo.STRING_TYPE_INFO,
>>         PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO
>>     );
>>
>> 2015-04-24 9:54 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> I managed to read and write Avro files, but I still have two doubts:
>>>
>>> Which size do I have to use for BLOCK_SIZE_PARAMETER_KEY?
>>> Do I really have to create a sample tuple to extract the
>>> TypeInformation needed to instantiate the TypeSerializerInputFormat?
>>>
>>> On Thu, Apr 23, 2015 at 7:04 PM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
>>>> I've searched within Flink for a working example of
>>>> TypeSerializerOutputFormat usage but I didn't find anything usable.
>>>> Could you show me a simple snippet of code?
>>>> Do I have to configure BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY?
>>>> Which size do I have to use? Will Flink write a single file or a set
>>>> of Avro files in a directory?
>>>> Is it possible to read all files in a directory at once?
>>>>
>>>> On Thu, Apr 23, 2015 at 12:16 PM, Fabian Hueske <fhue...@gmail.com>
>>>> wrote:
>>>>
>>>>> Have you tried the TypeSerializerOutputFormat?
>>>>> It will serialize data using Flink's own serializers and write it
>>>>> to binary files.
>>>>> The data can be read back in with the TypeSerializerInputFormat.
>>>>>
>>>>> Cheers, Fabian
>>>>>
>>>>> 2015-04-23 11:14 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>
>>>>>> Hi to all,
>>>>>>
>>>>>> in my use case I'd like to persist batches of Tuple2<String, byte[]>
>>>>>> within a directory.
>>>>>> What is the most efficient way to achieve that in Flink?
>>>>>> I was thinking of using Avro, but I can't find an example of how to
>>>>>> do that.
>>>>>> Once generated, how can I (re)generate a DataSet<Tuple2<String,
>>>>>> byte[]>> from it?
>>>>>>
>>>>>> Best,
>>>>>> Flavio
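
Pulling the pieces of this thread together, a complete round trip for
Tuple2<String, byte[]> might look roughly like the sketch below. It is an
untested assembly of the advice above, not code from the thread: the path,
sample data, job name, and class name are made up, the 64 MB block size
follows Fabian's suggestion, and the block size is handed to both formats
via withParameters() so that reader and writer agree.

import org.apache.flink.api.common.io.BinaryInputFormat;
import org.apache.flink.api.common.io.BinaryOutputFormat;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TypeSerializerInputFormat;
import org.apache.flink.api.java.io.TypeSerializerOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.typeutils.TupleTypeInfo;
import org.apache.flink.configuration.Configuration;

public class TypeSerializerRoundTrip {

    public static void main(String[] args) throws Exception {
        final String path = "/tmp/tuples.bin";     // made-up path
        final long blockSize = 64L * 1024 * 1024;  // 64 MB, per the suggestion above

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, byte[]>> myTuples = env.fromElements(
                new Tuple2<String, byte[]>("a", new byte[] { 1, 2, 3 }),
                new Tuple2<String, byte[]>("b", new byte[] { 4, 5 }));

        // Write side: no TypeInformation is passed; the output format
        // picks up the data set's type automatically.
        Configuration writeConf = new Configuration();
        writeConf.setLong(BinaryOutputFormat.BLOCK_SIZE_PARAMETER_KEY, blockSize);
        myTuples.write(new TypeSerializerOutputFormat<Tuple2<String, byte[]>>(), path)
                .withParameters(writeConf);
        env.execute("write tuples");

        // Read side: build the TypeInformation without a sample tuple,
        // as in Fabian's snippet above, and use the same block size.
        TypeInformation<Tuple2<String, byte[]>> tInfo =
                new TupleTypeInfo<Tuple2<String, byte[]>>(
                        BasicTypeInfo.STRING_TYPE_INFO,
                        PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);

        Configuration readConf = new Configuration();
        readConf.setLong(BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY, blockSize);

        DataSet<Tuple2<String, byte[]>> input = env
                .readFile(new TypeSerializerInputFormat<Tuple2<String, byte[]>>(tInfo), path)
                .withParameters(readConf);

        input.print();  // print() triggers execution of the read job
    }
}

With parallelism greater than one, the write produces a directory of part
files under the given path; FileInputFormat reads such a directory back in
one go, which should also cover the "read all files in a directory at once"
question.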