Re: Job fails with FileNotFoundException from blobStore

2015-02-04 Thread Stephan Ewen
Hey Robert! On which version are you? 0.8 or 0.9-SNAPSHOT? On 04.02.2015 14:49, "Robert Waury" wrote: > Hi, > > I'm suddenly getting FileNotFoundExceptions because the blobStore cannot > find files in /tmp > > The job used to work in the exact same setup (same versions, same cluster, > same input

Re: Expressing `grep` with many search terms in Flink

2015-02-04 Thread Fabian Hueske
Hi Stefan, Flink uses only one broadcast variable for all parallel tasks on one machine. Flink can also load the broadcast variable into a custom data structure. Have a look at the getBroadcastVariableWithInitializer() method: /** * Returns the result bound to the broadcast variable identified
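Fabian's point about loading a broadcast variable into a custom data structure can be sketched roughly like this (a hedged sketch against the Flink Java API of that era, not code from the thread; the variable name "terms" and the mapper class are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.common.functions.BroadcastVariableInitializer;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Illustrative sketch: materialize the broadcast list into a HashSet.
// The initializer runs once per machine, and the resulting structure is
// shared by all parallel tasks on that machine.
public class TermFilterMapper extends RichMapFunction<String, Boolean> {

    private Set<String> terms;

    @Override
    public void open(Configuration parameters) {
        terms = getRuntimeContext().getBroadcastVariableWithInitializer(
            "terms",
            new BroadcastVariableInitializer<String, Set<String>>() {
                @Override
                public Set<String> initializeBroadcastVariable(Iterable<String> data) {
                    Set<String> set = new HashSet<>();
                    for (String s : data) {
                        set.add(s);
                    }
                    return set;
                }
            });
    }

    @Override
    public Boolean map(String value) {
        return terms.contains(value);
    }
}
```

The point of the initializer, as Fabian notes, is that the custom structure is built once per machine instead of once per parallel task, which matters when the broadcast set is large.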

Expressing `grep` with many search terms in Flink

2015-02-04 Thread Stefan Bunk
Hi Squirrels, I have some trouble expressing my use case in Flink terms, so I am asking for your help: I have five million documents and fourteen million search terms. For each search term, I want to know in how many documents it occurs. So basically a `grep` with very many search terms. My curre
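The core of the use case — for each search term, count the documents that contain it — can be sketched without any Flink machinery as plain Java. This is only an illustration of the counting logic on made-up data; at the five-million-document, fourteen-million-term scale of the thread, a distributed strategy (e.g. a broadcast set or a join, as discussed in the replies) is what is actually needed:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TermDocCount {

    // For each term, count how many documents contain it at least once.
    static Map<String, Integer> countDocsPerTerm(List<String> documents, Set<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // Deduplicate within a document: a term counts once per document,
            // no matter how often it occurs in it.
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
            for (String term : terms) {
                if (seen.contains(term)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "flink processes data streams",
            "streams of data flow through flink",
            "grep searches text");
        Set<String> terms = new HashSet<>(Arrays.asList("flink", "streams", "grep"));
        System.out.println(countDocsPerTerm(docs, terms));
    }
}
```

Note the inner loop iterates the terms per document; with fourteen million terms one would instead look up each document token in a term set, which is exactly the kind of broadcast-variable structure Fabian suggests in his reply.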

January 2015 in the Flink community

2015-02-04 Thread Kostas Tzoumas
Here is a digestible read on some January activity in the Flink community: http://flink.apache.org/news/2015/02/04/january-in-flink.html Highlights: - Flink 0.8.0 was released - The Flink community published a technical roadmap for 2015 - Flink was used to scale matrix factorization to extreme

Job fails with FileNotFoundException from blobStore

2015-02-04 Thread Robert Waury
Hi, I'm suddenly getting FileNotFoundExceptions because the blobStore cannot find files in /tmp. The job used to work in the exact same setup (same versions, same cluster, same input files). Flink version: 0.8 release HDFS: 2.3.0-cdh5.1.2 Flink trace: http://pastebin.com/SKdwp6Yt Any idea what cou

Re: CSV input with unknown # of fields and Custom output format

2015-02-04 Thread Stephan Ewen
Nice! BTW: The TypeSerializerInputFormat just changed (in the 0.9-SNAPSHOT master) so that it now takes the type information, rather than a type serializer... Stephan On Wed, Feb 4, 2015 at 11:52 AM, Vinh June wrote: > Thanks, > I just tried and it works with scala also. > > Small notice for

Re: CSV input with unknown # of fields and Custom output format

2015-02-04 Thread Vinh June
Thanks, I just tried and it works with Scala also. A small note for anyone who might be interested: the constructor of TypeSerializerInputFormat needs a TypeSerializer, not a TypeInformation. So this would work in Scala: [SCALA] val readback = env .

Re: CSV input with unknown # of fields and Custom output format

2015-02-04 Thread Stephan Ewen
Hi! I would go with the TypeSerializerInputFormat. Here is a code sample (in Java, Scala should work the same way): DataSet dataSet = ...; // write it out dataSet.write(new TypeSerializerOutputFormat(), "path"); // read it in DataS
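Stephan's code sample is cut off in the archive. A hedged reconstruction of the round-trip he describes (written against the 0.9-SNAPSHOT API, where the input format takes a TypeInformation per the follow-up in this thread; the `Tuple2<String, Integer>` element type and the `/tmp/data` path are illustrative, not from the original message):

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TypeSerializerInputFormat;
import org.apache.flink.api.java.io.TypeSerializerOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Integer>> dataSet = ...; // some existing data set

// write it out in Flink's binary serializer format
dataSet.write(new TypeSerializerOutputFormat<Tuple2<String, Integer>>(), "/tmp/data");

// read it back in; the input format is constructed from the type information
TypeInformation<Tuple2<String, Integer>> typeInfo = dataSet.getType();
DataSet<Tuple2<String, Integer>> readBack =
    env.readFile(new TypeSerializerInputFormat<Tuple2<String, Integer>>(typeInfo), "/tmp/data");
```

On the 0.8 release, per Vinh June's note earlier in the thread, the input format's constructor takes a TypeSerializer instead of a TypeInformation; the rest of the round-trip is the same.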

Re: CSV input with unknown # of fields and Custom output format

2015-02-04 Thread Vinh June
Hello Fabian and Stephan, Thank you both for your replies. @Fabian: Could you be kind enough to write a dummy example using SerializedOutputFormat and SerializedInputFormat? I tried the instruction below: dataset.write(new SerializedOutputFormat[MyClass], dataPath) but it doesn't work and

Re: CSV input with unknown # of fields and Custom output format

2015-02-04 Thread Stephan Ewen
Hi! Fabian refers to the "TypeSerializerOutputFormat" [1]. You can save your types in an efficient binary representation by calling 'dataset.write(new TypeSerializerOutputFormat(), "/your/file/path"); ' Greetings, Stephan [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/or