Hi! I have some ideas, let me see if I can make them concrete until tomorrow...
Greetings, Stephan On Mon, Apr 27, 2015 at 5:29 PM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote: > Hi, > > I see. My Key class is an abstract class, which subclasses are Key1<?>, > Key2<?,?> etc, so it’s very like a tuple. It is heavily used in > “non-distributed” hash maps once the dataset is reduced to fit on a single > JVM. > > It exposes the common contract that I need (such as getHeadKey(), > getLastl(), or makeKey(Key,Object)) to “navigate” in the key space, and a > cached hash code to make hash maps faster. My generic algorithms do not > need to know how many fields are exposed in the Key, but they need to be > able to construct another key from two keys. > > > > Arnaud > > > > *De :* ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] *De la part > de* Stephan Ewen > *Envoyé :* vendredi 24 avril 2015 11:14 > *À :* user@flink.apache.org > *Objet :* Re: How to make a generic key for groupBy > > > > Hi Arnaud! > > > > Thank you for the warm words! Let's find a good way to get this to work... > > > > As a bit of background: > > In Flink, the API needs to now a bit about the types that go through the > functions, because Flink pre-generates and configures serializers, and > validates that things fit together. > > > > It is also important that keys are exposed rather specifically, because > Flink internally tries to work on serialized data (that makes it in-memory > operations predictable and robust). > > > > If you expose a key as a "String", or "long" or "double", then Flink knows > how to work on it in a binary fashion. > > Also, if you expose a key as a POJO, then Flink interprets the key as a > combination of the fields, and can again work on the serialized data. > > > > If you only expose "Comparable" (which is the bare minimum for a key), you > experience performance degradation (most notably for sorts), because every > key operation involves serialization and deserialization. > > > > So the goal would be to expose the key properly. We can always hint to the > API what the key type is, precisely for the cases where the inference > cannot do it. > > - To understand things a bit better: What is your "Key" type? Is it an > abstract class, an interface, a generic parameter? > > > > > > Greetings, > > Stephan > > > > > > FYI: In Scala, this works actually quite a bit easier, since Scala does > preserve generic types. In Java, we built a lot of reflection tooling, but > there are cases where it is impossible to infer the types via reflection, > like yours. > > > > > > > > On Thu, Apr 23, 2015 at 6:35 PM, Soumitra Kumar <kumar.soumi...@gmail.com> > wrote: > > Will you elaborate on your use case? It would help to find out where > Flink shines. IMO, its a great project, but needs more differentiation from > Spark. > > > > On Thu, Apr 23, 2015 at 7:25 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> > wrote: > > Hello, > > > > After a quite successful benchmark yesterday (Flink being about twice > faster than Spark on my use cases), I’ve turned instantly from spark-fan to > flink-fan – great job, committers! > > So I’ve decided to port my existing Spark tools to Flink. Happily, most of > the difficulty was renaming classes, packages and variables with “spark” in > them to something more neutral J > > > > However there is one easy thing in Spark I’m still wondering how to do in > Flink : generic keys. > > > > I’m trying to make a framework on which my applications are built. That > framework thus manipulate “generic types” representing the data, inheriting > from an abstract class with a common contract, let’s call it “Bean”. > > > > Among other things Bean exposes an abstract method > > *public* Key getKey(); > > > > Key being one of my core types used in several java algorithms. > > > > Let’s say I have the class : > > *public* *class* Framework<T *extends* Bean> *implements* Serializable { > > > > *public *DataSet<T> doCoolStuff(*final* DataSet<T> inputDataset) { > > // Group lines according to a key > > *final* UnsortedGrouping<YT> groupe = inputDataset.groupBy(*new* > KeySelector<T, Key>() { > > @Override > > *public* Key getKey(T record) { > > *return* record.getKey(); > > } > > }); > > (…) > > } > > } > > > > With Spark, a mapToPair works fine because all I have to do is implements > correctly hashCode() and equals() on my Key type. > > With Flink, Key is not recognized as a POJO object (well it is not) and > that does not work. > > > > I have tried to expose something like *public* Tuple getKeyAsTuple(); in Key > but Flink does not accept generic Tuples. I’ve tried to parameterize my > Tuple but Flink does not know how to infer > > the generic type value. > > > > So I’m wondering what is the best way to implement it. > > For now I have exposed something like *public* String getKeyAsString(); and > turned my generic treatment into : > > *final* UnsortedGrouping<YT> groupe = inputDataset.groupBy(*new* > KeySelector<T, String>() { > > @Override > > *public* String getKey(T record) { > > *return* record.getKey().getKeyAsString(); > > } > > }); > > But that “ASCII” representation is suboptimal. > > > > I thought of passing a key to tuple conversion lambda upon creation of the > Framework class but that would be boiler-plate code on the user’s end, > which I’m not fond of. > > > > So my questions are : > > - Is there a smarter way to do this ? > > - What kind of objects can be passed as a Key ? Is there an > Interface to respect ? > > - In the worst case, is byte[] ok as a Key ? (I can code the > serialization on the framework side…) > > > > > > Best regards, > > Arnaud > > > > > ------------------------------ > > > L'intégrité de ce message n'étant pas assurée sur internet, la société > expéditrice ne peut être tenue responsable de son contenu ni de ses pièces > jointes. Toute utilisation ou diffusion non autorisée est interdite. Si > vous n'êtes pas destinataire de ce message, merci de le détruire et > d'avertir l'expéditeur. > > The integrity of this message cannot be guaranteed on the Internet. The > company that sent this message cannot therefore be held liable for its > content nor attachments. Any unauthorized use or dissemination is > prohibited. If you are not the intended recipient of this message, then > please delete it and notify the sender. > > > > >