Hi Julien,

Could you give more details about the problems you faced? Here is a working
example with a Spark DataFrame and with Spark SQL:
https://gist.github.com/radcheb/d16042d8bb3815d3dd42030ecedc43cf
Cheers,
Radhwane Chebaane

On Tue, Jul 18, 2017 at 18:21 GMT+02:00, Julien CHAMP <jch...@tellmeplus.com> wrote:
> Hi Radhwane!
>
> I've tested both your solutions, with the DataFrame API and with Spark SQL,
> and in both cases Spark is stuck :/
> Did you test the code that you gave me? I don't know if I've done
> something wrong...
>
> Regards,
> Julien
>
> On Mon, Jul 10, 2017 at 10:53, Radhwane Chebaane <r.cheba...@mindlytix.com> wrote:
>>
>> Hi Julien,
>>
>> - Window functions usually require less shuffling than a cross join, so
>> they are a little faster, depending on the use case. For large windows,
>> cross-join and window-function performance are close.
>> - You can use UDFs and UDAFs as in any Spark SQL query (a geometric mean
>> UDAF tested successfully).
>>
>> Regards,
>> Radhwane
>>
>> On Thu, Jul 6, 2017 at 16:22 GMT+02:00, Julien CHAMP <jch...@tellmeplus.com> wrote:
>>>
>>> Thanks a lot for your answer Radhwane :)
>>>
>>> I have some (many) use cases that need Long ranges in window functions.
>>> As said in the bug report, I can store event timestamps in ms in a
>>> DataFrame, and want to count the number of events in the past 10 years
>>> (requiring a Long range).
>>>
>>> -> *Let's imagine that this window is used on timestamp values in ms:
>>> I can ask for a window with a range of [-2160000000L, 0] and only have a
>>> few values inside it, not necessarily 2160000000. I can understand the
>>> limitation for the rowsBetween() method, but the rangeBetween() method is
>>> nice for this kind of usage.*
>>>
>>> The solution with a self-join seems nice, but 2 questions:
>>>
>>> - regarding performance, will it be as fast as a window function?
>>>
>>> - can I use my own aggregate function (for example a geometric mean) with
>>> your solution? (using this:
>>> https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html )
>>>
>>> Thanks again,
>>>
>>> Regards,
>>>
>>> Julien
>>>
>>> On Wed, Jul 5, 2017 at 19:18, Radhwane Chebaane <r.cheba...@mindlytix.com> wrote:
>>>>
>>>> Hi Julien,
>>>>
>>>> Although this is a strange bug in Spark, it's rare to need more than
>>>> Integer max value as a window size.
>>>>
>>>> Nevertheless, most window functions can be expressed with self-joins.
>>>> Hence, your problem may be solved with this example:
>>>>
>>>> If the input data is as follows:
>>>>
>>>> +---+-------------+-----+
>>>> | id|    timestamp|value|
>>>> +---+-------------+-----+
>>>> |  B|1000000000000|  100|
>>>> |  B|1001000000000|   50|
>>>> |  B|1002000000000|  200|
>>>> |  B|2500000000000|  500|
>>>> +---+-------------+-----+
>>>>
>>>> and the window is (-2000000000L, 0), then this code will give the
>>>> wanted result:
>>>>
>>>> df.as("df1").join(df.as("df2"),
>>>>     $"df2.timestamp" between($"df1.timestamp" - 2000000000L, $"df1.timestamp"))
>>>>   .groupBy($"df1.id", $"df1.timestamp", $"df1.value")
>>>>   .agg(functions.min($"df2.value").as("min___value"))
>>>>   .orderBy($"df1.timestamp")
>>>>   .show()
>>>>
>>>> +---+-------------+-----+-----------+
>>>> | id|    timestamp|value|min___value|
>>>> +---+-------------+-----+-----------+
>>>> |  B|1000000000000|  100|        100|
>>>> |  B|1001000000000|   50|         50|
>>>> |  B|1002000000000|  200|         50|
>>>> |  B|2500000000000|  500|        500|
>>>> +---+-------------+-----+-----------+
>>>>
>>>> Or with Spark SQL:
>>>>
>>>> SELECT c.id as id, c.timestamp as timestamp, c.value, min(c._value) as min___value
>>>> FROM (
>>>>   SELECT a.id as id, a.timestamp as timestamp, a.value as value,
>>>>          b.timestamp as _timestamp, b.value as _value
>>>>   FROM df a CROSS JOIN df b
>>>>   ON b.timestamp >= a.timestamp - 2000000000L AND b.timestamp <= a.timestamp
>>>> ) c
>>>> GROUP BY c.id, c.timestamp, c.value ORDER BY c.timestamp
>>>>
>>>> This should also be possible with Spark Streaming, but don't expect
>>>> high performance.
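The self-join logic above can be sanity-checked outside Spark. This is a plain-Python sketch (not Spark code); the `rows` data and the 2000000000 ms bound are taken from the example in the thread, and the function name `min_over_range` is made up for illustration:

```python
# Plain-Python sketch of the self-join range aggregation:
# for each row, take the min of values of the same id whose timestamp
# falls in [timestamp - window, timestamp].
rows = [  # (id, timestamp, value), from the example above
    ("B", 1_000_000_000_000, 100),
    ("B", 1_001_000_000_000, 50),
    ("B", 1_002_000_000_000, 200),
    ("B", 2_500_000_000_000, 500),
]
WINDOW = 2_000_000_000  # the (-2000000000L, 0) range, in ms


def min_over_range(rows, window):
    out = []
    for rid, ts, val in rows:
        # emulate: df1 JOIN df2 ON df2.timestamp BETWEEN
        #          df1.timestamp - window AND df1.timestamp
        in_range = [v for rid2, ts2, v in rows
                    if rid2 == rid and ts - window <= ts2 <= ts]
        out.append((rid, ts, val, min(in_range)))
    return out


for row in min_over_range(rows, WINDOW):
    print(row)
# The min___value column comes out as 100, 50, 50, 500,
# matching the table in the example.
```

Swapping `min` for another reducer (e.g. a geometric mean) answers the UDAF question in the same way: the self-join only collects the rows in range; any aggregate can then be applied per group.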
>>>> Cheers,
>>>> Radhwane
>>>>
>>>> On Wed, Jul 5, 2017 at 10:41 GMT+02:00, Julien CHAMP <jch...@tellmeplus.com> wrote:
>>>>>
>>>>> Hi there!
>>>>>
>>>>> Let me explain my problem, to see if you have a good solution to help me :)
>>>>>
>>>>> Let's imagine that I have all my data in a DB or a file, which I load
>>>>> into a DataFrame DF with the following columns:
>>>>>
>>>>> id | timestamp(ms) | value
>>>>> A  | 1000000       | 100
>>>>> A  | 1000010       | 50
>>>>> B  | 1000000       | 100
>>>>> B  | 1000010       | 50
>>>>> B  | 1000020       | 200
>>>>> B  | 2500000       | 500
>>>>> C  | 1000000       | 200
>>>>> C  | 1000010       | 500
>>>>>
>>>>> The timestamp is a *long value*, so as to be able to express dates in
>>>>> ms from 0000-01-01 to today!
>>>>>
>>>>> I want to compute operations such as min, max, and average on the
>>>>> *value column*, over a given window, grouped by id (bonus: if possible,
>>>>> for only some timestamps...).
>>>>>
>>>>> For example, given these 4 tuples:
>>>>>
>>>>> id | timestamp(ms) | value
>>>>> B  | 1000000       | 100
>>>>> B  | 1000010       | 50
>>>>> B  | 1000020       | 200
>>>>> B  | 2500000       | 500
>>>>>
>>>>> I would like to be able to compute the min value over a time window of
>>>>> 20 ms. This would result in this DF:
>>>>>
>>>>> id | timestamp(ms) | value | min___value
>>>>> B  | 1000000       | 100   | 100
>>>>> B  | 1000010       | 50    | 50
>>>>> B  | 1000020       | 200   | 50
>>>>> B  | 2500000       | 500   | 500
>>>>>
>>>>> This seems like the perfect use case for window functions in Spark (cf:
>>>>> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html ).
>>>>> I can use:
>>>>>
>>>>> val tw = Window.partitionBy("id").orderBy("timestamp").rangeBetween(-20, 0)
>>>>> df.withColumn("min___value", min(df.col("value")).over(tw))
>>>>>
>>>>> This leads to the perfect answer!
>>>>>
>>>>> However, there is a big bug with window functions, as reported here
>>>>> ( https://issues.apache.org/jira/browse/SPARK-19451 ), when working
>>>>> with Long values. So I can't use this....
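As background on SPARK-19451: the issue, as described in the JIRA, is that rangeBetween boundaries beyond Int.MaxValue are not accepted, so millisecond-scale ranges like the 2160000000 used earlier in the thread cannot be expressed. A small sketch (plain Python with `ctypes`, not Spark's actual code path) illustrates why such a bound is dangerous anywhere a 32-bit int is assumed:

```python
import ctypes

# 2160000000 (the window bound from the thread) exceeds Int.MaxValue,
# so forcing it through a signed 32-bit integer wraps it to a negative
# number; an illustration of the overflow, not Spark's internals.
bound = 2_160_000_000
int_max = 2**31 - 1          # 2147483647, Scala/Java Int.MaxValue
wrapped = ctypes.c_int32(bound).value

print(f"{bound} > Int.MaxValue ({int_max}): {bound > int_max}")
print(f"as 32-bit int it wraps to: {wrapped}")  # a negative value
```

So any range bound that must cover more than about 24.8 days of milliseconds (2147483647 ms) needs a genuine Long, which is exactly what the JIRA asks for.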
>>>>> So my question is (of course): how can I solve my problem?
>>>>> If I use Spark Streaming, will I face the same issue?
>>>>>
>>>>> I'll be glad to discuss this problem with you, feel free to answer :)
>>>>>
>>>>> Regards,
>>>>>
>>>>> Julien
>>>>> --
>>>>> Julien CHAMP — Data Scientist
>>>>>
>>>>> Web: www.tellmeplus.com — Email: jch...@tellmeplus.com
>>>>> Phone: 06 89 35 01 89 — LinkedIn: https://www.linkedin.com/in/julienchamp
>>>>>
>>>>> TellMePlus S.A — Predictive Objects
>>>>> Paris: 7 rue des Pommerots, 78400 Chatou
>>>>> Montpellier: 51 impasse des églantiers, 34980 St Clément de Rivière
>>>>>
>>>>> This email may contain confidential and/or privileged information for
>>>>> the intended recipient. If you are not the intended recipient, please
>>>>> contact the sender and delete all copies.
>>>>
>>>> --
>>>> Radhwane Chebaane
>>>> Distributed systems engineer, Mindlytix
>>>>
>>>> Mail: radhw...@mindlytix.com
>>>> Mobile: +33 695 588 906
>>>> Skype: rad.cheb
>>>> LinkedIn: https://fr.linkedin.com/in/radhwane-chebaane-483b3a7b