Re: Best way to calculate intermediate column statistics

2016-08-26 Thread Mich Talebzadeh
Hi Bedrytski, I assume you are referring to my code above. The alternative SQL would be (the first code with rank) SELECT * FROM ( SELECT transactiondate, transactiondescription, debitamount , RANK() OVER (ORDER BY transactiondate desc) AS rank FROM WHERE transactiondescriptio

Re: Best way to calculate intermediate column statistics

2016-08-26 Thread Bedrytski Aliaksandr
Hi Mich, I was wondering what are the advantages of using helper methods instead of one SQL multiline string? (I rarely (if ever) use helper methods, but maybe I'm missing something) Regards -- Bedrytski Aliaksandr sp...@bedryt.ski On Thu, Aug 25, 2016, at 11:39, Mich Talebzadeh wrote: > H

Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Mich Talebzadeh
Hi Richard, Windowing/Analytics for stats are pretty simple. Example import org.apache.spark.sql.expressions.Window val wSpec = Window.partitionBy('transactiontype).orderBy(desc("transactiondate")) df.filter('transactiondescription.contains(HASHTAG)).select('transactiondate,'transactiondescriptio
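
A minimal sketch of how the truncated windowing snippet above might be written out in full. The column names come from the message; the local SparkSession, the sample rows, the value of HASHTAG, and the object name are assumptions made only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, rank}

object WindowRankSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-rank-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumption: HASHTAG and the rows below are made-up sample data.
    val HASHTAG = "XYZ"
    val df = Seq(
      ("DEB", "2016-08-01", "XYZ groceries", 12.50),
      ("DEB", "2016-08-03", "XYZ fuel", 40.00),
      ("SO", "2016-08-02", "XYZ rent", 650.00),
      ("DEB", "2016-08-04", "other shop", 5.00)
    ).toDF("transactiontype", "transactiondate", "transactiondescription", "debitamount")

    // Rank rows within each transaction type, newest transaction first.
    val wSpec = Window.partitionBy($"transactiontype").orderBy(desc("transactiondate"))

    df.filter($"transactiondescription".contains(HASHTAG))
      .select(
        $"transactiondate",
        $"transactiondescription",
        $"debitamount",
        rank().over(wSpec).as("rank"))
      .show()

    spark.stop()
  }
}
```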

Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Richard Siebeling
Hi Mich, thanks for the suggestion, I hadn't thought of that. We'll need to gather the statistics in two ways: incrementally when new data arrives, and over the complete set when aggregating or filtering (because I think it's difficult to gather statistics while aggregating or filtering). The analyti
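
One possible way to handle the incremental case is to keep mergeable summaries (count, mean, and sum of squared deviations) per column and combine them as batches arrive. The sketch below is plain Scala and is not taken from the thread; the RunningStats class and the sample batches are hypothetical, and the merge step uses the standard pairwise formula for combining variances.

```scala
// Hypothetical mergeable summary: count, mean, and M2 (sum of squared deviations).
final case class RunningStats(n: Long, mean: Double, m2: Double) {

  // Combine two summaries without revisiting the underlying values.
  def merge(other: RunningStats): RunningStats =
    if (n == 0) other
    else if (other.n == 0) this
    else {
      val total = n + other.n
      val delta = other.mean - mean
      RunningStats(
        total,
        mean + delta * other.n / total,
        m2 + other.m2 + delta * delta * n * other.n / total)
    }

  // Fold a single new value into the summary.
  def add(x: Double): RunningStats = merge(RunningStats(1L, x, 0.0))

  // Sample variance; undefined for fewer than two values.
  def variance: Double = if (n > 1) m2 / (n - 1) else Double.NaN
}

object RunningStats {
  val empty: RunningStats = RunningStats(0L, 0.0, 0.0)

  def fromBatch(xs: Seq[Double]): RunningStats = xs.foldLeft(empty)(_ add _)
}

object IncrementalStatsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: two made-up batches standing in for "existing" and "new" data.
    val existing = RunningStats.fromBatch(Seq(1.0, 2.0, 3.0))
    val arrived = RunningStats.fromBatch(Seq(4.0, 5.0))

    val combined = existing.merge(arrived)
    // Prints: n=5 mean=3.0 variance=2.5
    println(s"n=${combined.n} mean=${combined.mean} variance=${combined.variance}")
  }
}
```

Statistics such as the median or percentiles do not merge this way, so they would still need a pass over the complete set or an approximate sketch structure.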

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Mich Talebzadeh
Hi Richard, can you use analytic functions for this purpose on the DataFrame? HTH, Dr Mich Talebzadeh

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Richard Siebeling
Hi Mich, I'd like to gather several statistics per column in order to make analysing data easier. These two statistics are some examples, other statistics I'd like to gather are the variance, the median, several percentiles, etc. We are building a data analysis platform based on Spark, kind rega
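
A rough sketch of gathering the statistics mentioned above (variance, median, percentiles) for one numeric column, assuming Spark 2.0 or later. The column name debitamount is borrowed from elsewhere in the thread; the sample values, the local SparkSession, and the object name are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min, stddev, variance}

object ColumnStatsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-stats-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumption: made-up sample values for a single numeric column.
    val df = Seq(12.5, 40.0, 650.0, 5.0, 18.75).toDF("debitamount")

    // Moment-based statistics in a single aggregation pass.
    df.agg(
      min($"debitamount").as("min"),
      max($"debitamount").as("max"),
      avg($"debitamount").as("mean"),
      variance($"debitamount").as("variance"),
      stddev($"debitamount").as("stddev")
    ).show()

    // Quartiles, including the median. A relative error of 0.0 asks for exact
    // quantiles, which can be expensive on large data; a small positive error
    // trades accuracy for speed.
    val Array(q25, median, q75) =
      df.stat.approxQuantile("debitamount", Array(0.25, 0.5, 0.75), 0.0)
    println(s"q25=$q25 median=$median q75=$q75")

    spark.stop()
  }
}
```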

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Mich Talebzadeh
Hi Richard, What is the business use case for such statistics? HTH, Dr Mich Talebzadeh

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Bedrytski Aliaksandr
Hi Richard, should these intermediate statistics be calculated from the result of the calculation or during the aggregation? If they can be derived from the resulting dataframe, why not cache (persist) that result just after the calculation? Then you may aggregate statistics from the cached dat
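
The caching suggestion above could look roughly like the following sketch: persist the aggregated result once, then compute statistics from the cached DataFrame. The column names, sample rows, local SparkSession, and object name are assumptions for illustration, not details from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object CachedStatsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cached-stats-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumption: made-up sample data.
    val df = Seq(
      ("DEB", 12.50),
      ("DEB", 40.00),
      ("SO", 650.00),
      ("SO", 25.00),
      ("CHQ", 80.00)
    ).toDF("transactiontype", "debitamount")

    // The "calculation": an aggregation whose result is reused afterwards.
    val result = df.groupBy($"transactiontype")
      .agg(sum($"debitamount").as("total"))
      .cache()

    // The first action materializes the cache; later actions read from it.
    println(s"groups: ${result.count()}")

    // Statistics derived from the cached result: count, mean, stddev, min, max.
    result.describe("total").show()

    result.unpersist()
    spark.stop()
  }
}
```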