Replacing groupBykey() with reduceByKey()

2018-08-03 Thread Bathi CCDB
I am trying to replace groupByKey() with reudceByKey(), I am a pyspark and python newbie and I am having a hard time figuring out the lambda function for the reduceByKey() operation. Here is the code dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0],x)).groupByKey(25).take(2) Here

Does row_number over a window cause a shuffle?

2018-08-03 Thread Jayesh Lalwani
I have some code that adds a column that contains a row_number over a window. It looks somewhat like this val sortColumns: List[Column] = r.sortFields.map(sf => sf.map(col(_))).getOrElse(List(col(s"defaultSortCol"))) val partitionWindow = Window.partitionBy(s"groupByCol") val window = partitionWin

Re: Machine Learning with window data

2018-08-03 Thread Robb Greathouse
I keep unsubscribing from this list; but continue to receive emails. On Fri, Aug 3, 2018 at 4:01 AM Christiaan Ras wrote: > Hi, > > > > I have a use case where I like to analyze windows of sensordata. > > Currently I have a working case where I use Structured Streaming to > process real-time str

How does readStream() and writeStream() work?

2018-08-03 Thread dddaaa
Hi I'm wondering how does readStream() and writeStream() work internally Lets take a simple example: df = spark.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", kafka_brokers) \ .option("subscribe", kafka_topic) \ .l

Machine Learning with window data

2018-08-03 Thread Christiaan Ras
Hi, I have a use case where I like to analyze windows of sensordata. Currently I have a working case where I use Structured Streaming to process real-time streams of sensordata. Now I like to analyse windows of sensordata and use classification to predict the class of a whole window. For instanc