Recompute Spark outputs intelligently

2017-12-15 Thread Ashwin Raju
Hi, We have a batch processing application that reads log files over multiple days, does transformations and aggregations on them using Spark, and saves various intermediate outputs to Parquet. These jobs take many hours to run. This pipeline is deployed at many customer sites with some site speci…
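One common way to make such recomputation incremental is to partition the intermediate Parquet output by date and rebuild only the dates whose inputs changed. A minimal sketch of that pattern in pyspark; the paths, column names, and list of changed dates are hypothetical, not from the original thread:

    # Rebuild only the date partitions whose input logs changed, instead
    # of recomputing the whole multi-day range. Paths and columns are
    # assumptions for illustration.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("incremental-recompute").getOrCreate()

    changed_dates = ["2017-12-01", "2017-12-02"]  # dates whose inputs changed

    for d in changed_dates:
        logs = spark.read.json("/data/logs/date=%s" % d)
        daily = (logs
                 .groupBy("customer_id")
                 .agg(F.sum("bytes").alias("total_bytes")))
        # Overwrite only this date's partition of the intermediate output;
        # unchanged dates are left untouched.
        daily.write.mode("overwrite").parquet("/data/agg/date=%s" % d)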

Re: Spark 2.2 streaming with append mode: empty output

2017-08-15 Thread Ashwin Raju
x27;: {u'description': u'org.apache.spark.sql.execution.streaming.ConsoleSink@7e4050cd'}} On Mon, Aug 14, 2017 at 4:55 PM, Tathagata Das wrote: > In append mode, the aggregation outputs a row only when the watermark has > been crossed and the corresponding aggregate is

Spark 2.2 streaming with append mode: empty output

2017-08-14 Thread Ashwin Raju
Hi, I am running Spark 2.2 and trying out structured streaming. I have the following code:

    from pyspark.sql import functions as F

    df = frame \
        .withWatermark("timestamp", "1 minute") \
        .groupby(F.window("timestamp", "1 day"), *groupby_cols) \
        .agg(F.sum('bytes'))

    query = frame.writeSt…
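The preview cuts off at the writeStream call. Notably, the truncated line appears to start the stream from frame (the raw input) rather than df (the aggregate), which by itself would bypass the watermarked aggregation. A hypothetical completion of the likely intent, with the sink chosen to match the ConsoleSink seen in the status dump above (this is an assumption, not the poster's actual code):

    # Hypothetical completion of the truncated query. Starting from the
    # aggregated `df` (not the raw `frame`) is what makes the watermark
    # and append-mode semantics apply.
    query = (df.writeStream
             .outputMode("append")
             .format("console")
             .start())
    query.awaitTermination()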

Reusing dataframes for streaming (spark 1.6)

2017-08-08 Thread Ashwin Raju
Hi, We've built a batch application on Spark 1.6.1. I'm looking into how to run the same code as a streaming (DStream based) application. This is using pyspark. In the batch application, we have a sequence of transforms that read from file, do dataframe operations, then write to file. I was hopin