Hi,
We have a batch processing application that reads log files over multiple
days, does transformations and aggregations on them using Spark, and saves
various intermediate outputs to Parquet. These jobs take many hours to run.
This pipeline is deployed at many customer sites with some site-specific
configuration.
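
(For concreteness, a minimal sketch of this kind of batch job; the paths,
input format, and aggregation columns below are hypothetical placeholders,
not the actual pipeline code:)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-batch").getOrCreate()

# Read several days of log files (path and format assumed for illustration).
logs = spark.read.json("/data/logs/2017-08-*")

# One transformation/aggregation stage: total bytes per host per day.
daily = logs \
    .withColumn("day", F.to_date("timestamp")) \
    .groupBy("day", "host") \
    .agg(F.sum("bytes").alias("total_bytes"))

# Persist an intermediate output to Parquet for downstream stages.
daily.write.mode("overwrite").parquet("/data/intermediate/daily_bytes")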
[fragment of a truncated message; the tail of a streaming query's progress
output, showing the console sink:]
{u'sink': {u'description':
u'org.apache.spark.sql.execution.streaming.ConsoleSink@7e4050cd'}}
On Mon, Aug 14, 2017 at 4:55 PM, Tathagata Das wrote:
> In append mode, the aggregation outputs a row only when the watermark has
> been crossed and the corresponding aggregate is final, that is, it is
> guaranteed never to be updated again.
Hi,
I am running Spark 2.2 and trying out structured streaming. I have the
following code:
from pyspark.sql import functions as F

df = frame \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(F.window("timestamp", "1 day"), *groupby_cols) \
    .agg(F.sum("bytes"))
query = frame.writeSt
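
(The last line above is cut off. A minimal sketch of one way to complete it,
assuming the intent is to stream the aggregated df, not the raw frame, to the
console sink in append mode, consistent with the ConsoleSink and append-mode
discussion earlier in the thread:)

query = df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
query.awaitTermination()

Note that in append mode with a 1-day window and a 1-minute watermark, the
row for a window is emitted only after an event arrives with a timestamp
more than a minute past the end of that window, so no output appears for a
given day until events from the following day start coming in.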
Hi,
We've built a batch application on Spark 1.6.1. I'm looking into how to run
the same code as a streaming (DStream-based) application, using pyspark.
In the batch application, we have a sequence of transforms that read from
file, do DataFrame operations, then write to file. I was hoping to reuse
those transforms largely unchanged in the streaming version.
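
(A minimal sketch of one common way to do this on Spark 1.6 with DStreams:
convert each micro-batch RDD to a DataFrame inside foreachRDD and reuse the
batch transforms. The source directory, input format, and the process()
function are hypothetical placeholders:)

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-port")
ssc = StreamingContext(sc, 60)  # 60-second micro-batches
sqlContext = SQLContext(sc)

def process(df):
    # Placeholder for the existing batch DataFrame transforms.
    return df

def handle_batch(time, rdd):
    if not rdd.isEmpty():
        # Turn the micro-batch into a DataFrame, then reuse the batch code.
        df = sqlContext.read.json(rdd)  # or createDataFrame(rdd, schema)
        process(df).write.mode("append").parquet("/data/out/stream")

lines = ssc.textFileStream("/data/incoming")
lines.foreachRDD(handle_batch)

ssc.start()
ssc.awaitTermination()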