Thanks for your interest in Apache Spark Structured Streaming! There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ). Also I think the visualizations based on metrics output by the StreamingQueryListener <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html> are still being rolled out, but should be available everywhere soon.
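In case it is useful, wiring up that listener yourself in vanilla Spark is pretty simple. Here is a minimal sketch that just prints the per-trigger metrics instead of pushing them to a chart (the app name is made up, and a real UI would forward these numbers somewhere a browser can read them):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    val spark = SparkSession.builder.appName("listener-sketch").getOrCreate()

    // Register a listener; onQueryProgress fires once per completed trigger.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: ${event.id}")

      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // p.json carries the full progress report; fields like these are the
        // kind of numbers the demo's charts were drawn from.
        println(s"${p.name}: ${p.numInputRows} rows, ${p.inputRowsPerSecond} rows/sec")
      }

      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: ${event.id}")
    })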
First, I set two options to make sure that files were read one at a time, thus allowing us to see incremental results:

    spark.readStream
      .option("maxFilesPerTrigger", "1")
      .option("latestFirst", "true")
      ...

There is more detail on how these options work in this post
<https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html>.

Regarding the continually updating result of a streaming query using display(df) for streaming DataFrames (i.e. ones created with spark.readStream): that has worked in Databricks since Spark 2.1. The longer-form example we published requires you to rerun the count at the end of the notebook to see it change because that is not a streaming query; it is a batch query over data that has been written out by another stream. I'd like to add the ability to run a streaming query over data that has been written out by the FileSink (tracked here: SPARK-19633 <https://issues.apache.org/jira/browse/SPARK-19633>).

In the demo, I started two different streaming queries:
 - one that reads from json / kafka => writes to parquet
 - one that reads from json / kafka => writes to the memory sink
   <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks>
   and pushes the latest answer to the JS running in a browser using the
   StreamingQueryListener
   <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html>

This is packaged up nicely in display(), but there is nothing stopping you from building something similar with vanilla Apache Spark. A rough sketch of those two queries is included below the quoted message.

Michael

On Wed, Feb 15, 2017 at 11:34 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hey folks
>
> This one is mainly aimed at the Databricks folks. I have been trying to
> replicate the cloudtrail demo <https://www.youtube.com/watch?v=IJmFTXvUZgY>
> Michael did at Spark Summit. The code for it can be found here
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2070341989008532/3601578643761083/latest.html>
>
> My question is: how did you get the results to be displayed and updated
> continuously in real time?
>
> I am also using Databricks to duplicate it, but I noticed the code link
> mentions:
>
> "If you count the number of rows in the table, you should find the value
> increasing over time. Run the following every few minutes."
>
> This leads me to believe that the version of Databricks that Michael was
> using for the demo is still not released, or at least the functionality to
> display those changes in real time isn't.
>
> Is this the case? Or am I completely wrong?
>
> Can I display the results of a structured streaming query in real time
> using the Databricks "display" function?
>
>
> Regards
> Sam
>
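For reference, here is a rough sketch of those two queries in vanilla Spark. The paths, schema, and query name are made up for illustration; the real CloudTrail schema and locations are in the linked notebook:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("cloudtrail-sketch").getOrCreate()

    // Hypothetical schema and paths -- streaming file sources need an explicit schema.
    val schema = new StructType()
      .add("eventName", StringType)
      .add("eventTime", TimestampType)

    val input = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "1")   // process one file per trigger
      .option("latestFirst", "true")       // newest files first
      .json("/cloudtrail/raw")

    // Query 1: json => parquet (durable output you can query later as a batch table).
    val toParquet = input.writeStream
      .format("parquet")
      .option("path", "/cloudtrail/parquet")
      .option("checkpointLocation", "/cloudtrail/chk-parquet")
      .start()

    // Query 2: json => memory sink; the in-memory table holds the latest answer,
    // which is what display() / the listener-driven charts poll.
    val toMemory = input.groupBy("eventName").count()
      .writeStream
      .format("memory")
      .queryName("event_counts")
      .outputMode("complete")
      .start()

    // Poll the latest answer (roughly what display() automates for you).
    spark.sql("SELECT * FROM event_counts").show()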