Thanks for your interest in Apache Spark Structured Streaming! There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ). Also I think the visualizations based on metrics output by the StreamingQueryListener <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html> are still being rolled out, but should be available everywhere soon.
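In case it is useful, wiring up that listener yourself in vanilla Spark is pretty simple. Here is a minimal sketch that just prints the per-trigger metrics instead of pushing them to a chart (the app name is made up, and a real UI would forward these numbers somewhere a browser can read them):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    val spark = SparkSession.builder.appName("listener-sketch").getOrCreate()

    // Register a listener; onQueryProgress fires once per completed trigger.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: ${event.id}")

      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // p.json carries the full progress report; fields like these are the
        // kind of numbers the demo's charts were drawn from.
        println(s"${p.name}: ${p.numInputRows} rows, ${p.inputRowsPerSecond} rows/sec")
      }

      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: ${event.id}")
    })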
First, I set two options to make sure that files were read one at a time, thus allowing us to see incremental results:

    spark.readStream
      .option("maxFilesPerTrigger", "1")
      .option("latestFirst", "true")
      ...

There is more detail on how these options work in this post
<https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html>.

Regarding the continually updating result of a streaming query using display(df) for streaming DataFrames (i.e. ones created with spark.readStream): that has worked in Databricks since Spark 2.1. The longer-form example we published requires you to rerun the count at the end of the notebook to see it change because that is not a streaming query; it is a batch query over data that has been written out by another stream. I'd like to add the ability to run a streaming query over data that has been written out by the FileSink (tracked here: SPARK-19633 <https://issues.apache.org/jira/browse/SPARK-19633>).

In the demo, I started two different streaming queries:
 - one that reads from json / kafka => writes to parquet
 - one that reads from json / kafka => writes to the memory sink
   <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks>
   and pushes the latest answer to the JS running in a browser using the
   StreamingQueryListener
   <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html>

This is packaged up nicely in display(), but there is nothing stopping you from building something similar with vanilla Apache Spark. A rough sketch of those two queries is included below the quoted message.

Michael

On Wed, Feb 15, 2017 at 11:34 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Hey folks
>
> This one is mainly aimed at the Databricks folks. I have been trying to
> replicate the cloudtrail demo <https://www.youtube.com/watch?v=IJmFTXvUZgY>
> Michael did at Spark Summit. The code for it can be found here
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2070341989008532/3601578643761083/latest.html>
>
> My question is: how did you get the results to be displayed and updated
> continuously in real time?
>
> I am also using Databricks to duplicate it, but I noticed the code link
> mentions:
>
> "If you count the number of rows in the table, you should find the value
> increasing over time. Run the following every few minutes."
>
> This leads me to believe that the version of Databricks that Michael was
> using for the demo is still not released, or at least the functionality to
> display those changes in real time isn't.
>
> Is this the case? Or am I completely wrong?
>
> Can I display the results of a structured streaming query in real time
> using the Databricks "display" function?
>
>
> Regards
> Sam
>
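For reference, here is a rough sketch of those two queries in vanilla Spark. The paths, schema, and query name are made up for illustration; the real CloudTrail schema and locations are in the linked notebook:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("cloudtrail-sketch").getOrCreate()

    // Hypothetical schema and paths -- streaming file sources need an explicit schema.
    val schema = new StructType()
      .add("eventName", StringType)
      .add("eventTime", TimestampType)

    val input = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "1")   // process one file per trigger
      .option("latestFirst", "true")       // newest files first
      .json("/cloudtrail/raw")

    // Query 1: json => parquet (durable output you can query later as a batch table).
    val toParquet = input.writeStream
      .format("parquet")
      .option("path", "/cloudtrail/parquet")
      .option("checkpointLocation", "/cloudtrail/chk-parquet")
      .start()

    // Query 2: json => memory sink; the in-memory table holds the latest answer,
    // which is what display() / the listener-driven charts poll.
    val toMemory = input.groupBy("eventName").count()
      .writeStream
      .format("memory")
      .queryName("event_counts")
      .outputMode("complete")
      .start()

    // Poll the latest answer (roughly what display() automates for you).
    spark.sql("SELECT * FROM event_counts").show()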