It's been a few years (so this approach might be out of date) but here's
what I used for PySpark as part of this SO answer:
https://stackoverflow.com/questions/45717433/stop-structured-streaming-query-gracefully/65708677
```
# Helper method to stop a streaming query
def stop_stream_query(query, wait_
```
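The snippet is cut off above; a fuller sketch of that helper, reconstructed
along the lines of the linked SO answer (the parameter name `wait_secs` and
the exact status checks are assumptions), might look like:
```
import time

def stop_stream_query(query, wait_secs):
    """Stop a streaming query once it goes idle, then wait for termination."""
    while query.isActive:
        msg = query.status['message']
        data_avail = query.status['isDataAvailable']
        trigger_active = query.status['isTriggerActive']
        # Only stop when no data is pending and no trigger is mid-flight
        if not data_avail and not trigger_active and msg != "Initializing sources":
            print('Stopping query...')
            query.stop()
        time.sleep(0.5)

    # Give the stop a chance to fully take effect
    print('Awaiting termination...')
    query.awaitTermination(wait_secs)
```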
Coming in late.. but if I understand correctly, you can simply use the fact
that spark.read (or readStream) will also accept a directory argument. If
you provide a directory, Spark will automagically pull in all the files in
that directory.
"""Reading in multiple files example"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("/path/to/log_dir")  # path/format are illustrative; a directory pulls in every file
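The streaming case works the same way, with the caveat that file-based
readStream needs an explicit schema up front; a minimal sketch (the directory
path and schema are assumptions):
```
from pyspark.sql.types import StructType, StructField, StringType

# File-source streams can't infer a schema by default, so supply one
schema = StructType([StructField("line", StringType(), True)])
stream_df = (spark.readStream
             .schema(schema)
             .json("/path/to/log_dir"))  # picks up new files as they land in the directory
```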
Hi All,
My Google/SO searching is somehow failing on this. I simply want to compute
histograms for a column in a Spark DataFrame.
There are two SO hits on this question:
- https://stackoverflow.com/questions/39154325/pyspark-show-histogram-of-a-data-frame-column
- https://stackoverflow.com/questio
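For reference, a common answer (a sketch, not quoted from the thread; the
DataFrame `df` and column name "value" are assumptions) is to drop down to
the RDD API, which has a built-in histogram method:
```
# Histogram a DataFrame column via the RDD API; 10 evenly spaced buckets
bins, counts = df.select("value").rdd.flatMap(lambda row: row).histogram(10)
# bins holds the 11 bucket boundaries, counts the 10 per-bucket counts
```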
@vermanuraq
Great thanks, just what I needed.. I knew I was missing something simple.
Cheers,
-brian
"ts").cast("timestamp")
>
> On Wed, Aug 30, 2017 at 11:45 AM, Brian Wylie wrote:
>
>> Hi All,
>>
>> I'm using structured streaming in Spark 2.2.
>>
>> I'm using PySpark and I have data (from a Kafka publisher) where the
>>
# Then a writeStream later...
Okay, so all this code works fine (the 'dt' field has exactly what I
want)... but I'll be streaming in a lot of data, so here are the questions:
- Will creating a new DataFrame via withColumn basically kill performance?
- Should I move my UDF into the parsed_data.select(...) part?
- Can my UDF be done by spark.sql directly? (I tried to_timestamp but
  without luck)
Any suggestions/pointers are greatly appreciated.
-Brian Wylie
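The built-in route that the "ts").cast("timestamp") fragment above is
pointing at would look roughly like this; parsed_data and the 'dt'/'ts'
column names come from the thread, the rest is assumption:
```
from pyspark.sql.functions import col

# Replace the Python UDF with a built-in cast: epoch seconds -> timestamp.
# This stays entirely in the JVM/Catalyst, so there is no per-row Python overhead.
parsed_data = parsed_data.withColumn("dt", col("ts").cast("timestamp"))
```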
options.
Cheers and thanks again.
-Brian
On Wed, Aug 23, 2017 at 4:51 PM, Shixiong(Ryan) Zhu wrote:
> You can use `bin/pyspark --packages
> org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0`
> to start "pyspark". If you want to use "spark-submit", you also need to
> provide your Python file.
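Once pyspark is launched with that package, wiring up the Kafka source looks
roughly like this (the bootstrap server and topic name are placeholders):
```
# Kafka source for Structured Streaming; server/topic are placeholders
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "my_topic")
       .load())
# Kafka delivers key/value as binary; cast value to string before parsing
events = raw.selectExpr("CAST(value AS STRING) AS json_str")
```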
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:160)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:274)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:151)
    at org.apache.spark.launcher.Main.main(Main.java:86)
Anyway, all my code/versions/etc are in this notebook:
- https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_to_Spark.ipynb
I'd be tremendously appreciative if some super nice, smart person could
point me in the right direction :)
-Brian Wylie
t to
> read bro logs, rather than a python library. This is likely to have much
> better performance since we can do all of the parsing on the JVM without
> having to flow it through an external python process.
>
> On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie wrote:
>
>
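For context on that reply, the JVM-side idea could be approximated even
without a dedicated datasource, since Bro/Zeek logs are tab-separated text;
a rough sketch (the path is a placeholder, and note that the '#fields'
header carrying the real column names gets dropped by the comment filter):
```
# Read Bro/Zeek TSV logs directly with Spark, keeping parsing on the JVM.
# The '#'-prefixed metadata lines (including '#fields') are skipped, so
# columns come back as _c0, _c1, ... unless a schema is supplied.
bro_df = (spark.read
          .option("sep", "\t")
          .option("comment", "#")
          .csv("/path/to/bro/conn.log"))
```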
Hi All,
I've read the new information about Structured Streaming in Spark, looks
super great.
Resources that I've looked at:
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
- https://spark.ap