PySpark Connect client hangs after query completes

2024-10-24 Thread peay
Hello, I have observed that for long queries (say > 1 hour) from pyspark with Spark Connect (3.5.2), the client often gets stuck: even when the query completes successfully, the client stays waiting in _execute_and_fetch_as_iterator (see traceback below). The client session still shows as active…
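A minimal sketch of the pattern where the hang shows up, assuming a hypothetical Spark Connect endpoint and table name:

    from pyspark.sql import SparkSession

    # Hypothetical Connect endpoint and table, for illustration only.
    spark = SparkSession.builder.remote("sc://connect-host:15002").getOrCreate()

    # For long-running queries (> 1 hour), this call sometimes never returns
    # even after the server reports the query as finished: the client keeps
    # waiting inside _execute_and_fetch_as_iterator.
    rows = spark.sql("SELECT * FROM big_table").collect()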

Spark lists paths after `write` - how to avoid refreshing the file index?

2019-02-14 Thread peay
Hello, I have a piece of code that looks roughly like this: df = spark.read.parquet("s3://bucket/data.parquet/name=A", "s3://bucket/data.parquet/name=B") df_out = df. # Do stuff to transform df df_out.write.partitionBy("name").parquet("s3://bucket/data.parquet") I specify explicit paths…
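The read/write pattern from the message, sketched with the basePath reader option added so the partition column is still recovered when only a subset of partition directories is read (paths are the hypothetical ones from the message):

    # basePath tells Spark where the partitioned dataset is rooted, so the
    # partition column `name` survives reading explicit subpaths.
    df = (spark.read
          .option("basePath", "s3://bucket/data.parquet")
          .parquet("s3://bucket/data.parquet/name=A",
                   "s3://bucket/data.parquet/name=B"))

    df_out = df  # transformations elided in the original message

    (df_out.write
        .partitionBy("name")
        .mode("append")  # assumption: new partitions are appended in place
        .parquet("s3://bucket/data.parquet"))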

Re: Watermarking without aggregation with Structured Streaming

2018-09-30 Thread peay
…running code though and would like to know if someone can shed more light on this. Regards, Chandan. (On Sat, Sep 22, 2018 at 7:43 PM, peay wrote: "Hello, I am trying to use watermarking without…")

Watermarking without aggregation with Structured Streaming

2018-09-22 Thread peay
Hello, I am trying to use watermarking without aggregation, to filter out records that are just too late, instead of appending them to the output. My understanding is that aggregation is required for `withWatermark` to have any effect. Is that correct? I am looking for something along the lines of…
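A sketch of the two options in play, assuming a rate source and illustrative column names; note that withWatermark on its own only takes effect through stateful operators (aggregations, stream-stream joins, dropDuplicates):

    from pyspark.sql import functions as F

    events = (spark.readStream.format("rate").load()
              .withColumnRenamed("timestamp", "event_time"))

    # Declaring a watermark alone does not drop late rows without a
    # stateful operator downstream:
    marked = events.withWatermark("event_time", "10 minutes")

    # A rough stand-in without aggregation: filter against processing time.
    # This is not a true watermark (no per-trigger event-time tracking).
    filtered = marked.filter(
        F.col("event_time") > F.current_timestamp() - F.expr("INTERVAL 10 MINUTES"))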

Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-18 Thread peay
Hello, I run a Spark cluster on YARN, and we have a bunch of client-mode applications we use for interactive work. Whenever we start one of these, an application master container is started. My understanding is that this is mostly an empty shell, used to request further containers or get status…
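The two Spark-on-YARN settings that size the client-mode AM (real configuration keys; the values shown are just the defaults, for illustration):

    from pyspark.sql import SparkSession

    # In client mode the AM is essentially the ExecutorLauncher shell the
    # message describes; these settings control what it requests from YARN.
    spark = (SparkSession.builder
             .master("yarn")
             .config("spark.yarn.am.cores", "1")      # vCores for the AM container
             .config("spark.yarn.am.memory", "512m")  # memory for the AM container
             .getOrCreate())

Whether that vCore is actually enforced depends on the YARN scheduler's resource calculator (the DefaultResourceCalculator, for instance, only accounts for memory).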

Saving dataframes with partitionBy: append partitions, overwrite within each

2017-09-29 Thread peay
Hello, I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet") to write a dataset while splitting by day. I would like to run a Spark job to process, e.g., a month: dataset.parquet/day=2017-01-01/... ... and then run another Spark job to add another month using the same…
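For Spark 2.3 and later (after this thread was written), dynamic partition overwrite gives exactly this append-partitions, overwrite-within-each behavior; a sketch:

    # Only partitions present in the incoming data are replaced; existing
    # day= partitions not in data_frame are left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (data_frame.write
        .mode("overwrite")
        .partitionBy("day")
        .save("dataset.parquet"))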

Configuration for unit testing and sql.shuffle.partitions

2017-09-12 Thread peay
Hello, I am running unit tests with Spark DataFrames, and I am looking for configuration tweaks that would make tests faster. Usually, I use a local[2] or local[4] master. Something that has been bothering me is that most of my stages end up using 200 partitions, independently of whether I repartition…
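A typical test-session setup along those lines; the values are illustrative:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[4]")
             # Default is 200; a handful of partitions is plenty for tiny
             # test datasets and avoids per-task scheduling overhead.
             .config("spark.sql.shuffle.partitions", "4")
             .config("spark.ui.enabled", "false")  # skip the web UI in tests
             .getOrCreate())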

Re: map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-04 Thread peay
Re: map/foreachRDD equivalent for pyspark Structured Streaming. Local Time: 3 May 2017 12:05 PM. UTC Time: 3 May 2017 10:05. From: tathagata.das1...@gmail.com. To: peay, user@spark.apache.org. You can apply any kind of aggregation on windows. There are some built-in aggregations (e.g. sum and count) as well as…
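A sketch of the windowed aggregation being described, assuming a streaming DataFrame `events` with an event-time column `ts` and a numeric column `value`:

    from pyspark.sql import functions as F

    # Built-in aggregations applied per 10-minute event-time window.
    counts = (events
              .groupBy(F.window("ts", "10 minutes"))
              .agg(F.count("*").alias("n"),
                   F.sum("value").alias("total")))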

map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-03 Thread peay
Hello, I would like to get started on Spark Streaming with a simple window. I've got some existing Spark code that takes a dataframe, and outputs a dataframe. This includes various joins and operations that are not supported by structured streaming yet. I am looking to essentially map/apply this…
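In later Spark versions (2.4+, after this thread), foreachBatch is the closest equivalent: it hands each micro-batch to a function as a plain DataFrame, so existing batch code, including joins not yet supported in streaming, can be applied as-is. A sketch with hypothetical names (stream_df, my_existing_transform):

    def process_batch(batch_df, batch_id):
        out = my_existing_transform(batch_df)            # hypothetical batch function
        out.write.mode("append").parquet("out.parquet")  # illustrative sink

    query = (stream_df.writeStream
             .foreachBatch(process_batch)
             .start())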