Fw: Reg: Supporting inheritance for datatypes in pyspark

2025-05-02 Thread Vaibhaw
Hi! Currently when creating a DataFrame, we cannot use a class that extends StructType as the schema. This is due to the type_verifier using `type(obj) == StructType`. If this check is replaced with `isinstance`, the extended types can be supported. Here is a small test that fails currently: > im
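A minimal reproduction sketch of the reported limitation (hypothetical class name, local session; per the report this fails on current versions because the verifier compares types exactly):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# A schema class that extends StructType, as described in the report.
class ExtendedSchema(StructType):
    def __init__(self):
        super().__init__([StructField("name", StringType(), True)])

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Per the report, this fails because the verifier checks
# type(obj) == StructType rather than isinstance(obj, StructType).
df = spark.createDataFrame([("a",)], schema=ExtendedSchema())
df.show()
```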

Comparison between union and stack in pyspark

2025-04-16 Thread Dhruv Singla
Hey Everyone, In Spark, suppose I have the following df. ``` df = spark.createDataFrame([['A', 'A06', 'B', 'B02', '202412'], ['A', 'A04', 'B', 'B03', '202501'], ['B', 'B01', 'C', 'C02', '202411'], ['B', 'B03', 'A', 'A06', '202502']], 'entity_code: string, entity_rollup: string, target_entity_code:

Parallelism for glue pyspark jobs

2025-04-15 Thread Perez
Hi Everyone, I am facing one issue. The problem is explained in detail in the below SO post. Any suggestions would be appreciated. https://stackoverflow.com/questions/79574599/unable-to-configure-the-exact-number-of-dpus-for-the-glue-pyspark-job Thanks

The use of Python ParamSpec in PySpark

2025-04-08 Thread Rafał Wojdyła
Hi, I wanted to highlight the usefulness of the now-closed (unmerged) PR `[SPARK-49008][PYTHON] Use ParamSpec to propagate func signature in transform` - https://github.com/apache/spark/pull/47493. This change would add type-checking for the DataFrame `transform` method in PySpark using Python
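A rough sketch of the idea behind that PR (illustrative only, not the actual patch): ParamSpec lets a `transform`-style wrapper expose the extra positional/keyword parameters of the function it forwards to, so a type checker can validate the call site. The `DF` class below stands in for pyspark.sql.DataFrame.

```python
from typing import Callable, Concatenate, ParamSpec  # Python 3.10+

P = ParamSpec("P")

# "DF" is a stand-in for pyspark.sql.DataFrame; this is not the PySpark source.
class DF:
    def transform(self, func: Callable[Concatenate["DF", P], "DF"],
                  *args: P.args, **kwargs: P.kwargs) -> "DF":
        # The ParamSpec captures the signature of `func` beyond its first argument.
        return func(self, *args, **kwargs)

def add_column(df: DF, name: str, value: int) -> DF:
    return df  # placeholder transformation

df = DF()
df.transform(add_column, "n", 1)      # accepted by a type checker
# df.transform(add_column, "n", "x")  # flagged: 'value' expects int
```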

Re: Multiple CVE issues in apache/spark-py:3.4.0 + Pyspark 3.4.0

2025-03-15 Thread Soumasish
Sat, Mar 15, 2025 at 1:17 AM Mohammad, Ejas Ali wrote: > Hi Spark Community, > > > > I am using the official Docker image `apache/spark-py:v3.4.0` and > installing `pyspark==3.4.0` on top of it. However, I have encountered > multiple security vulnerabilities related to out

Multiple CVE issues in apache/spark-py:3.4.0 + Pyspark 3.4.0

2025-03-15 Thread Mohammad, Ejas Ali
Hi Spark Community, I am using the official Docker image `apache/spark-py:v3.4.0` and installing `pyspark==3.4.0` on top of it. However, I have encountered multiple security vulnerabilities related to outdated dependencies in the base image. Issues: 1. Security Concerns: - Prisma scan

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Bjørn Jørgensen
ot simpler and easier to understand, but >> I'm looking if there is already a function or a spark built-in for this. >> Thanks for the help though. >> >> On Sun, Mar 9, 2025 at 11:42 PM Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> import

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
but > I'm looking if there is already a function or a spark built-in for this. > Thanks for the help though. > > On Sun, Mar 9, 2025 at 11:42 PM Mich Talebzadeh > wrote: > >> import pyspark >> from pyspark import SparkConf, SparkContext >> from pyspark

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Dhruv Singla
Yes, this is it. I want to form this using a simple short command. The way I mentioned is a lengthy one. On Sun, Mar 9, 2025 at 10:16 PM Mich Talebzadeh wrote: > Is this what you are expecting? > > root > |-- code: integer (nullable = true) > |-- AB_amnt: long (nullable = true) > |-- AA_amnt:

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Dhruv Singla
zadeh wrote: > import pyspark > from pyspark import SparkConf, SparkContext > from pyspark.sql import SparkSession > from pyspark.sql import SQLContext > from pyspark.sql.functions import struct > from pyspark.sql import functions as F > from pyspark.sql.types import StructType

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
import pyspark from pyspark import SparkConf, SparkContext from pyspark.sql import SparkSession from pyspark.sql import SQLContext from pyspark.sql.functions import struct from pyspark.sql import functions as F from pyspark.sql.types import StructType, StructField, IntegerType, StringType

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
Is this what you are expecting? root |-- code: integer (nullable = true) |-- AB_amnt: long (nullable = true) |-- AA_amnt: long (nullable = true) |-- AC_amnt: long (nullable = true) |-- load_date: date (nullable = true) ++---+---+---+--+ |code|AB_amnt|AA_amnt|AC_amnt|l

Apply pivot only on some columns in pyspark

2025-03-09 Thread Dhruv Singla
Hi Everyone, hope you are doing well. I have the following dataframe. df = spark.createDataFrame( [ [1, 'AB', 12, '2022-01-01'] , [1, 'AA', 22, '2022-01-10'] , [1, 'AC', 11, '2022-01-11'] , [2, 'AB', 22, '2022-02-01'] , [2, 'AA', 28, '2022-02-10']
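For reference, a hedged sketch of a groupBy/pivot over this sample data (the column names are assumptions inferred from the rows and the expected output shown earlier in the thread, not the poster's actual names):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        [1, "AB", 12, "2022-01-01"],
        [1, "AA", 22, "2022-01-10"],
        [1, "AC", 11, "2022-01-11"],
        [2, "AB", 22, "2022-02-01"],
        [2, "AA", 28, "2022-02-10"],
    ],
    ["code", "type", "amnt", "load_date"],  # assumed column names
)

# Pivot the 'type' column into AB/AA/AC amount columns, grouping by 'code'
# (one possible reading of the question).
pivoted = (
    df.groupBy("code")
      .pivot("type")
      .agg(F.first("amnt"))
      .withColumnRenamed("AB", "AB_amnt")
      .withColumnRenamed("AA", "AA_amnt")
      .withColumnRenamed("AC", "AC_amnt")
)
pivoted.show()
```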

Re: Incorrect Results and SIGSEGV on Read with Iceberg + PySpark + Nessie

2025-02-06 Thread Aaron Grubb
Someone just replied to the bug, it was already known about and will be fixed in the upcoming Iceberg 1.7.2 release. On Thu, 2025-02-06 at 09:35 +, Aaron Grubb wrote: > Hi all, > > I filed a bug with the Iceberg team [1] but I'm not sure that it's 100% > specific to Iceberg (I assume it is

Incorrect Results and SIGSEGV on Read with Iceberg + PySpark + Nessie

2025-02-06 Thread Aaron Grubb
Hi all, I filed a bug with the Iceberg team [1] but I'm not sure that it's 100% specific to Iceberg (I assume it is as data in the related parquet file is correct and session.read.parquet always returns correct results) so I figured I would flag it here in case anyone has some insight. Currently

Re: AWS Glue PySpark Job

2025-01-04 Thread Perez
Hi Team, I would appreciate any help with this. https://stackoverflow.com/questions/79324390/aws-glue-pyspark-job-is-not-ending/79324917#79324917 On Fri, Jan 3, 2025 at 3:53 PM Perez wrote: > Hi Team, > > I would need your help in understanding the below problem. &g

AWS Glue PySpark Job

2025-01-03 Thread Perez
Hi Team, I would need your help in understanding the below problem. https://stackoverflow.com/questions/79324390/aws-glue-pyspark-job-is-not-ending/79324917#79324917

StreamingQueryListener in PySpark lagging behind

2024-10-31 Thread Andrzej Zera
Hello, I have an issue with StreamingQueryListener in my Structured Streaming application written in PySpark. I'm running around 8 queries, and each query runs every 5-20 seconds. In total, I have around ~40 microbatch executions per minute. I set up Python's StreamingQueryListener
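For context, a minimal sketch of a Python listener registration (assuming PySpark 3.4+ where Python listeners are supported, and an existing SparkSession `spark`); the original application's queries and metric handling are not shown:

```python
from pyspark.sql.streaming import StreamingQueryListener

class ProgressLogger(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"query started: {event.id}")

    def onQueryProgress(self, event):
        # event.progress describes the last completed micro-batch
        print(f"{event.progress.name}: batch {event.progress.batchId}")

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}")

spark.streams.addListener(ProgressLogger())
```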

PySpark Connect client hangs after query completes

2024-10-24 Thread peay
Hello, I have observed that for long queries (say > 1 hour) from pyspark with Spark Connect (3.5.2), the client often gets stuck: even when the query completes successfully, the client stays waiting in _execute_and_fetch_as_iterator​ (see traceback below). The client session still shows

Re: unable to deploy Pyspark application on GKE, Spark installed using bitnami helm chart

2024-08-27 Thread Mat Schaffer
invoke(Gateway.java:238) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.base/java.lang.

unable to deploy Pyspark application on GKE, Spark installed using bitnami helm chart

2024-08-26 Thread karan alang
GKE/k8s (Note - need to install on both GKE and On-prem kubernetes) 2. How to submit pyspark jobs on Spark cluster deployed on GKE (eg. Do I need to create a K8s deployment for each spark job ?) tia ! Here is the stackoverflow link : https://stackoverflow.com/questions/78915988/unable-to

Issue with pyspark : Add custom shutdown hook

2024-08-16 Thread aarushi agarwal
Hi Team, I am trying to add a shutdown hook with the pyspark script using `*atexit*`. However, it seems like whenever I send a SIGTERM to the spark-submit process, it triggers the JVM shutdown hook first which results in terminating the spark context. I didn't understand in what order the p

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Meena Rajani
mbda x: x * x) 10 # > Persist the DataFrame in memory 11 > #squared_rdd.persist(StorageLevel.MEMORY_ONLY) 12 # Collect the results > into a list---> 13 result = squared_rdd.collect() 15 # Print the result > 16 print(result) > > File > C:\spark\spark-3.5

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike, This appears to be an access issue on Windows + Python. Can you try setting up the PYTHON_PATH environment variable as described in this stackoverflow post https://stackoverflow.com/questions/60414394/createprocess-error-5-access-is-denied-pyspark - Sadha On Mon, Jul 29, 2024 at 3

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread mike Jadoo
ult 16 print(result) File C:\spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\rdd.py:1833, in RDD.collect(self) 1831 with SCCallSiteSync(self.context): 1832 assert self.ctx._jvm is not None-> 1833 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike, I'm not sure about the minimum requirements of a machine for running Spark. But to run some Pyspark scripts (and Jupyter notebooks) on a local machine, I found the following steps are the easiest. I installed Amazon Corretto and updated the JAVA_HOME variable as instructed here

Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread mike Jadoo
Hello, I am trying to run Pyspark on my computer without success. I followed several different sets of directions from online sources, and it appears that I need to get a faster computer. I wanted to ask what are some recommendations for computer specifications to run PySpark (Apache Spark). Any help

Assistance needed with PySpark Streaming for retrieving past messages.

2024-07-02 Thread Mai Trang
Dear Spark teams. I hope this email finds you well. I am currently working with PySpark Streaming and need to implement a feature where, when a message matches a certain filter condition, I can retrieve past messages some small amount of time *before* that message arrives. Is there a more

Pyspark DataFrame.drop wrong type hints

2024-06-21 Thread Oliver Beagley
Hi there, I believe I have found an error with the type hints for `DataFrame.drop` in pyspark. The first overload at https://github.com/apache/spark/blob/0bc38acc615ad411a97779c6a1ff43d4391c0c3d/python/pyspark/sql/dataframe.py#L5559-L5568 isn’t as a `*args` argument, and therefore doesn’t allow

Re: Unable to load MongoDB atlas data via PySpark because of BsonString error

2024-06-09 Thread Perez
Hi Team, Any help in this matter would be greatly appreciated. TIA On Sun, Jun 9, 2024 at 11:26 AM Perez wrote: > Hi Team, > > this is the problem > https://stackoverflow.com/questions/78593858/unable-to-load-mongodb-atlas-data-via-pyspark-jdbc-in-glue > > I can't go

Unable to load MongoDB atlas data via PySpark because of BsonString error

2024-06-08 Thread Perez
Hi Team, this is the problem https://stackoverflow.com/questions/78593858/unable-to-load-mongodb-atlas-data-via-pyspark-jdbc-in-glue I can't go ahead with *StructType* approach since my input record is huge and if the underlying attributes are added or removed my code might fail. I can

Tox and Pyspark

2024-05-28 Thread Perez
Hi Team, I need help with this https://stackoverflow.com/questions/78547676/tox-with-pyspark

Re: pyspark dataframe join with two different data type

2024-05-16 Thread Karthick Nk
Hi All, I have tried to get the same result with pyspark and with a SQL query by creating a tempView. I was able to achieve it with SQL, but I need to do it in the pyspark code itself. Could you help on this? incoming_data = [["a"], ["b"], ["d"]] column_names = ["c

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk
Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1; if I explode df_1 I get only the exploded column. The result should contain all the columns from df_1 with distinct rows, like below. Results in *df:* +---+ |column1| +---+ | a

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
You can use a combination of explode and distinct before joining. from pyspark.sql import SparkSession from pyspark.sql.functions import explode # Create a SparkSession spark = SparkSession.builder \ .appName("JoinExample") \ .getOrCreate() sc = spark.sparkContext # Set the log level to

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Karthick Nk
Hi All, Does anyone have an idea or suggestion of an alternate way to achieve this scenario? Thanks. On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote: > Right now, with the structure of your data, it isn't possible. > > The rows aren't duplicates of each other. "a" and "b" both exist in t

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values of

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Karthick Nk
Hi Mich, Thanks for the solution, but I am getting duplicate results when using array_contains. I have explained the scenario below; could you help me with how we can achieve this? I have tried different ways but could not achieve it. For example data = [ ["a"], ["b"], ["d"], ] column_na

Traceback is missing content in pyspark when invoked with UDF

2024-05-01 Thread Indivar Mishra
error > Traceback (most recent call last): > > File "temp.py", line 28, in > > fun() > > File "temp.py", line 25, in fun > > df.select(col("Seqno"), >> errror_func_udf(col("Name")).alias("Name"

Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
and they cannot focus for long hours. I let him explore Python on his >> Windows 10 laptop and download it himself. In the following video Christian >> explains to his mother what he started to do just before going to bed. BTW, >> when he says 32M he means 32-bit. I leave it t

Re: Python for the kids and now PySpark

2024-04-27 Thread Farshid Ashouri
focus for long hours. I let him explore Python on his > Windows 10 laptop and download it himself. In the following video Christian > explains to his mother what he started to do just before going to bed. BTW, > when he says 32M he means 32-bit. I leave it to you to judge :) Now the > i

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Varun Shah
Hi @Mich Talebzadeh , community, Where can I find such insights on the Spark Architecture ? I found few sites below which did/does cover internals : 1. https://github.com/JerryLead/SparkInternals 2. https://books.japila.pl/apache-spark-internals/overview/ 3. https://stackoverflow.com/questions/3

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > > "I may need something like that for synthetic data for testing. Any way to > do that ?" > > Have a look at this. > > https://github.com/joke2k/faker > No I was not actually referring to data that can be faked. I want data to actually res

pyspark - Use Spark to generate a large dataset on the fly

2024-03-18 Thread Sreyan Chakravarty
Hi, I have a specific problem where I have to get the data from REST APIs and store it, and then do some transformations on it and then write to an RDBMS table. I am wondering if Spark will help in this regard. I am confused as to how I store the data while I actually acquire it on the driver

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are only performed when necessary, usually when an action is called. This lazy evaluation helps in optimizing the execution of Spark jobs by allowing Spark to optimize the execution plan and perform optimizations such as pipelin

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation Op

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here is what happens: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The role of lazy evaluation:* Spark applies
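A small illustration of the point (a sketch, assuming an existing SparkSession `spark`):

```python
# The Python list lives on the driver; createDataFrame only records it with a plan.
rows = [(i, i * i) for i in range(1000)]
df = spark.createDataFrame(rows, ["n", "n_squared"])

evens = df.filter("n % 2 = 0")   # transformation: builds the plan, moves no data
print(evens.count())             # action: the plan is executed on the executors
```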

pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand Spark Architecture. For DataFrames that are created from Python objects, i.e. that are *created in memory, where are they stored?* Take the following example: from pyspark.sql import Row import datetime courses = [ { 'course_id': 1, 'course_title': 'Masteri

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
d of the `get` method which is used for dictionaries. > > But weirdly, for query.lastProgress and query.recentProgress, they should > return StreamingQueryProgress but instead they returned a dict. So the > `get` method works there. > > I think PySpark should improve on this part. > > Mich Ta

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
, for query.lastProgress and query.recentProgress, they should return StreamingQueryProgress but instead they returned a dict. So the `get` method works there. I think PySpark should improve on this part. Mich Talebzadeh wrote on Mon, Mar 11, 2024 at 05:51: > Hi, > > Thank you for your advice > > This

Data ingestion into elastic failing using pyspark

2024-03-11 Thread Karthick Nk
Hi @all, I am using a pyspark program to write data into an elastic index using the upsert operation (sample code snippet below). def writeDataToES(final_df): write_options = { "es.nodes": elastic_host, "es.net.ssl": "false", "es.nodes.wan.only
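For reference, a hedged sketch of an upsert write with the elasticsearch-hadoop connector (option names taken from that connector's documentation; the host, index and id column are placeholders, not the poster's values):

```python
def write_data_to_es(final_df):
    write_options = {
        "es.nodes": "elastic-host",      # placeholder host
        "es.port": "9200",
        "es.net.ssl": "false",
        "es.nodes.wan.only": "true",
        "es.write.operation": "upsert",
        "es.mapping.id": "doc_id",       # column used as the document _id
    }
    (final_df.write
        .format("org.elasticsearch.spark.sql")
        .options(**write_options)
        .mode("append")
        .save("my_index"))               # placeholder index name
```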

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi, Thank you for your advice This is the amended code def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received #print(microbatc

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not 刘唯 wrote on Sun, Mar 10, 2024 at 22:04: > Have you tried using microbatch_data.get("processedRowsPerSecond")? > Camel case now snake case > > Mich Talebzadeh wrote on Sun, Mar 10, 2024 at 11:46: > >> >> There is a paper from Databricks on this subject >> >> >> https://www.databricks.com/blog/2022/05/27/how-to-

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")? Camel case now snake case Mich Talebzadeh wrote on Sun, Mar 10, 2024 at 11:46: > > There is a paper from Databricks on this subject > > > https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html > > But havi
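To make the suggestion concrete, a small helper sketch for use inside onQueryProgress (called as `log = rows_per_second(event.progress)`), hedged for the dict-vs-object behaviour discussed in this thread:

```python
def rows_per_second(progress):
    """Read processedRowsPerSecond from a query-progress object.

    Some accessors in this thread return a plain dict, others a
    StreamingQueryProgress object, so fall back between key and attribute access.
    """
    if isinstance(progress, dict):
        return progress.get("processedRowsPerSecond")
    return progress.processedRowsPerSecond
```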

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
There is a paper from Databricks on this subject https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html But having tested it, there seems to be a bug there that I reported to Databricks forum as well (in answer to a user question) I have come to a conclusion

Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
e.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334) >> at >> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404) >> at >> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377) >> at >> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48) >> at >> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192) >> >> My assumption is that its trying to look on my local machine for >> /data/hive/warehouse and failing because on the remote box I can see those >> folders. >> >> So the question is, if you're not backing it with hadoop or something do >> you have to mount the drive in the same place on the computer running the >> pyspark? Or am I missing a config option somewhere? >> >> Thanks! >> >

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay that was some caching issue. Now there is a shared mount point between the place the pyspark code is executed and the spark nodes it runs. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my as

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
educeCommitProtocol.scala:192) > > My assumption is that its trying to look on my local machine for > /data/hive/warehouse and failing because on the remote box I can see those > folders. > > So the question is, if you're not backing it with hadoop or something do > you have to mount the drive in the same place on the computer running the > pyspark? Or am I missing a config option somewhere? > > Thanks! >

Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
uceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192) My assumption is that its trying to look on my local machine for /data/hive/warehouse and failing because on the remote box I can see those folders. So the question is, if you're not backing it with hadoop or something do you have to mount the drive in the same place on the computer running the pyspark? Or am I missing a config option somewhere? Thanks!

Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format test.ft.txt.bz2.zip So it is a text file that is compressed by bz2 followed by zip. Now I would like to do all these operations in PySpark. In
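One hedged way to handle this nesting: Spark decompresses .bz2 text transparently but has no built-in zip codec, so unpack only the outer zip layer first (file name taken from the post; a local-filesystem sketch, not the thread's final answer):

```python
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Unpack the outer zip; the inner .bz2 can be read by Spark directly.
with zipfile.ZipFile("test.ft.txt.bz2.zip") as zf:
    zf.extractall(".")          # yields test.ft.txt.bz2

df = spark.read.text("test.ft.txt.bz2")
df.show(5, truncate=False)
```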

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want: how to join two DFs with a string column in one and an array of strings in the other, keeping only rows where the string is present in the array. from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import expr spark = SparkSession.bui
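A minimal sketch of the explode-and-distinct approach described in this thread (hypothetical column names and sample data, not the poster's actual schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One DF with a plain string column, one with an array-of-strings column.
df_str = spark.createDataFrame([("a",), ("b",), ("d",)], ["value"])
df_arr = spark.createDataFrame([(["a", "b"],), (["c"],)], ["values"])

# Explode the array side and de-duplicate before the join, so each string
# in any array appears once; then an inner join keeps only matching rows.
exploded = df_arr.select(F.explode("values").alias("value")).distinct()
result = df_str.join(exploded, on="value", how="inner")
result.show()
```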

pyspark dataframe join with two different data type

2024-02-29 Thread Karthick Nk
Hi All, I have two dataframes with the below structure, and I have to join them - the scenario is that the join column is a string in one dataframe and an array of strings in the other df, so we have to inner join the two dfs and get the data if the string value is present in any of the array of string va

Best option to process single kafka stream in parallel: PySpark Vs Dask

2024-01-11 Thread lab22
I am creating a setup to process packets from a single kafka topic in parallel. For example, I have 3 containers (let's take 4 cores) on one vm, and from 1 kafka topic stream I create 10 jobs depending on packet source. These packets have a small workload. 1. I can install dask in each containe

Re: Pyspark UDF as a data source for streaming

2024-01-08 Thread Mich Talebzadeh
ingestion and analytics. My use case revolves around a scenario where data is generated through REST API requests in real time with Pyspark. The Flask REST API efficiently captures and processes this data, saving it to a sink of your choice like a data warehouse or Kafka. HTH Mich Talebzadeh, Dad

Re: Pyspark UDF as a data source for streaming

2023-12-29 Thread Mich Talebzadeh
On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович > wrote: > >> Yes, it's actual data. >> >> >> >> Best regards, >> >> Stanislav Porotikov >> >> >> >> *From:* Mich Talebzadeh >> *Sent:* Wednesday, Dec

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
Hi Stanislav, On the Pyspark DF can you run the following, df.printSchema(), and send the output please HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

RE: Pyspark UDF as a data source for streaming

2023-12-28 Thread Поротиков Станислав Вячеславович
Ok. Thank you very much! Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Thursday, December 28, 2023 5:14 PM To: Hyukjin Kwon Cc: Поротиков Станислав Вячеславович ; user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming You can work around this issue by

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
ков Станислав Вячеславович > wrote: > >> Yes, it's actual data. >> >> >> >> Best regards, >> >> Stanislav Porotikov >> >> >> >> *From:* Mich Talebzadeh >> *Sent:* Wednesday, December 27, 2023 9:43 PM >> *Cc:* us

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
av Porotikov > > > > *From:* Mich Talebzadeh > *Sent:* Wednesday, December 27, 2023 9:43 PM > *Cc:* user@spark.apache.org > *Subject:* Re: Pyspark UDF as a data source for streaming > > > > Is this generated data actual data or you are testing the application? >

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Yes, it's actual data. Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 9:43 PM Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Is this generated data actual data or you are testing the application? Sounds like a

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
tikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 6:17 PM To: Поротиков Станислав Вячеславович Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Ok so you want to generate some random data and load it into Kafka on a regular interval and the

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
v Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 6:17 PM To: Поротиков Станислав Вячеславович Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Ok so you want to generate some random data and load it into Kafka on a regular interval and the

Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
e be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 27 Dec 2023 at 12:16, Поротиков Станислав Вячеславович wrote: > Hello! > > Is it possible to write pyspark UDF, generated data to streaming dataframe? > > I want to get some data from REST

Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Hello! Is it possible to write a pyspark UDF that generates data for a streaming dataframe? I want to get some data from REST API requests in real time and save this data to a dataframe, and then put it into Kafka. I can't figure out how to create a streaming dataframe from the generated data. I am n
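One hedged way to get generated data into a streaming DataFrame (not necessarily the approach settled on later in this thread): start from the built-in rate source and derive a column with a UDF standing in for the REST call. The console sink below is only for illustration; a Kafka sink would replace it in practice.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Stand-in for the real REST request; returns a fake payload per tick.
fetch_payload = F.udf(lambda i: f"payload-{i}", StringType())

stream = (
    spark.readStream.format("rate")        # built-in generator source
         .option("rowsPerSecond", 1)
         .load()
         .withColumn("payload", fetch_payload(F.col("value")))
)

query = stream.writeStream.format("console").start()
query.awaitTermination()
```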

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-11 Thread Михаил Кулаков
Hey Enrico, it does help to understand it, thanks for explaining. Regarding this comment > PySpark and Scala should behave identically here Is it ok that Scala and PySpark optimization work differently in this case? On Tue, 5 Dec 2023 at 20:08, Enrico Minack wrote: > Hi Michail,

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-05 Thread Enrico Minack
Hi Michail, with spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see how Spark optimizes the query plan. In PySpark, the plan is optimized into Project ...   +- CollectMetrics 2, [count(1) AS count(1)#200L]   +- LocalTableScan , [col1#125, col2#126L, col3#

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-04 Thread Enrico Minack
t("*")) df2.show() ++++----+ |col1|col2|col3|col4| +++++ +++++ o1.get Map[String,Any] = Map(count(1) -> 2) o2.get Map[String,Any] = Map(count(1) -> 0) Pyspark and Scala should behave identically here. I will investigate. Cheers, Enrico

[PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-02 Thread Михаил Кулаков
Hey folks, I am actively using the observe method in my spark jobs and noticed some interesting behavior: Here is an example of working and non-working code: https://gist.github.com/Coola4kov/8aeeb05abd39794f8362a3cf1c66519c In a few words, if I'm joining a dataframe after some filter rules and it becomes empty,

How to configure authentication from a pySpark client to a Spark Connect server ?

2023-11-05 Thread Xiaolong Wang
Hi, Our company is currently introducing the Spark Connect server to production. Most of the issues have been solved yet I don't know how to configure authentication from a pySpark client to the Spark Connect server. I noticed that there is some interceptor configs at the Scala client

Re: Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Mich Talebzadeh
xException: > error parsing regexp: invalid escape sequence: '\m'* > > I tracked it down to *site-packages/pyspark/ml/util.py* line 578 > > metadataPath = os.path.join(path,"metadata") > > which seems innocuous but what's happening is because I'm on

Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Richard Smith
;\m'/ I tracked it down to /site-packages/pyspark/ml/util.py/ line 578 metadataPath = os.path.join(path,"metadata") which seems innocuous but what's happening is because I'm on Windows, os.path.join is appending double backslash, whilst the gcs path uses forward slashes
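The behaviour described can be sketched like this (illustrative paths; posixpath is one common workaround for URI-style paths on Windows, not necessarily the fix PySpark adopted):

```python
import os
import posixpath

path = "gs://bucket/models/my_model"   # illustrative GCS path

# On Windows, os.path.join uses the local separator, inserting a backslash
# that later gets interpreted as an escape sequence (e.g. '\m' in the report).
windows_style = os.path.join(path, "metadata")   # 'gs://bucket/models/my_model\\metadata' on Windows
portable = posixpath.join(path, "metadata")      # 'gs://bucket/models/my_model/metadata' everywhere

print(windows_style, portable)
```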

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Mich Talebzadeh
The fact that you have 60 partitions or brokers in kafka is not directly correlated to Spark Structured Streaming (SSS) executors by itself. See below. Spark starts with 200 partitions. However, by default, Spark/PySpark creates partitions that are equal to the number of CPU cores in the node

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Perez
You can try the 'optimize' command of delta lake. That will help you for sure. It merges small files. Also, it depends on the file format. If you are working with Parquet then still small files should not cause any issues. P. On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong wrote: > Hi Raghavendr

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi Raghavendra, Yes, we are trying to reduce the number of files in delta as well (the small file problem [0][1]). We already have a scheduled app to compact files, but the number of files is still large, at 14K files per day. [0]: https://docs.delta.io/latest/optimizations-oss.html#language-pyt

[PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi all on user@spark: We are looking for advice and suggestions on how to tune the .repartition() parameter. We are using Spark Streaming on our data pipeline to consume messages and persist them to a Delta Lake (https://delta.io/learn/getting-started/). We read messages from a Kafka topic, then

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, What is the purpose for which you want to use repartition() .. to reduce the number of files in delta? Also note that there is an alternative option of using coalesce() instead of repartition(). -- Raghavendra On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote: > Hi all on user@spark: > >
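To make the distinction concrete, a small sketch (illustrative DataFrame and target partition count):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)   # illustrative DataFrame

# repartition(N) always performs a full shuffle and yields evenly sized partitions.
evenly_shuffled = df.repartition(16)

# coalesce(N) merges existing partitions without a shuffle, so it is cheaper
# when only reducing the partition count before a write.
merged = df.coalesce(16)

print(evenly_shuffled.rdd.getNumPartitions(), merged.rdd.getNumPartitions())
```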

using facebook Prophet + pyspark for forecasting - Dataframe has less than 2 non-NaN rows

2023-09-29 Thread karan alang
Hello - Anyone used Prophet + pyspark for forecasting ? I'm trying to backfill forecasts, and running into issues (error - Dataframe has less than 2 non-NaN rows) I'm removing all records with NaN values, yet getting this error. details are in stackoverflow link -> https://stac

[PySpark][Spark logs] Is it possible to dynamically customize Spark logs?

2023-09-25 Thread Ayman Rekik
Hello, What would be the right way, if any, to inject a runtime variable into Spark logs, so that, for example, when Spark (driver/worker) logs an info/warning/error message, the variable is output there (to help filter logs for monitoring and troubleshooting)? Re

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
. > > On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong > wrote: > >> Hi, >> >> Are there any plans to upload PySpark 3.5.0 to PyPI ( >> https://pypi.org/project/pyspark/)? It's still 3.4.1. >> >> Thanks, >> Kezhi >> >> >>

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( > https:/

PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Hi, Are there any plans to upload PySpark 3.5.0 to PyPI ( https://pypi.org/project/pyspark/)? It's still 3.4.1. Thanks, Kezhi

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Sean Owen
isk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damag

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Mich Talebzadeh
ical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Tue, 19 Sept 2023 at 21:50, Sean Owen wrote: > Pyspark follows SQL databases here. stddev is stddev_samp, and sample > standard deviation is the

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Mich Talebzadeh
Hi Helen, Assuming you want to calculate stddev_samp, Spark correctly points STDDEV to STDDEV_SAMP. In the below, replace sales with your table name and AMOUNT_SOLD with the column on which you want to do the calculation: SELECT SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Bjørn Jørgensen
df.select(stddev_samp("value").alias("sample_stddev")).show() +-+ |sample_stddev| +-+ |5.320025062597606| +-+ In MS Excel 365 Norwegian [image: image.png] =STDAVVIKA(B1:B10) =STDAV.S(B1:B10) They both prints 5,32002506 Whi

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample standard deviation is the calculation with the Bessel correction, n-1 in the denominator. stddev_pop is simply standard deviation, with n in the denominator. On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote: > Hi! > &
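A quick sketch of the distinction (illustrative values; stddev defaults to the sample variant with the n-1 denominator, stddev_pop uses n):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(v),) for v in [2, 4, 4, 4, 5, 5, 7, 9]], ["value"])

df.select(
    F.stddev("value").alias("stddev_default"),   # same as stddev_samp (Bessel-corrected)
    F.stddev_samp("value").alias("stddev_samp"),
    F.stddev_pop("value").alias("stddev_pop"),
).show()
```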

Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Helene Bøe
Hi! I am applying the stddev function (so actually stddev_samp), however when comparing with the sample standard deviation in Excel the results do not match. I cannot find in your documentation any more specifics on how the sample standard deviation is calculated, so I cannot compare the differen
