Re: [Spark SQL] spark.sql insert overwrite on existing partition not updating hive metastore partition transient_lastddltime and column_stats

2025-05-02 Thread Sathi Chowdhury
I think it is not happening because it is a ddl time and upsert operation does not recreate the partition. It is just a dml statement.  Sent from Yahoo Mail for iPhone On Friday, May 2, 2025, 7:53 AM, Pradeep wrote: I have a partitioned hive external table as belowscala> spark.sql("describe
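As Sathi notes, INSERT OVERWRITE is DML and does not recreate the partition, so the partition's DDL-time metadata is untouched. A minimal sketch of the workaround, with hypothetical table and column names, is to refresh the statistics explicitly:

```sql
-- Overwrite an existing partition (DML: does not update transient_lastDdlTime)
INSERT OVERWRITE TABLE events PARTITION (dt = '2025-05-01')
SELECT id, payload FROM staging_events WHERE dt = '2025-05-01';

-- Refresh partition-level column statistics explicitly
ANALYZE TABLE events PARTITION (dt = '2025-05-01')
COMPUTE STATISTICS FOR COLUMNS id, payload;
```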

[Spark SQL] spark.sql insert overwrite on existing partition not updating hive metastore partition transient_lastddltime and column_stats

2025-05-01 Thread Pradeep
zation.format=1] | |Partition Provider |Catalog | ++--+ Below is my existing partition created via spark sql scala> val c

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-12 Thread Frank Bertsch
eam. I have added to the >>>>> email. >>>>> >>>>> HTH >>>>> >>>>> Dr Mich Talebzadeh, >>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>>>> >>>>>view my L

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-11 Thread Allison Wang
ave added to the email. >>>> >>>> HTH >>>> >>>> Dr Mich Talebzadeh, >>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>>> >>>>view my Linkedin profile >>>> <https://www.linkedin.com/

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-05 Thread Reynold Xin
Talebzadeh, >>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>> >>>view my Linkedin profile >>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>> >>> >>> >>> >>> >>&

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-05 Thread Soumasish
>> HTH >> >> Dr Mich Talebzadeh, >> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >> >>view my Linkedin profile >> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >> >> >> >> >> >>

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-05 Thread Frank Bertsch
HTH > > Dr Mich Talebzadeh, > Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > > > On Fri, 31 Jan 2025 at 14:28, Frank Bertsch > wrote:

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-01-31 Thread Mich Talebzadeh
at 14:28, Frank Bertsch wrote: > Hi All - > > I'm working heavily in Spark SQL lately. Specifically, I've been trying to > understand if SQL UDFs, similar to the databricks offering > <https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functi

[Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-01-31 Thread Frank Bertsch
Hi All - I'm working heavily in Spark SQL lately. Specifically, I've been trying to understand if SQL UDFs, similar to the databricks offering <https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functions.html>, is being tracked as a feature request within A

Re: Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-14 Thread Aaron Grubb
FROM test_table WHERE type_id = 2' > ) > > > and then from another session, directly calling > > > SparkSession.builder.getOrCreate().sql('SELECT * FROM > spark_catalog.default.test_table').show() >

Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-13 Thread Aaron Grubb
dbc.Driver', url 'jdbc:mysql://example.com:3306/db', user 'user', password 'pass', query 'SELECT name FROM test_table WHERE type_id = 2' ) and then from another session, directly calling -------
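Reconstructing the approach from the truncated snippet above, a hedged sketch (driver, URL, and credentials are placeholders from the original message):

```sql
-- Register a JDBC-backed table in the session catalog / metastore
CREATE TABLE spark_catalog.default.test_table
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'com.mysql.cj.jdbc.Driver',
  url 'jdbc:mysql://example.com:3306/db',
  user 'user',
  password 'pass',
  query 'SELECT name FROM test_table WHERE type_id = 2'
);

-- From another session, the definition is resolved from the catalog
SELECT * FROM spark_catalog.default.test_table;
```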

Re: [Spark SQL] [DISK_ONLY Persistence] getting "this.inMemSorter" is null exception

2024-11-13 Thread Ashwani Pundir
Thanks for the response. Seems like a limitation. If resources are available, then why bother splitting the job into smaller durations (performance is not the concern)? This issue is not about performance optimization; rather, the job is failing with a null pointer exception. Do you have

Re: [Spark SQL] [DISK_ONLY Persistence] getting "this.inMemSorter" is null exception

2024-11-12 Thread Gurunandan
You should be able to split a large job into more manageable jobs based on stages using checkpoints. If a job fails, it can be restarted from the latest checkpoint, saving time and resources; thus checkpoints can be used as recovery points. Smaller stages can be optimized independently, leading to be

Re: [Spark SQL] [DISK_ONLY Persistence] getting "this.inMemSorter" is null exception

2024-11-11 Thread Gurunandan
Hi Ashwani, Please verify the input data by ensuring that the data being processed is valid and free of null values or unexpected data types. If the data undergoes complex transformations before sorting, review those transformations and verify that they don't introduce inconsistencies or nul

Re: Bugs with joins and SQL in Structured Streaming

2024-10-01 Thread Andrzej Zera
stream, though it's not >>> participated in the first time interval join. That said, lower bound of et1 >>> = et3 - 5 mins ~ et3, which is, lower bound of et1 = (wm - 3 mins) - 5 mins >>> ~ (wm - 3 mins) = wm - 8 mins ~ wm - 3 mins. That's why moving the >>>

Re: Bugs with joins and SQL in Structured Streaming

2024-09-30 Thread Jungtaek Lim
said, lower bound of et1 >> = et3 - 5 mins ~ et3, which is, lower bound of et1 = (wm - 3 mins) - 5 mins >> ~ (wm - 3 mins) = wm - 8 mins ~ wm - 3 mins. That's why moving the >> watermark to window.end + 5 mins does not produce the output and fails the >> test. >> >&

Re: Bugs with joins and SQL in Structured Streaming

2024-09-29 Thread Jungtaek Lim
now if this does not make sense to you and we can discuss > more. > > I haven't had time to look into SqlSyntaxTest - we don't have enough tests > on interop between DataFrame <-> SQL for streaming query, so we might have > a non-trivial number of unknowns. I (or folks in my t

Re: Bugs with joins and SQL in Structured Streaming

2024-09-28 Thread Jungtaek Lim
sense to you and we can discuss more. I haven't had time to look into SqlSyntaxTest - we don't have enough tests on interop between DataFrame <-> SQL for streaming query, so we might have a non-trivial number of unknowns. I (or folks in my team) will take a look sooner than later. Tha

Spark SQL readSideCharPadding issue while reading ENUM column from mysql

2024-09-21 Thread Suyash Ajmera
I have upgraded my Spark job from Spark 3.3.1 to Spark 3.5.0. I am querying a MySQL database and applying `*UPPER(col) = UPPER(value)*` in the subsequent SQL query. It works as expected in Spark 3.3.1, but not with 3.5.0. Where condition :: `*UPPER(vn) = 'ERICSSON' AND

Re: [Issue] Spark SQL - broadcast failure

2024-08-01 Thread Sudharshan V
Hi all, Do we have any idea on this. Thanks On Tue, 23 Jul, 2024, 12:54 pm Sudharshan V, wrote: > We removed the explicit broadcast for that particular table and it took > longer time since the join type changed from BHJ to SMJ. > > I wanted to understand how I can find what went wrong with the

A code change for spark ui in Sql tab

2024-07-30 Thread Donvi
Hi, Community. I'm a new onboarder in the Spark community and noticed some lag between Spark and DBR in the Spark UI. This is in DBR for the cost-based optimizer in the Spark UI: https://docs.databricks.com/en/optimizations/cbo.html#spark-sql-ui. To implement a similar thing in the open source part, I've

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
We removed the explicit broadcast for that particular table and it took longer time since the join type changed from BHJ to SMJ. I wanted to understand how I can find what went wrong with the broadcast now. How do I know the size of the table inside of spark memory. I have tried to cache the tabl

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
Hi all, apologies for the delayed response. We are using spark version 3.4.1 in jar and EMR 6.11 runtime. We have disabled the auto broadcast always and would broadcast the smaller tables using explicit broadcast. It was working fine historically and only now it is failing. The data sizes I men

[Spark SQL]: Why the OptimizeSkewedJoin rule does not optimize FullOuterJoin?

2024-07-22 Thread 王仲轩(万章)
Hi, I am a beginner in Spark and currently learning the Spark source code. I have a question about the AQE rule OptimizeSkewedJoin. I have a SQL query using SMJ FullOuterJoin, where there is read skew on the left side (the case is mentioned below). case: remote bytes read total (min, med, max
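For context, these are the AQE settings that gate the OptimizeSkewedJoin rule; the values shown are the usual defaults, included here only as an illustration (the rule itself skips full outer joins regardless of these settings):

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5;
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256MB;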

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Meena Rajani
Can you try disabling broadcast join and see what happens? On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V wrote: > Hi all, > > Been facing a weird issue lately. > In our production code base , we have an explicit broadcast for a small > table. > It is just a look up table that is around 1gb in siz

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Mich Talebzadeh
It will help if you mention the Spark version and the piece of problematic code HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD Imperial College London

[Issue] Spark SQL - broadcast failure

2024-07-08 Thread Sudharshan V
Hi all, Been facing a weird issue lately. In our production code base , we have an explicit broadcast for a small table. It is just a look up table that is around 1gb in size in s3 and just had few million records and 5 columns. The ETL was running fine , but with no change from the codebase nor
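A sketch of the explicit broadcast described above, written as a Spark SQL join hint; the table and column names are hypothetical stand-ins for the 1 GB lookup table in the message:

```sql
-- Force a broadcast-hash join of the small lookup table
SELECT /*+ BROADCAST(l) */ f.id, l.name
FROM fact_table f
JOIN lookup_table l ON f.lookup_id = l.id;

-- Or disable auto-broadcast entirely and rely only on explicit hints
SET spark.sql.autoBroadcastJoinThreshold = -1;
```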

Does Spark 4.0 add Sparkstreaming SQL

2024-06-26 Thread ????
Does Spark 4.0 support connecting to CDC through Spark Streaming SQL, writing stream-processing logic and window functions in SQL? 308027...@qq.com

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it arrives, regardless of whether certain columns, such as the timestamp column, contain NULL values. This behavior is useful where you want to handle incomplete or missing data gracefully within your stateful processing logic.

[Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Juan Casse
I am using applyInPandasWithState in PySpark 3.5.0. I noticed that records with timestamp==NULL are processed (i.e., trigger a call to the stateful function). And, as you would expect, does not advance the watermark. I am taking advantage of this in my application. My question: Is this a support

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
rk.apache.org Subject: Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing This message contains hyperlinks, take precaution before opening these links. Few ideas on top of my head for how to go about solving the problem 1. Try with subsets: Try reproducing t

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
in Spark 2.4.0. > > > *Heap Dump Analysis:*We performed a heap dump analysis after enabling > heap dump on out-of-memory errors, and the analysis revealed the following > significant frames and local variables: > > ``` > > org.apache.spark.sql.Dataset.withPlan(Lorg/a

Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Gaurav Madan
park 2.4.0. *Heap Dump Analysis:*We performed a heap dump analysis after enabling heap dump on out-of-memory errors, and the analysis revealed the following significant frames and local variables: ``` org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/Logical

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly, it sounds like Apache Spark has nothing to do with materialised views. I was hoping it could read them! >>> *spark.sql("SELECT * FROM test.mv <http://test.mv>").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/pytho

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
ny advice, quote "one test result is worth one-thousand expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Fri, 3 May 2024 at 00:54, Mich Talebzadeh wrote: > An issue I encountered while

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
not a bug or an issue. You can initiate a feature request and wish the community to include that into the roadmap. On Fri, May 3, 2024 at 12:01 PM Mich Talebzadeh wrote: > An issue I encountered while working with Materialized Views in Spark SQL. > It appears that there is an inconsistency be

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
some work in the Iceberg community to add the support to Spark through SQL extensions, and Iceberg support for views and materialization tables. Some recent discussions can be found here [1] along with a WIP Iceberg-Spark PR. [1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered a
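As the thread concludes, Spark SQL has no MATERIALIZED VIEW DDL, so the Hive statement fails to parse. A common stand-in, sketched here with hypothetical names, is a table that is refreshed explicitly via CTAS:

```sql
-- DROP MATERIALIZED VIEW IF EXISTS test.mv;   -- ParseException in Spark SQL

-- Approximate a materialized view with an explicitly refreshed table
DROP TABLE IF EXISTS test.mv;
CREATE TABLE test.mv AS
SELECT customer_id, SUM(amount) AS total
FROM test.sales
GROUP BY customer_id;
```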

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create streaming tables using Flink SQL and connect directly through SQL to CDC and Kafka. How can I use SQL for stream processing in Spark? 308027...@qq.com

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
spark-sql called FunctionRegistry that seems to act as an allowlist on what functions Spark can execute. If I remove a function of the registry, is that enough guarantee that that function can "never" be invoked in Spark, or are there other areas that would need to be changed as well?

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that Spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java (slic
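A quick illustration of the point above; the input literal is an arbitrary example. Because Spark seeds xxHash64 with 42 by default, the result will not match what libraries such as python-xxhash produce with their default seed of 0:

```sql
-- Implicitly hashed with seed 42, unlike most external xxHash64 libraries
SELECT xxhash64('spark') AS hash_with_default_seed_42;
```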

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-47718> # Define schema for parsing Kafka messages schema = Stru

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
t;> > col("parsed_value.rowkey").alias("rowkey") \ >>> >, col("parsed_value.timestamp").alias("timestamp") >>> \ >>> >, >>> col("parsed_value.temperature").alias(&qu

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition with

[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All, I've been diving into the source code to get a better understanding of how file splitting works from a user perspective. I've hit a deadend at `PartitionedFile`, for which I cannot seem to find a definition? It appears though it should be found at org.apache.spark.sql.execution.datasources

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
ue) >> > |-- avg(temperature): double (nullable = true) >> > >> > """ >> > resultM = resultC. \ >> > withWatermark("timestamp", "5 minutes"). \ >> >

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
x27;) > > > > # We take the above DataFrame and flatten it to get the > columns > > aliased as "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature" > > resultMF = resultM. \ > >select(

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
tr(uuid.uuid4()),StringType()) > > """ > We take DataFrame resultMF containing temperature info and > write it to Kafka. The uuid is serialized as a string and used as the key. > We take all the columns of the DataFrame and seria

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
("uuid",uuidUdf()) \ .selectExpr("CAST(uuid AS STRING) AS key", "to_json(struct(startOfWindow, endOfWindow, AVGTemperature)) AS value") \ .writeStream \ .outputMode('complete') \ .format("kaf

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello! I am attempting to write a streaming pipeline that would consume data from a Kafka source, manipulate the data, and then write results to a downstream sink (Kafka, Redis, etc). I want to write fully formed SQL instead of using the function API that Spark offers. I read a few guides on

Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
ructured Streaming in production for almost a year >>> already and I want to share the bugs I found in this time. I created a test >>> for each of the issues and put them all here: >>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala >>&g

Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala >> >> I split the issues into three groups: outer joins on event time, interval >> joins and Spark SQL. >> >> Issues related to outer joins: >> >>- When joining three o

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
ach of the issues and put them all here: > https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala > > I split the issues into three groups: outer joins on event time, interval > joins and Spark SQL. > > Issues related to outer joins: > >- When joining thre

Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Andrzej Zera
ssues into three groups: outer joins on event time, interval joins and Spark SQL. Issues related to outer joins: - When joining three or more input streams on event time, if two or more streams don't contain an event for a join key (which is event time), no row will be output eve

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

2024-01-29 Thread Lily Hahn
Hi, I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into an issue with some of our queries that read from PostgreSQL databases. Any attempt to run a Spark SQL query that selects a bpchar without a length specifier from the source DB seems to crash

Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend, thanks a ton was looking for linting for SQL for a long time, looks like https://sqlfluff.com/ is something that can be used :) Thank you so much, and wish you all a wonderful new year. Regards, Gourav On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen wrote: > You can try sqlfl

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Worth trying the EXPLAIN <https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html> statement as suggested by @tianlangstudio HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.co

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
You can try sqlfluff <https://sqlfluff.com/>, it's a linter for SQL code and it seems to have support for sparksql <https://pypi.org/project/sqlfluff/>. On Mon, 25 Dec 2023 at 17:13, ram manickam wrote: > Thanks Mich, Nicholas. I tried looking over the stack overflow pos

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
table or column existence. >> >> is not correct. When you call spark.sql(…), Spark will lookup the table >> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. >> >> Also, when you run DDL via spark.sql(…), Spark will actually run it. So >> spark

回复:Validate spark sql

2023-12-25 Thread tianlangstudio
What about EXPLAIN? https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content <https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content > <https://www.upwork.com/fl/huanqingzhu > <https://www.tianlang.tech/ >Fusion Zhu <http
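A minimal sketch of the EXPLAIN suggestion, with a hypothetical table name: the statement is parsed, analyzed, and planned without being executed, so syntax errors and unresolved table or column references fail fast:

```sql
-- Validates and plans the query without running it
EXPLAIN EXTENDED SELECT id, COUNT(*) FROM my_table GROUP BY id;
```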

Re: Validate spark sql

2023-12-24 Thread ram manickam
; We are not validating against table or column existence. >> >> is not correct. When you call spark.sql(…), Spark will lookup the table >> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. >> >> Also, when you run DDL via spark.sql(…), Spark will

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
p the table > references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. > > Also, when you run DDL via spark.sql(…), Spark will actually run it. So > spark.sql(“drop table my_table”) will actually drop my_table. It’s not a > validation-only operation. > > This question of val

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. Also, when you run DDL via spark.sql(…), Spark will actually run it. So spark.sql(“drop table my_table”) will actually drop my_table. It’s not a validation-only operation. This question of validating SQL is already discussed on St

[sql] how to connect query stage to Spark job/stages?

2023-11-29 Thread Chenghao Lyu
Hi, I am seeking advice on measuring the performance of each QueryStage (QS) when AQE is enabled in Spark SQL. Specifically, I need help to automatically map a QS to its corresponding jobs (or stages) to get the QS runtime metrics. I recorded the QS structure via a customized injected Query

[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi, all. The ANALYZE TABLE command was run from Spark on a Hive table. Question: before I ran the 'ANALYZE TABLE' command on the Spark-sql client, I ran 'ANALYZE TABLE' on the Hive client, and the wrong statistic info showed up. For example: 1. run the analyze table command on the hive client
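For reference, the two commands in question, with a hypothetical table name. One possible source of the mismatch (an assumption, not confirmed in the thread) is that Hive and Spark record statistics under different table properties, so each engine may read the other's numbers incorrectly:

```sql
-- On the Hive client
ANALYZE TABLE db.t PARTITION (dt = '2023-11-24') COMPUTE STATISTICS;

-- On the spark-sql client
ANALYZE TABLE db.t COMPUTE STATISTICS;
DESCRIBE EXTENDED db.t;  -- inspect which numRows/totalSize Spark reports
```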

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
> > On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, > wrote: > >> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am >> querying to Mysql Database and applying >> >> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working >>

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the code

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
ark 3.3.1 to spark 3.5.0, I am > querying to Mysql Database and applying > > `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working > as expected in spark 3.3.1 , but not working with 3.5.0. > > Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st) =

[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my Spark job from Spark 3.3.1 to Spark 3.5.0. I am querying a MySQL database and applying `*UPPER(col) = UPPER(value)*` in the subsequent SQL query. It works as expected in Spark 3.3.1, but not with 3.5.0. Where condition :: `*UPPER(vn) = 'ERICSSON' AND
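A hedged sketch of workarounds for the 3.5.0 behavior change, assuming the cause is read-side CHAR padding (Spark 3.4+ pads CHAR-like columns with trailing spaces on read, so equality with an upper-cased literal can fail); the table and column names are hypothetical:

```sql
-- Trailing pad spaces break UPPER(vn) = 'ERICSSON'; trim before comparing
SELECT * FROM t WHERE UPPER(TRIM(vn)) = 'ERICSSON';

-- Or disable read-side char padding (available since Spark 3.4)
SET spark.sql.readSideCharPadding = false;
```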

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
ift >>> server, which I launch like so: >>> >>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077 >>> >>> The cluster runs in standalone mode and does not use Yarn for resource >>> management. As a result, the Spark Thrift ser

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
t; The only application that runs on the cluster is the Spark Thrift server, >> which I launch like so: >> >> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077 >> >> The cluster runs in standalone mode and does not use Yarn for resource >> managem

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
. This is okay; as of right now, I am the > only user of the cluster. If I add more users, they will also be SQL users, > submitting queries through the Thrift server. > > Let me know if you have any other questions or thoughts. > > Thanks, > > Patrick > > On Thu, Aug 17,

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
acquires all available cluster resources when it starts. This is okay; as of right now, I am the only user of the cluster. If I add more users, they will also be SQL users, submitting queries through the Thrift server. Let me know if you have any other questions or thoughts. Thanks, Patrick On Thu

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
latest alpha as well. This appears to have worked, although I couldn't >>>>> figure out how to get it to use the metastore_db from Spark. >>>>> >>>>> After turning my attention back to Spark, I determined the issue. >>>>> After much

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
tention back to Spark, I determined the issue. After >>>> much troubleshooting, I discovered that if I performed a COUNT(*) using >>>> the same JOINs, the problem query worked. I removed all the columns from >>>> the SELECT statement and added them one by one until

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
lter on it, the query hangs and never completes. If I >>> remove all explicit references to this column, the query works fine. Since >>> I need this column in the results, I went back to the ETL and extracted the >>> values to a dimension table. I replaced the text column

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
ed without issue. >> >> On the topic of Hive, does anyone have any detailed resources for how to >> set up Hive from scratch? Aside from the official site, since those >> instructions didn't work for me. I'm starting to feel uneasy about building >> my proc

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
rting to feel uneasy about building > my process around Spark. There really shouldn't be any instances where I > ask Spark to run legal ANSI SQL code and it just does nothing. In the past > 4 days I've run into 2 of these instances, and the solution was more voodoo > and magic t

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
me. I'm starting to feel uneasy about building my process around Spark. There really shouldn't be any instances where I ask Spark to run legal ANSI SQL code and it just does nothing. In the past 4 days I've run into 2 of these instances, and the solution was more voodoo and magic than

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
hor will in no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >> >> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci >> wrote: >> >>> Hi Mich, >>> >>> Thanks for the feedback.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
on. > > > > > On Sat, 12 Aug 2023 at 12:03, Patrick Tucci > wrote: > >> Hi Mich, >> >> Thanks for the feedback. My original intention after reading your >> response was to stick to Hive for managing tables. Unfortunately, I'm >> running into anot

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
destruction. On Sat, 12 Aug 2023 at 12:03, Patrick Tucci wrote: > Hi Mich, > > Thanks for the feedback. My original intention after reading your response > was to stick to Hive for managing tables. Unfortunately, I'm running into > another case of SQL scripts hanging. Since

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich, Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm goin

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
rver, so I need to be > able to connect to Spark through Thrift server and have it write tables > using Delta Lake instead of Hive. From this StackOverflow question, it > looks like this is possible: > https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-m

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
ely through Thrift server, so I need to be able to connect to Spark through Thrift server and have it write tables using Delta Lake instead of Hive. From this StackOverflow question, it looks like this is possible: https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly. Since this limitation may be due to Hive metastore. By default Spark uses Apache Derby for its database persistence. *However it is limited to only one Spark session at any time for the purposes

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
is hadoop in your case and sounds like there is no >> password! >> >> Once inside that host, hive logs are kept in your case >> /tmp/hadoop/hive.log or go to /tmp and do >> >> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log >> &

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
case and sounds like there is no > password! > > Once inside that host, hive logs are kept in your case > /tmp/hadoop/hive.log or go to /tmp and do > > /tmp> find ./ -name hive.log. It should be under /tmp/hive.log > > Try running the sql inside hive and see what it says >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
are kept in your case /tmp/hadoop/hive.log or go to /tmp and do /tmp> find ./ -name hive.log. It should be under /tmp/hive.log Try running the sql inside hive and see what it says HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile <

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
0.50.1:1 -n hadoop -f command.sql Thanks again for your help. Patrick On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh wrote: > Can you run this sql query through hive itself? > > Are you using this command or similar for your thrift server? > > beeline

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this sql query through hive itself? Are you using this command or similar for your thrift server? beeline -u jdbc:hive2:///1/default org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON ME.ID = MB.Enroll

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
gt; HDFS by utilizing an open table format with concurrency control. Several >> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast >> Format, offer this capability. All of them provide advanced features that >> will work better in different use cases according t

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
4:28 PM Mich Talebzadeh > wrote: > >> It is not Spark SQL that throws the error. It is the underlying Database >> or layer that throws the error. >> >> Spark acts as an ETL tool. What is the underlying DB where the table >> resides? Is concurrency supported.

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
that will work better in different use cases according to the writing pattern, type of queries, data characteristics, etc. *Pol Santamaria* On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh wrote: > It is not Spark SQL that throws the error. It is the underlying Database > or layer that

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error; it is the underlying database or layer that throws it. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list. HTH Mich Talebzadeh, Solutions Architect

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello, I'm building an application on Spark SQL. The cluster is set up in standalone mode with HDFS as storage. The only Spark application running is the Spark Thrift Server using FAIR scheduling mode. Queries are submitted to Thrift Server using beeline. I have multiple queries that insert
