Re: [Spark SQL] spark.sql insert overwrite on existing partition not updating hive metastore partition transient_lastddltime and column_stats

2025-05-02 Thread Sathi Chowdhury
I think it is not happening because it is a ddl time and upsert operation does not recreate the partition. It is just a dml statement.  Sent from Yahoo Mail for iPhone On Friday, May 2, 2025, 7:53 AM, Pradeep wrote: I have a partitioned hive external table as belowscala> spark.sql("describe
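As Sathi notes, INSERT OVERWRITE is DML and does not recreate the partition, so the partition's DDL-time metadata is untouched. A minimal sketch of the workaround, with hypothetical table and column names, is to refresh the statistics explicitly:

```sql
-- Overwrite an existing partition (DML: does not update transient_lastDdlTime)
INSERT OVERWRITE TABLE events PARTITION (dt = '2025-05-01')
SELECT id, payload FROM staging_events WHERE dt = '2025-05-01';

-- Refresh partition-level column statistics explicitly
ANALYZE TABLE events PARTITION (dt = '2025-05-01')
COMPUTE STATISTICS FOR COLUMNS id, payload;
```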

[Spark SQL] spark.sql insert overwrite on existing partition not updating hive metastore partition transient_lastddltime and column_stats

2025-05-01 Thread Pradeep
zation.format=1] | |Partition Provider |Catalog | ++--+ Below is my existing partition created via spark sql scala> val c

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-12 Thread Frank Bertsch
eam. I have added to the >>>>> email. >>>>> >>>>> HTH >>>>> >>>>> Dr Mich Talebzadeh, >>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>>>> >>>>>view my L

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-11 Thread Allison Wang
ave added to the email. >>>> >>>> HTH >>>> >>>> Dr Mich Talebzadeh, >>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>>> >>>>view my Linkedin profile >>>> <https://www.linkedin.com/

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-05 Thread Reynold Xin
Talebzadeh, >>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>> >>>view my Linkedin profile >>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>> >>> >>> >>> >>> >>&

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-05 Thread Soumasish
>> HTH >> >> Dr Mich Talebzadeh, >> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >> >>view my Linkedin profile >> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >> >> >> >> >> >>

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-05 Thread Frank Bertsch
HTH > > Dr Mich Talebzadeh, > Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > > > On Fri, 31 Jan 2025 at 14:28, Frank Bertsch > wrote:

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-01-31 Thread Mich Talebzadeh
at 14:28, Frank Bertsch wrote: > Hi All - > > I'm working heavily in Spark SQL lately. Specifically, I've been trying to > understand if SQL UDFs, similar to the databricks offering > <https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functi

[Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-01-31 Thread Frank Bertsch
Hi All - I'm working heavily in Spark SQL lately. Specifically, I've been trying to understand if SQL UDFs, similar to the databricks offering <https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functions.html>, is being tracked as a feature request within A

Re: Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-14 Thread Aaron Grubb
FROM test_table WHERE type_id = 2' > ) > > > and then from another session, directly calling > > > SparkSession.builder.getOrCreate().sql('SELECT * FROM > spark_catalog.default.test_table').show() >

Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-13 Thread Aaron Grubb
dbc.Driver', url 'jdbc:mysql://example.com:3306/db', user 'user', password 'pass', query 'SELECT name FROM test_table WHERE type_id = 2' ) and then from another session, directly calling -------
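Reconstructing the approach from the truncated snippet above, a hedged sketch (driver, URL, and credentials are placeholders from the original message):

```sql
-- Register a JDBC-backed table in the session catalog / metastore
CREATE TABLE spark_catalog.default.test_table
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'com.mysql.cj.jdbc.Driver',
  url 'jdbc:mysql://example.com:3306/db',
  user 'user',
  password 'pass',
  query 'SELECT name FROM test_table WHERE type_id = 2'
);

-- From another session, the definition is resolved from the catalog
SELECT * FROM spark_catalog.default.test_table;
```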

Re: [Spark SQL] [DISK_ONLY Persistence] getting "this.inMemSorter" is null exception

2024-11-13 Thread Ashwani Pundir
Thanks for the response. Seems like a limitation. If resources are available, then why bother splitting the job into smaller durations (performance is not the concern)? This issue is not about performance optimization; rather, the job is failing with a null pointer exception. Do you have

Re: [Spark SQL] [DISK_ONLY Persistence] getting "this.inMemSorter" is null exception

2024-11-12 Thread Gurunandan
You should be able to split a large job into more manageable jobs based on stages using checkpoints. If a job fails, it can be restarted from the latest checkpoint, saving time and resources; thus checkpoints can be used as recovery points. Smaller stages can be optimized independently, leading to be

Re: [Spark SQL] [DISK_ONLY Persistence] getting "this.inMemSorter" is null exception

2024-11-11 Thread Gurunandan
Hi Ashwani, Please verify the input data by ensuring that the data being processed is valid and free of null values or unexpected data types. If the data undergoes complex transformations before sorting, review those transformations and verify that they don't introduce inconsistencies or nul

Re: Bugs with joins and SQL in Structured Streaming

2024-10-01 Thread Andrzej Zera
stream, though it's not >>> participated in the first time interval join. That said, lower bound of et1 >>> = et3 - 5 mins ~ et3, which is, lower bound of et1 = (wm - 3 mins) - 5 mins >>> ~ (wm - 3 mins) = wm - 8 mins ~ wm - 3 mins. That's why moving the >>>

Re: Bugs with joins and SQL in Structured Streaming

2024-09-30 Thread Jungtaek Lim
said, lower bound of et1 >> = et3 - 5 mins ~ et3, which is, lower bound of et1 = (wm - 3 mins) - 5 mins >> ~ (wm - 3 mins) = wm - 8 mins ~ wm - 3 mins. That's why moving the >> watermark to window.end + 5 mins does not produce the output and fails the >> test. >> >&

Re: Bugs with joins and SQL in Structured Streaming

2024-09-29 Thread Jungtaek Lim
now if this does not make sense to you and we can discuss > more. > > I haven't had time to look into SqlSyntaxTest - we don't have enough tests > on interop between DataFrame <-> SQL for streaming query, so we might have > a non-trivial number of unknowns. I (or folks in my t

Re: Bugs with joins and SQL in Structured Streaming

2024-09-28 Thread Jungtaek Lim
sense to you and we can discuss more. I haven't had time to look into SqlSyntaxTest - we don't have enough tests on interop between DataFrame <-> SQL for streaming query, so we might have a non-trivial number of unknowns. I (or folks in my team) will take a look sooner than later. Tha

Spark SQL readSideCharPadding issue while reading ENUM column from mysql

2024-09-21 Thread Suyash Ajmera
I have upgraded my Spark job from Spark 3.3.1 to Spark 3.5.0. I am querying a MySQL database and applying `*UPPER(col) = UPPER(value)*` in the subsequent SQL query. It works as expected in Spark 3.3.1, but not with 3.5.0. Where condition :: `*UPPER(vn) = 'ERICSSON' AND

Re: [Issue] Spark SQL - broadcast failure

2024-08-01 Thread Sudharshan V
Hi all, Do we have any idea on this. Thanks On Tue, 23 Jul, 2024, 12:54 pm Sudharshan V, wrote: > We removed the explicit broadcast for that particular table and it took > longer time since the join type changed from BHJ to SMJ. > > I wanted to understand how I can find what went wrong with the

A code change for spark ui in Sql tab

2024-07-30 Thread Donvi
Hi, Community. I'm a new onboarder in the Spark community and noticed some lag between Spark and DBR in the Spark UI. This is in DBR for the cost-based optimizer in the Spark UI: https://docs.databricks.com/en/optimizations/cbo.html#spark-sql-ui. To implement a similar thing in the open source part, I've

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
We removed the explicit broadcast for that particular table and it took longer time since the join type changed from BHJ to SMJ. I wanted to understand how I can find what went wrong with the broadcast now. How do I know the size of the table inside of spark memory. I have tried to cache the tabl

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
Hi all, apologies for the delayed response. We are using spark version 3.4.1 in jar and EMR 6.11 runtime. We have disabled the auto broadcast always and would broadcast the smaller tables using explicit broadcast. It was working fine historically and only now it is failing. The data sizes I men

[Spark SQL]: Why the OptimizeSkewedJoin rule does not optimize FullOuterJoin?

2024-07-22 Thread 王仲轩(万章)
Hi, I am a beginner in Spark and currently learning the Spark source code. I have a question about the AQE rule OptimizeSkewedJoin. I have a SQL query using SMJ FullOuterJoin, where there is read skew on the left side (the case is mentioned below). case: remote bytes read total (min, med, max
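For context, these are the AQE settings that gate the OptimizeSkewedJoin rule; the values shown are the usual defaults, included here only as an illustration (the rule itself skips full outer joins regardless of these settings):

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5;
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256MB;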

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Meena Rajani
Can you try disabling broadcast join and see what happens? On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V wrote: > Hi all, > > Been facing a weird issue lately. > In our production code base , we have an explicit broadcast for a small > table. > It is just a look up table that is around 1gb in siz

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Mich Talebzadeh
It will help if you mention the Spark version and the piece of problematic code HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD Imperial College London

[Issue] Spark SQL - broadcast failure

2024-07-08 Thread Sudharshan V
Hi all, Been facing a weird issue lately. In our production code base , we have an explicit broadcast for a small table. It is just a look up table that is around 1gb in size in s3 and just had few million records and 5 columns. The ETL was running fine , but with no change from the codebase nor
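A sketch of the explicit broadcast described above, written as a Spark SQL join hint; the table and column names are hypothetical stand-ins for the 1 GB lookup table in the message:

```sql
-- Force a broadcast-hash join of the small lookup table
SELECT /*+ BROADCAST(l) */ f.id, l.name
FROM fact_table f
JOIN lookup_table l ON f.lookup_id = l.id;

-- Or disable auto-broadcast entirely and rely only on explicit hints
SET spark.sql.autoBroadcastJoinThreshold = -1;
```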

Does Spark 4.0 add Sparkstreaming SQL

2024-06-26 Thread ????
Does Spark 4.0 support connecting to CDC through Spark Streaming SQL, writing stream-processing logic and window functions in SQL? 308027...@qq.com

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it arrives, regardless of whether certain columns, such as the timestamp column, contain NULL values. This behavior is useful where you want to handle incomplete or missing data gracefully within your stateful processing logic.

[Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Juan Casse
I am using applyInPandasWithState in PySpark 3.5.0. I noticed that records with timestamp==NULL are processed (i.e., trigger a call to the stateful function). And, as you would expect, does not advance the watermark. I am taking advantage of this in my application. My question: Is this a support

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
rk.apache.org Subject: Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing This message contains hyperlinks, take precaution before opening these links. Few ideas on top of my head for how to go about solving the problem 1. Try with subsets: Try reproducing t

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
in Spark 2.4.0. > > > *Heap Dump Analysis:*We performed a heap dump analysis after enabling > heap dump on out-of-memory errors, and the analysis revealed the following > significant frames and local variables: > > ``` > > org.apache.spark.sql.Dataset.withPlan(Lorg/a

Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Gaurav Madan
park 2.4.0. *Heap Dump Analysis:*We performed a heap dump analysis after enabling heap dump on out-of-memory errors, and the analysis revealed the following significant frames and local variables: ``` org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/Logical

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly, it sounds like Apache Spark has nothing to do with materialised views. I was hoping it could read them! >>> *spark.sql("SELECT * FROM test.mv <http://test.mv>").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/pytho

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
ny advice, quote "one test result is worth one-thousand expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Fri, 3 May 2024 at 00:54, Mich Talebzadeh wrote: > An issue I encountered while

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
not a bug or an issue. You can initiate a feature request and wish the community to include that into the roadmap. On Fri, May 3, 2024 at 12:01 PM Mich Talebzadeh wrote: > An issue I encountered while working with Materialized Views in Spark SQL. > It appears that there is an inconsistency be

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
some work in the Iceberg community to add the support to Spark through SQL extensions, and Iceberg support for views and materialization tables. Some recent discussions can be found here [1] along with a WIP Iceberg-Spark PR. [1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered a
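As the thread concludes, Spark SQL has no MATERIALIZED VIEW DDL, so the Hive statement fails to parse. A common stand-in, sketched here with hypothetical names, is a table that is refreshed explicitly via CTAS:

```sql
-- DROP MATERIALIZED VIEW IF EXISTS test.mv;   -- ParseException in Spark SQL

-- Approximate a materialized view with an explicitly refreshed table
DROP TABLE IF EXISTS test.mv;
CREATE TABLE test.mv AS
SELECT customer_id, SUM(amount) AS total
FROM test.sales
GROUP BY customer_id;
```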

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create streaming tables using Flink SQL and connect directly through SQL to CDC and Kafka. How can I use SQL for stream processing in Spark? 308027...@qq.com

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
spark-sql called FunctionRegistry that seems to act as an allowlist on what functions Spark can execute. If I remove a function of the registry, is that enough guarantee that that function can "never" be invoked in Spark, or are there other areas that would need to be changed as well?

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that Spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java (slic
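A quick illustration of the point above; the input literal is an arbitrary example. Because Spark seeds xxHash64 with 42 by default, the result will not match what libraries such as python-xxhash produce with their default seed of 0:

```sql
-- Implicitly hashed with seed 42, unlike most external xxHash64 libraries
SELECT xxhash64('spark') AS hash_with_default_seed_42;
```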

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-47718> # Define schema for parsing Kafka messages schema = Stru

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
t;> > col("parsed_value.rowkey").alias("rowkey") \ >>> >, col("parsed_value.timestamp").alias("timestamp") >>> \ >>> >, >>> col("parsed_value.temperature").alias(&qu

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition with

[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All, I've been diving into the source code to get a better understanding of how file splitting works from a user perspective. I've hit a deadend at `PartitionedFile`, for which I cannot seem to find a definition? It appears though it should be found at org.apache.spark.sql.execution.datasources

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
ue) >> > |-- avg(temperature): double (nullable = true) >> > >> > """ >> > resultM = resultC. \ >> > withWatermark("timestamp", "5 minutes"). \ >> >

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
x27;) > > > > # We take the above DataFrame and flatten it to get the > columns > > aliased as "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature" > > resultMF = resultM. \ > >select(

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
tr(uuid.uuid4()),StringType()) > > """ > We take DataFrame resultMF containing temperature info and > write it to Kafka. The uuid is serialized as a string and used as the key. > We take all the columns of the DataFrame and seria

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
("uuid",uuidUdf()) \ .selectExpr("CAST(uuid AS STRING) AS key", "to_json(struct(startOfWindow, endOfWindow, AVGTemperature)) AS value") \ .writeStream \ .outputMode('complete') \ .format("kaf

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello! I am attempting to write a streaming pipeline that would consume data from a Kafka source, manipulate the data, and then write results to a downstream sink (Kafka, Redis, etc). I want to write fully formed SQL instead of using the function API that Spark offers. I read a few guides on

Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
ructured Streaming in production for almost a year >>> already and I want to share the bugs I found in this time. I created a test >>> for each of the issues and put them all here: >>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala >>&g

Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala >> >> I split the issues into three groups: outer joins on event time, interval >> joins and Spark SQL. >> >> Issues related to outer joins: >> >>- When joining three o

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
ach of the issues and put them all here: > https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala > > I split the issues into three groups: outer joins on event time, interval > joins and Spark SQL. > > Issues related to outer joins: > >- When joining thre

Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Andrzej Zera
ssues into three groups: outer joins on event time, interval joins and Spark SQL. Issues related to outer joins: - When joining three or more input streams on event time, if two or more streams don't contain an event for a join key (which is event time), no row will be output eve

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

2024-01-29 Thread Lily Hahn
Hi, I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into an issue with some of our queries that read from PostgreSQL databases. Any attempt to run a Spark SQL query that selects a bpchar without a length specifier from the source DB seems to crash

Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend, thanks a ton was looking for linting for SQL for a long time, looks like https://sqlfluff.com/ is something that can be used :) Thank you so much, and wish you all a wonderful new year. Regards, Gourav On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen wrote: > You can try sqlfl

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Worth trying the EXPLAIN <https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html> statement as suggested by @tianlangstudio HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.co

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
You can try sqlfluff <https://sqlfluff.com/>, it's a linter for SQL code and it seems to have support for sparksql <https://pypi.org/project/sqlfluff/>. On Mon, 25 Dec 2023 at 17:13, ram manickam wrote: > Thanks Mich, Nicholas. I tried looking over the stack overflow pos

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
table or column existence. >> >> is not correct. When you call spark.sql(…), Spark will lookup the table >> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. >> >> Also, when you run DDL via spark.sql(…), Spark will actually run it. So >> spark

回复:Validate spark sql

2023-12-25 Thread tianlangstudio
What about EXPLAIN? https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content <https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content > <https://www.upwork.com/fl/huanqingzhu > <https://www.tianlang.tech/ >Fusion Zhu <http
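A minimal sketch of the EXPLAIN suggestion, with a hypothetical table name: the statement is parsed, analyzed, and planned without being executed, so syntax errors and unresolved table or column references fail fast:

```sql
-- Validates and plans the query without running it
EXPLAIN EXTENDED SELECT id, COUNT(*) FROM my_table GROUP BY id;
```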

Re: Validate spark sql

2023-12-24 Thread ram manickam
; We are not validating against table or column existence. >> >> is not correct. When you call spark.sql(…), Spark will lookup the table >> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. >> >> Also, when you run DDL via spark.sql(…), Spark will

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
p the table > references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. > > Also, when you run DDL via spark.sql(…), Spark will actually run it. So > spark.sql(“drop table my_table”) will actually drop my_table. It’s not a > validation-only operation. > > This question of val

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. Also, when you run DDL via spark.sql(…), Spark will actually run it. So spark.sql(“drop table my_table”) will actually drop my_table. It’s not a validation-only operation. This question of validating SQL is already discussed on St

[sql] how to connect query stage to Spark job/stages?

2023-11-29 Thread Chenghao Lyu
Hi, I am seeking advice on measuring the performance of each QueryStage (QS) when AQE is enabled in Spark SQL. Specifically, I need help to automatically map a QS to its corresponding jobs (or stages) to get the QS runtime metrics. I recorded the QS structure via a customized injected Query

[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi, all. The ANALYZE TABLE command was run from Spark on a Hive table. Question: before I ran the 'ANALYZE TABLE' command on the Spark-sql client, I ran 'ANALYZE TABLE' on the Hive client, and the wrong statistic info showed up. For example: 1. run the analyze table command on the hive client
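For reference, the two commands in question, with a hypothetical table name. One possible source of the mismatch (an assumption, not confirmed in the thread) is that Hive and Spark record statistics under different table properties, so each engine may read the other's numbers incorrectly:

```sql
-- On the Hive client
ANALYZE TABLE db.t PARTITION (dt = '2023-11-24') COMPUTE STATISTICS;

-- On the spark-sql client
ANALYZE TABLE db.t COMPUTE STATISTICS;
DESCRIBE EXTENDED db.t;  -- inspect which numRows/totalSize Spark reports
```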

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
> > On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, > wrote: > >> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am >> querying to Mysql Database and applying >> >> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working >>

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the code

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
ark 3.3.1 to spark 3.5.0, I am > querying to Mysql Database and applying > > `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working > as expected in spark 3.3.1 , but not working with 3.5.0. > > Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st) =

[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my Spark job from Spark 3.3.1 to Spark 3.5.0. I am querying a MySQL database and applying `*UPPER(col) = UPPER(value)*` in the subsequent SQL query. It works as expected in Spark 3.3.1, but not with 3.5.0. Where condition :: `*UPPER(vn) = 'ERICSSON' AND
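A hedged sketch of workarounds for the 3.5.0 behavior change, assuming the cause is read-side CHAR padding (Spark 3.4+ pads CHAR-like columns with trailing spaces on read, so equality with an upper-cased literal can fail); the table and column names are hypothetical:

```sql
-- Trailing pad spaces break UPPER(vn) = 'ERICSSON'; trim before comparing
SELECT * FROM t WHERE UPPER(TRIM(vn)) = 'ERICSSON';

-- Or disable read-side char padding (available since Spark 3.4)
SET spark.sql.readSideCharPadding = false;
```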

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
ift >>> server, which I launch like so: >>> >>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077 >>> >>> The cluster runs in standalone mode and does not use Yarn for resource >>> management. As a result, the Spark Thrift ser

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
t; The only application that runs on the cluster is the Spark Thrift server, >> which I launch like so: >> >> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077 >> >> The cluster runs in standalone mode and does not use Yarn for resource >> managem

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
. This is okay; as of right now, I am the > only user of the cluster. If I add more users, they will also be SQL users, > submitting queries through the Thrift server. > > Let me know if you have any other questions or thoughts. > > Thanks, > > Patrick > > On Thu, Aug 17,

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
acquires all available cluster resources when it starts. This is okay; as of right now, I am the only user of the cluster. If I add more users, they will also be SQL users, submitting queries through the Thrift server. Let me know if you have any other questions or thoughts. Thanks, Patrick On Thu

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
latest alpha as well. This appears to have worked, although I couldn't >>>>> figure out how to get it to use the metastore_db from Spark. >>>>> >>>>> After turning my attention back to Spark, I determined the issue. >>>>> After much

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
tention back to Spark, I determined the issue. After >>>> much troubleshooting, I discovered that if I performed a COUNT(*) using >>>> the same JOINs, the problem query worked. I removed all the columns from >>>> the SELECT statement and added them one by one until

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
lter on it, the query hangs and never completes. If I >>> remove all explicit references to this column, the query works fine. Since >>> I need this column in the results, I went back to the ETL and extracted the >>> values to a dimension table. I replaced the text column

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
ed without issue. >> >> On the topic of Hive, does anyone have any detailed resources for how to >> set up Hive from scratch? Aside from the official site, since those >> instructions didn't work for me. I'm starting to feel uneasy about building >> my proc

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
rting to feel uneasy about building > my process around Spark. There really shouldn't be any instances where I > ask Spark to run legal ANSI SQL code and it just does nothing. In the past > 4 days I've run into 2 of these instances, and the solution was more voodoo > and magic t

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
me. I'm starting to feel uneasy about building my process around Spark. There really shouldn't be any instances where I ask Spark to run legal ANSI SQL code and it just does nothing. In the past 4 days I've run into 2 of these instances, and the solution was more voodoo and magic than

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
hor will in no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >> >> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci >> wrote: >> >>> Hi Mich, >>> >>> Thanks for the feedback.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
on. > > > > > On Sat, 12 Aug 2023 at 12:03, Patrick Tucci > wrote: > >> Hi Mich, >> >> Thanks for the feedback. My original intention after reading your >> response was to stick to Hive for managing tables. Unfortunately, I'm >> running into anot

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
destruction. On Sat, 12 Aug 2023 at 12:03, Patrick Tucci wrote: > Hi Mich, > > Thanks for the feedback. My original intention after reading your response > was to stick to Hive for managing tables. Unfortunately, I'm running into > another case of SQL scripts hanging. Since

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich, Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm goin

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
rver, so I need to be > able to connect to Spark through Thrift server and have it write tables > using Delta Lake instead of Hive. From this StackOverflow question, it > looks like this is possible: > https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-m

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
ely through Thrift server, so I need to be able to connect to Spark through Thrift server and have it write tables using Delta Lake instead of Hive. From this StackOverflow question, it looks like this is possible: https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly. Since this limitation may be due to Hive metastore. By default Spark uses Apache Derby for its database persistence. *However it is limited to only one Spark session at any time for the purposes

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
is hadoop in your case and sounds like there is no >> password! >> >> Once inside that host, hive logs are kept in your case >> /tmp/hadoop/hive.log or go to /tmp and do >> >> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log >> &

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
case and sounds like there is no > password! > > Once inside that host, hive logs are kept in your case > /tmp/hadoop/hive.log or go to /tmp and do > > /tmp> find ./ -name hive.log. It should be under /tmp/hive.log > > Try running the sql inside hive and see what it says >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
are kept in your case /tmp/hadoop/hive.log or go to /tmp and do /tmp> find ./ -name hive.log. It should be under /tmp/hive.log Try running the sql inside hive and see what it says HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile <

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
0.50.1:1 -n hadoop -f command.sql Thanks again for your help. Patrick On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh wrote: > Can you run this sql query through hive itself? > > Are you using this command or similar for your thrift server? > > beeline

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this sql query through hive itself? Are you using this command or similar for your thrift server? beeline -u jdbc:hive2:///1/default org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON ME.ID = MB.Enroll

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
gt; HDFS by utilizing an open table format with concurrency control. Several >> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast >> Format, offer this capability. All of them provide advanced features that >> will work better in different use cases according t

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
4:28 PM Mich Talebzadeh > wrote: > >> It is not Spark SQL that throws the error. It is the underlying Database >> or layer that throws the error. >> >> Spark acts as an ETL tool. What is the underlying DB where the table >> resides? Is concurrency supported.

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
that will work better in different use cases according to the writing pattern, type of queries, data characteristics, etc. *Pol Santamaria* On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh wrote: > It is not Spark SQL that throws the error. It is the underlying Database > or layer that

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error; it is the underlying database or layer that throws it. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list. HTH Mich Talebzadeh, Solutions Architect

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello, I'm building an application on Spark SQL. The cluster is set up in standalone mode with HDFS as storage. The only Spark application running is the Spark Thrift Server using FAIR scheduling mode. Queries are submitted to Thrift Server using beeline. I have multiple queries that insert
