Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Aaron Grubb
(cross-posting from the HBase user list as I didn't receive a reply there) Hello, I'm completely new to Spark and evaluating setting up a cluster either in YARN or standalone. Our idea for the general workflow is to create a concatenated dataframe using historical pickle/parquet files (whichever i

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Aaron Grubb
om relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Thu, 5 Jan 2023 at 09:35, Aaron Grubb <aa...@kaden.ai> wrote: (cross-posting from the HBase use

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-06 Thread Aaron Grubb
ll in no case be liable for any monetary damages arising from such loss, damage or destruction. On Thu, 5 Jan 2023 at 22:53, Aaron Grubb <aa...@kaden.ai> wrote: Hi Mich, Thanks for your reply. In hindsight I realize I didn't provide enough information about the infr

Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered as breaking for tools that build on < 3.4.0 while using AWS. From: Oxlade, Dan Sent: Wednesday, April 3, 2024 2:41:11 PM To: user@spark.apache.org Subject: [Sp

Hitting SPARK-45858 on Kubernetes - Unavoidable bug or misconfiguration?

2024-08-19 Thread Aaron Grubb
Hi all, I'm running Spark on Kubernetes on AWS using only spot instances for executors with dynamic allocation enabled. This particular job is being triggered by Airflow and it hit this bug [1] 6 times in a row. However, I had recently switched to using PersistentVolumeClaims in Spark with spark

Re: Hitting SPARK-45858 on Kubernetes - Unavoidable bug or misconfiguration?

2024-08-20 Thread Aaron Grubb
at 13:01 +0000, Aaron Grubb wrote: > Hi all, > > I'm running Spark on Kubernetes on AWS using only spot instances for > executors with dynamic allocation enabled. This particular job is > being > triggered by Airflow and it hit this bug [1] 6 times in a row. However, I had

Incorrect Results and SIGSEGV on Read with Iceberg + PySpark + Nessie

2025-02-06 Thread Aaron Grubb
Hi all, I filed a bug with the Iceberg team [1] but I'm not sure that it's 100% specific to Iceberg (I assume it is as data in the related parquet file is correct and session.read.parquet always returns correct results) so I figured I would flag it here in case anyone has some insight. Currently

Re: Incorrect Results and SIGSEGV on Read with Iceberg + PySpark + Nessie

2025-02-06 Thread Aaron Grubb
Someone just replied to the bug; it was already known about and will be fixed in the upcoming Iceberg 1.7.2 release. On Thu, 2025-02-06 at 09:35 +0000, Aaron Grubb wrote: > Hi all, > > I filed a bug with the Iceberg team [1] but I'm not sure that it's 100% > specific to I

Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-13 Thread Aaron Grubb
Hi all, I'm trying to figure out how to persist a table definition in a catalog that can be used from different sessions. Something along the lines of --- CREATE TABLE spark_catalog.default.test_table ( name string ) USING jdbc OPTIONS ( driver 'com.mysql.cj.jdbc.Driver'

Re: Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-14 Thread Aaron Grubb
2 On Mon, 2025-01-13 at 18:49 +0000, Aaron Grubb wrote: > Hi all, > > I'm trying to figure out how to persist a table definition in a catalog that > can be used from different sessions. Something along the lines > of > > --- > CREATE TABLE spark_catal