Hi Mich and Pol,

Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster restarted, so I lost the stack trace in the application UI. In the snippets I saved, the exception appears to have been thrown from Hive. Given the feedback you've provided, I suspect the issue is with how the Hive components handle concurrent writes.
While using a different format would likely help with this issue, I think I've found an easier solution for now. Currently I have many individual scripts that perform logic and insert the results separately. Instead of each script performing an insert, each script can create a view. After the views are created, one single script can perform one single INSERT, combining the views with UNION ALL statements.

-- Old logic --

-- Script 1
INSERT INTO EventClaims
/*Long, complicated query 1*/

-- Script N
INSERT INTO EventClaims
/*Long, complicated query N*/

-- New logic --

-- Script 1
CREATE VIEW Q1 AS
/*Long, complicated query 1*/

-- Script N
CREATE VIEW QN AS
/*Long, complicated query N*/

-- Final script --
INSERT INTO EventClaims
SELECT * FROM Q1
UNION ALL
SELECT * FROM QN

The old approach produced almost two dozen stages, each with relatively few tasks. The new approach requires only 3 stages; with fewer stages and more tasks per stage, cluster utilization is much higher.
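In case it's useful to anyone else on the list, here's the same pattern as runnable statements. The column and source-table names (claim_id, event_ts, staging_1, staging_n) are placeholders standing in for my real queries, and CREATE OR REPLACE is my addition so reruns don't fail if the view already exists:

-- Script 1: define a view instead of inserting
CREATE OR REPLACE VIEW Q1 AS
SELECT claim_id, event_ts FROM staging_1;  -- stands in for long, complicated query 1

-- Script N: same pattern
CREATE OR REPLACE VIEW QN AS
SELECT claim_id, event_ts FROM staging_n;  -- stands in for long, complicated query N

-- Final script: the only statement that writes to EventClaims
INSERT INTO EventClaims
SELECT * FROM Q1
UNION ALL
SELECT * FROM QN;

Since a view is just a stored query definition, nothing is computed until the final INSERT runs, and that INSERT is then the only writer to the table, so the concurrent-write problem goes away.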
Thanks again for your feedback. I suspect better concurrent writes will be valuable for my project in the future, so this is good information to have ready.

Thanks,

Patrick

On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria <p...@qbeast.io> wrote:

> Hi Patrick,
>
> You can have multiple writers simultaneously writing to the same table in
> HDFS by utilizing an open table format with concurrency control. Several
> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
> Format, offer this capability. All of them provide advanced features that
> will work better in different use cases according to the writing pattern,
> type of queries, data characteristics, etc.
>
> Pol Santamaria
>
>
> On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> It is not Spark SQL that throws the error. It is the underlying database
>> or layer that throws the error.
>>
>> Spark acts as an ETL tool. What is the underlying DB where the table
>> resides? Is concurrency supported? Please send the error to this list.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci <patrick.tu...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm building an application on Spark SQL. The cluster is set up in
>>> standalone mode with HDFS as storage. The only Spark application running
>>> is the Spark Thrift Server using FAIR scheduling mode. Queries are
>>> submitted to the Thrift Server using beeline.
>>>
>>> I have multiple queries that insert rows into the same table
>>> (EventClaims). These queries work fine when run sequentially; however,
>>> some individual queries don't fully utilize the resources available on
>>> the cluster. I would like to run all of these queries concurrently to
>>> make fuller use of available resources. When I attempt to do this, tasks
>>> eventually begin to fail. The stack trace is pretty long, but here are
>>> what look like the most relevant parts:
>>>
>>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>>
>>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in
>>> stage 128.0 (TID 6578) (10.0.50.2 executor 0):
>>> org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while
>>> writing rows to hdfs://10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>>
>>> Is it possible to have multiple concurrent writers to the same table
>>> with Spark SQL? Is there any way to make this work?
>>>
>>> Thanks for the help.
>>>
>>> Patrick