Hi Mich and Pol,

Thanks for the feedback. The underlying storage layer is Hadoop 3.3.5 (HDFS).
The cluster restarted, so I lost the full stack trace in the application UI.
From the snippets I saved, it looks like the exception was thrown by Hive.
Given the feedback you've provided, I suspect the issue is with how the Hive
components handle concurrent writes.

While using a different format would likely help with this issue, I think I
have found an easier solution for now. Currently I have many individual
scripts that each run their own logic and insert their results separately.
Instead of performing an insert, each script can create a view. After the
views are created, a single final script can perform one INSERT, combining
the views with UNION ALL statements.

-- Old logic --
-- Script 1
INSERT INTO EventClaims
/*Long, complicated query 1*/

-- Script N
INSERT INTO EventClaims
/*Long, complicated query N*/

-- New logic --
-- Script 1
CREATE VIEW Q1 AS
/*Long, complicated query 1*/

-- Script N
CREATE VIEW QN AS
/*Long, complicated query N*/

-- Final script --
INSERT INTO EventClaims
SELECT * FROM Q1 UNION ALL
-- ... UNION ALL each intermediate view ...
SELECT * FROM QN;
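
One wrinkle worth noting: a plain CREATE VIEW fails if the view already
exists, so rerunning a script would error out. Here's a sketch of a
rerunnable variant (CREATE OR REPLACE VIEW is supported by both Hive and
Spark SQL; the column and table names below are placeholders, not my real
schema):

-- Script 1, rerunnable variant
CREATE OR REPLACE VIEW Q1 AS
SELECT claim_id, event_date      -- placeholder columns
FROM staging_events_1            -- placeholder source table
WHERE event_date >= '2023-01-01';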

The old approach produced almost two dozen stages, each with relatively few
tasks. The new approach requires only three stages. With fewer stages and
more tasks per stage, cluster utilization is much higher.

Thanks again for your feedback. I suspect better concurrent write support
will be valuable for my project in the future, so this is good information
to have ready.

Thanks,

Patrick

On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria <p...@qbeast.io> wrote:

> Hi Patrick,
>
> You can have multiple writers simultaneously writing to the same table in
> HDFS by utilizing an open table format with concurrency control. Several
> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
> Format, offer this capability. All of them provide advanced features that
> will work better in different use cases according to the writing pattern,
> type of queries, data characteristics, etc.
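>
> For illustration, here is a minimal Spark SQL sketch using Apache Iceberg.
> It assumes the Iceberg runtime jar is on the Spark classpath and that an
> Iceberg catalog named "iceberg" has been configured; the table and column
> names are placeholders:
>
> -- Create an Iceberg table; Iceberg uses optimistic concurrency control,
> -- so multiple writers can commit to the same table at once.
> CREATE TABLE iceberg.db.eventclaims (
>   claim_id BIGINT,
>   event_date DATE
> ) USING iceberg;
>
> -- Each concurrent session can then insert independently; conflicting
> -- commits are retried rather than corrupting the table.
> INSERT INTO iceberg.db.eventclaims VALUES (1, DATE '2023-07-29');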
>
> *Pol Santamaria*
>
>
> On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> It is not Spark SQL that throws the error; it is the underlying database
>> or storage layer that throws it.
>>
>> Spark acts as an ETL tool. What is the underlying DB where the table
>> resides? Is concurrency supported? Please send the error to this list.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>    view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci <patrick.tu...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm building an application on Spark SQL. The cluster is set up in
>>> standalone mode with HDFS as storage. The only Spark application running is
>>> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
>>> to Thrift Server using beeline.
>>>
>>> I have multiple queries that insert rows into the same table
>>> (EventClaims). These queries work fine when run sequentially; however, some
>>> individual queries don't fully utilize the resources available on the
>>> cluster. I would like to run all of these queries concurrently to make
>>> fuller use of available resources. When I attempt to do this, tasks
>>> eventually begin to fail. The stack trace is pretty long, but here are
>>> the parts that look most relevant:
>>>
>>>
>>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>>
>>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
>>> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
>>> [TASK_WRITE_FAILED] Task failed while writing rows to
>>> hdfs://10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>>
>>> Is it possible to have multiple concurrent writers to the same table
>>> with Spark SQL? Is there any way to make this work?
>>>
>>> Thanks for the help.
>>>
>>> Patrick
>>>
>>
