Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

nirandap Thu, 24 Nov 2016 18:36:07 -0800

Hi Maciej,

Thanks again for the reply. One small clarification about your answer to
my point #1.
I set the master to local[4]. Shouldn't this force Spark to read from 4
partitions in parallel and write them in parallel (by parallel I mean that,
within a set of 4 partitions, the order in which the partitions are read and
written is non-deterministic)? That is why I was surprised to see the final
results in the same order.
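
To make the question concrete, here is a minimal sketch in plain Scala (no Spark, no JDBC) of what I assume happens. FakeTable and writeAll are hypothetical names, and this savePartition is a simplified stand-in that only borrows the name of the real method in JdbcUtils. Each task writes its own partition in iterator order, but nothing coordinates the tasks with each other.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelWriteSketch {
  // Hypothetical stand-in for the RDBMS table: keeps rows in arrival order.
  final class FakeTable {
    private val rows = ArrayBuffer.empty[(Int, Int)] // (partitionId, value)
    def insert(row: (Int, Int)): Unit = rows.synchronized { rows += row }
    def scan(): Vector[(Int, Int)] = rows.synchronized { rows.toVector }
  }

  // Simplified stand-in for savePartition: one task writes the rows of one
  // partition, in the order its iterator yields them.
  def savePartition(table: FakeTable, partitionId: Int, it: Iterator[Int]): Unit =
    it.foreach(v => table.insert((partitionId, v)))

  // Stand-in for df.foreachPartition: one concurrent task per partition,
  // with no lock or ordering between tasks.
  def writeAll(partitions: Vector[Vector[Int]]): FakeTable = {
    val table = new FakeTable
    val tasks = partitions.zipWithIndex.map { case (rows, id) =>
      Future(savePartition(table, id, rows.iterator))
    }
    Await.result(Future.sequence(tasks), 30.seconds)
    table
  }
}
```

In this model, scan() returns every row and keeps the order within each partition, but rows from different partitions may interleave differently from run to run. Only sorting on read (the equivalent of ORDER BY) gives a stable order.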


On Tue, Nov 22, 2016 at 5:24 PM, Maciej Szymkiewicz [via Apache Spark
Developers List] <ml-node+s1001551n19986...@n3.nabble.com> wrote:

> On 11/22/2016 12:11 PM, nirandap wrote:
>
> Hi Maciej,
>
> Thank you for your reply.
>
> I have 2 queries.
> 1. I can understand your explanation. But in my experience, when I check
> the final RDBMS table, I see that the results follow the expected order,
> without an issue. Is this just a coincidence?
>
> Not exactly a coincidence. This is typically a result of the physical
> location of the rows on disk. If writes and reads are sequential (this is
> usually the case), you'll see things in the expected order, but you have to
> remember that the location on disk is not stable. For example, if you perform
> some updates and deletes and then run VACUUM FULL (PostgreSQL), the physical
> location on disk will change, and with it the order you see.
>
> There are of course more advanced mechanisms out there. For example, modern
> columnar RDBMSs like HANA use techniques like dimension sorting and
> differential stores, so even the initial order may differ. There are probably
> other solutions which choose different strategies (maybe some time
> series oriented projects?) that I am not aware of.
>
>
> 2. I was further looking into this. So, say I run this query
> "select value, count(*) from table1 group by value order by value"
>
> and I call df.collect() in the resultant dataframe. From my experience, I
> see that the given values follow the expected order. May I know how spark
> manages to retain the order of the results in a collect operation?
>
> Once you execute an ordered operation, each partition is sorted and the order
> of the partitions defines the global ordering. All that collect does is
> preserve this order by creating an array of results for each partition
> and flattening them.
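
To check my understanding of this, here is a minimal sketch in plain Scala (no Spark). It is a toy model, not Spark's implementation: I assume keys are split into contiguous ranges so that smaller keys land in lower-numbered partitions, and the names rangePartitionAndSort and collect are hypothetical.

```scala
object CollectOrderSketch {
  // Toy model of a globally sorted dataset: rows are split into contiguous
  // key ranges (one per partition) and each partition is sorted on its own.
  def rangePartitionAndSort(rows: Seq[Int], numPartitions: Int): Vector[Vector[Int]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (rows.isEmpty) Vector.fill(numPartitions)(Vector.empty[Int])
    else {
      val lo = rows.min.toLong
      val width = rows.max.toLong - lo + 1
      // Smaller keys always map to lower-numbered partitions.
      val byPartition = rows.groupBy(v => (((v - lo) * numPartitions) / width).toInt)
      Vector.tabulate(numPartitions) { i =>
        byPartition.getOrElse(i, Seq.empty[Int]).sorted.toVector
      }
    }
  }

  // What collect does in this model: build one array per partition, then
  // concatenate the arrays in partition-index order.
  def collect(partitions: Vector[Vector[Int]]): Array[Int] = {
    val perPartition: Vector[Array[Int]] = partitions.map(_.toArray)
    perPartition.flatMap(_.toVector).toArray
  }
}
```

In this model the result is globally sorted only because the partitions are concatenated in index order. Concatenating the same sorted partitions in any other order would break the global ordering.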
>
>
> Best
>
>
> On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark
> Developers List] <[hidden email]> wrote:
>
>> In commonly used RDBMSs, relations have no fixed order, and the physical
>> location of the records can change during routine maintenance operations.
>> Unless you explicitly order the data during retrieval, the order you see is
>> incidental and not guaranteed.
>>
>> Conclusion: order of inserts just doesn't matter.
>> On 11/21/2016 10:03 AM, Niranda Perera wrote:
>>
>> Hi,
>>
>> Say, I have a table with 1 column and 1000 rows. I want to save the
>> result in a RDBMS table using the jdbc relation provider. So I run the
>> following query,
>>
>> "insert into table table2 select value, count(*) from table1 group by
>> value order by value"
>>
>> While debugging, I found that the resultant df from select value,
>> count(*) from table1 group by value order by value would have around 200+
>> partitions and say I have 4 executors attached to my driver. So, I would
>> have 200+ writing tasks assigned to 4 executors. I want to understand, how
>> these executors are able to write the data to the underlying RDBMS table of
>> table2 without messing up the order.
>>
>> I checked the jdbc insertable relation and in jdbcUtils [1] it does the
>> following
>>
>> df.foreachPartition { iterator =>
>>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
>> }
>>
>> So, my understanding is that all 4 of my executors will run the
>> savePartition function (or closure) in parallel, without knowing which one
>> should write its data before the others!
>>
>> In the savePartition method, in the comment, it says
>> "Saves a partition of a DataFrame to the JDBC database.  This is done in
>>    * a single database transaction in order to avoid repeatedly inserting
>>    * data as much as possible."
>>
>> I want to understand, how these parallel executors save the partition
>> without harming the order of the results? Is it by locking the database
>> resource, from each executor (i.e. ex0 would first obtain a lock for the
>> table and write the partition0, while ex1 ... ex3 would wait till the lock
>> is released )?
>>
>> In my experience, there is no harm done to the order of the results at
>> the end of the day!
>>
>> Would like to hear from you guys! :-)
>>
>> [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>>
>> --
>> Niranda Perera
>> @n1r44 <https://twitter.com/N1R44>
>> +94 71 554 8430
>> https://www.linkedin.com/in/niranda
>> https://pythagoreanscript.wordpress.com/
>>
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>>
>>
>>
>
>
>
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/
>
>
>
> --
> Maciej Szymkiewicz
>
>
>
>



-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p20016.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
