Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

nirandap Tue, 22 Nov 2016 03:12:13 -0800

Hi Maciej,

Thank you for your reply.


I have two questions.
1. I understand your explanation. But in my experience, when I check
the final RDBMS table, the results follow the expected order without
any issue. Is this just a coincidence?

2. I looked into this further. Say I run this query:
"select value, count(*) from table1 group by value order by value"

and I call df.collect() on the resulting DataFrame. In my experience, the
returned values follow the expected order. May I know how Spark manages
to retain the order of the results in a collect operation?
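
To make question 2 concrete, here is a minimal sketch of one way to picture it. It is a simplified model in plain Scala, not Spark's actual code, and every name in it (OrderSketch, rangePartitionAndSort, collectInIndexOrder, writeInCompletionOrder, the bounds) is hypothetical. It assumes that a global "order by" range-partitions the rows and sorts inside each partition, and that collect() concatenates partition results by partition index. Under those assumptions collect() comes back sorted, while independent writers that finish in an arbitrary order do not reproduce that order.

```scala
// Simplified model only; not Spark's implementation. All names are hypothetical.
object OrderSketch {

  // Index of the first bound that is >= v; values above every bound
  // fall into the last partition.
  def partitionIndex(v: Int, bounds: Seq[Int]): Int = {
    val i = bounds.indexWhere(b => v <= b)
    if (i == -1) bounds.length else i
  }

  // Range-partition the values, then sort inside each partition.
  // Every value in partition i is <= every value in partition i + 1.
  def rangePartitionAndSort(values: Seq[Int], bounds: Seq[Int]): Vector[Vector[Int]] = {
    val grouped = values.groupBy(v => partitionIndex(v, bounds))
    Vector.tabulate(bounds.length + 1) { i =>
      grouped.getOrElse(i, Seq.empty[Int]).sorted.toVector
    }
  }

  // Model of collect(): partition results are concatenated by partition index.
  def collectInIndexOrder(parts: Vector[Vector[Int]]): Vector[Int] =
    parts.flatten

  // Model of parallel writers: partitions land in the order the tasks finish.
  def writeInCompletionOrder(parts: Vector[Vector[Int]], completionOrder: Seq[Int]): Vector[Int] =
    completionOrder.toVector.flatMap(i => parts(i))

  def main(args: Array[String]): Unit = {
    val parts = rangePartitionAndSort(Seq(7, 2, 9, 4, 1, 8, 3), Seq(3, 6))
    println(collectInIndexOrder(parts))                    // Vector(1, 2, 3, 4, 7, 8, 9)
    println(writeInCompletionOrder(parts, Seq(2, 0, 1)))   // Vector(7, 8, 9, 1, 2, 3, 4)
  }
}
```

In this model the sorted result of collect() does not depend on which task finishes first, whereas the write path does, which would be consistent with the order in the table being incidental.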

Best


On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark
Developers List] <[email protected]> wrote:

> In commonly used RDBMSs, relations have no fixed order, and the physical
> location of records can change during routine maintenance operations.
> Unless you explicitly order data during retrieval, the order you see is
> incidental and not guaranteed.
>
> Conclusion: order of inserts just doesn't matter.
> On 11/21/2016 10:03 AM, Niranda Perera wrote:
>
> Hi,
>
> Say I have a table with 1 column and 1000 rows. I want to save the result
> in an RDBMS table using the jdbc relation provider. So I run the following
> query,
>
> "insert into table table2 select value, count(*) from table1 group by
> value order by value"
>
> While debugging, I found that the resultant df from select value, count(*)
> from table1 group by value order by value would have around 200+ partitions
> and say I have 4 executors attached to my driver. So, I would have 200+
> writing tasks assigned to 4 executors. I want to understand, how these
> executors are able to write the data to the underlying RDBMS table of
> table2 without messing up the order.
>
> I checked the jdbc insertable relation and in jdbcUtils [1] it does the
> following
>
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
> }
>
> So, my understanding is that all 4 of my executors will run the
> savePartition function (or closure) in parallel, without knowing which one
> should write its data before the others!
>
> In the savePartition method, in the comment, it says
> "Saves a partition of a DataFrame to the JDBC database.  This is done in
>    * a single database transaction in order to avoid repeatedly inserting
>    * data as much as possible."
>
> I want to understand, how these parallel executors save the partition
> without harming the order of the results? Is it by locking the database
> resource, from each executor (i.e. ex0 would first obtain a lock for the
> table and write the partition0, while ex1 ... ex3 would wait till the lock
> is released )?
>
> In my experience, there is no harm done to the order of the results at the
> end of the day!
>
> Would like to hear from you guys! :-)
>
> [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/
>
>
> --
> Best regards,
> Maciej Szymkiewicz
>



-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/



