Hi Maciej,

Thank you for your reply.

I have two questions.

1. I understand your explanation. But in my experience, when I check the final RDBMS table, the results follow the expected order without any issue. Is this just a coincidence?

2. I looked into this further. Say I run the query "select value, count(*) from table1 group by value order by value" and call df.collect() on the resulting DataFrame. In my experience, the collected values follow the expected order. How does Spark manage to retain the order of the results in a collect operation?

Best

On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark Developers List] <ml-node+s1001551n19971...@n3.nabble.com> wrote:

> In commonly used RDBM systems relations have no fixed order and physical
> location of the records can change during routine maintenance operations.
> Unless you explicitly order data during retrieval order you see is
> incidental and not guaranteed.
>
> Conclusion: order of inserts just doesn't matter.
>
> On 11/21/2016 10:03 AM, Niranda Perera wrote:
>
> Hi,
>
> Say, I have a table with 1 column and 1000 rows. I want to save the result
> in a RDBMS table using the jdbc relation provider. So I run the following
> query,
>
> "insert into table table2 select value, count(*) from table1 group by
> value order by value"
>
> While debugging, I found that the resultant df from select value, count(*)
> from table1 group by value order by value would have around 200+ partitions
> and say I have 4 executors attached to my driver. So, I would have 200+
> writing tasks assigned to 4 executors. I want to understand, how these
> executors are able to write the data to the underlying RDBMS table of
> table2 without messing up the order.
>
> I checked the jdbc insertable relation and in jdbcUtils [1] it does the
> following
>
>     df.foreachPartition { iterator =>
>       savePartition(getConnection, table, iterator, rddSchema, nullTypes,
>         batchSize, dialect)
>     }
>
> So, my understanding is, all of my 4 executors will parallely run the
> savePartition function (or closure) where they do not know which one should
> write data before the other!
>
> In the savePartition method, in the comment, it says
> "Saves a partition of a DataFrame to the JDBC database. This is done in
> * a single database transaction in order to avoid repeatedly inserting
> * data as much as possible."
>
> I want to understand, how these parallel executors save the partition
> without harming the order of the results? Is it by locking the database
> resource, from each executor (i.e. ex0 would first obtain a lock for the
> table and write the partition0, while ex1 ... ex3 would wait till the lock
> is released )?
>
> In my experience, there is no harm done to the order of the results at the
> end of the day!
>
> Would like to hear from you guys! :-)
>
> [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/
>
> --
> Best regards,
> Maciej Szymkiewicz

--
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p19985.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
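
To illustrate the first question in the message above, here is a minimal sketch in plain Scala (no Spark or JDBC involved). As a simplifying assumption, it models the sorted query result as contiguous sorted partitions, appends those partitions to a "table" in an arbitrary completion order, and then reads the table back with an explicit sort. The names InsertOrderSketch, Row, sortedPartitions, writeInCompletionOrder and readOrdered are hypothetical and are not part of Spark; this is a model of the scenario, not Spark's actual write path.

```scala
import scala.util.Random

object InsertOrderSketch {
  // One (value, count) row, shaped like the output of the GROUP BY query.
  final case class Row(value: Int, count: Long)

  // Assumption for this model: the sorted result is split into contiguous
  // chunks, so every row in partition i sorts before every row in partition i + 1.
  def sortedPartitions(rows: Seq[Row], numPartitions: Int): Vector[Vector[Row]] = {
    val sorted = rows.sortBy(_.value).toVector
    val size = math.max(1, math.ceil(sorted.size.toDouble / numPartitions).toInt)
    sorted.grouped(size).toVector
  }

  // Stand-in for parallel writers: partitions are appended to the table in
  // whatever order they happen to finish (here, a shuffled order).
  def writeInCompletionOrder(parts: Vector[Vector[Row]], rng: Random): Vector[Row] =
    rng.shuffle(parts).flatten

  // Stand-in for "select ... order by value" on retrieval.
  def readOrdered(table: Vector[Row]): Vector[Row] =
    table.sortBy(_.value)

  def main(args: Array[String]): Unit = {
    val rows  = (1 to 20).map(v => Row(v, v.toLong * 2))
    val parts = sortedPartitions(rows, 4)
    val table = writeInCompletionOrder(parts, new Random(42))
    println(s"physical order: ${table.map(_.value).mkString(",")}")
    println(s"ordered read:   ${readOrdered(table).map(_.value).mkString(",")}")
  }
}
```

In this model, changing the seed changes the physical order of the rows, while the ordered read stays the same. That matches the point quoted above: only an explicit ordering on retrieval gives an order you can rely on.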