Hi Maciej, Thanks again for the reply. Once small clarification about the answer about my #1 point. I put local[4] and shouldn't this be forcing spark to read from 4 partitions in parallel and write in parallel (by parallel I mean, the order from which partition, the data is read from a set of 4 partitions, is non-deterministic)? That was the reason why I was surprised to see that the final results are in the same order.
On Tue, Nov 22, 2016 at 5:24 PM, Maciej Szymkiewicz [via Apache Spark Developers List] <ml-node+s1001551n19986...@n3.nabble.com> wrote: > On 11/22/2016 12:11 PM, nirandap wrote: > > Hi Maciej, > > Thank you for your reply. > > I have 2 queries. > 1. I can understand your explanation. But in my experience, when I check > the final RDBMS table, I see that the results follow the expected order, > without an issue. Is this just a coincidence? > > Not exactly a coincidence. This is typically a result of a physical > location on the disk. If writes and reads are sequential, (this is usually > the case) you'll see things in the expected order, but you have to remember > that location on disk is not stable. For example if you perform some > updates, deletes and VACUM ALL (PostgreSQL) physical location on disk will > change and with it things you see. > > There of course more advanced mechanisms out there. For example modern > columnar RDBMS like HANA use techniques like dimensions sorting and > differential stores so even the initial order may differ. There probably > some other solutions which choose different strategies (maybe some times > series oriented projects?) I am not aware of. > > > 2. I was further looking into this. So, say I run this query > "select value, count(*) from table1 group by value order by value" > > and I call df.collect() in the resultant dataframe. From my experience, I > see that the given values follow the expected order. May I know how spark > manages to retain the order of the results in a collect operation? > > Once you execute ordered operation each partition is sorted and the order > of partitions defines the global ordering. All what collect does is just > preserving this order by creating an array of results for each partition > and flattening it. > > > Best > > > On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark > Developers List] <[hidden email] > <http:///user/SendEmail.jtp?type=node&node=19985&i=0>> wrote: > >> In commonly used RDBM systems relations have no fixed order and physical >> location of the records can change during routine maintenance operations. >> Unless you explicitly order data during retrieval order you see is >> incidental and not guaranteed. >> >> Conclusion: order of inserts just doesn't matter. >> On 11/21/2016 10:03 AM, Niranda Perera wrote: >> >> Hi, >> >> Say, I have a table with 1 column and 1000 rows. I want to save the >> result in a RDBMS table using the jdbc relation provider. So I run the >> following query, >> >> "insert into table table2 select value, count(*) from table1 group by >> value order by value" >> >> While debugging, I found that the resultant df from select value, >> count(*) from table1 group by value order by value would have around 200+ >> partitions and say I have 4 executors attached to my driver. So, I would >> have 200+ writing tasks assigned to 4 executors. I want to understand, how >> these executors are able to write the data to the underlying RDBMS table of >> table2 without messing up the order. >> >> I checked the jdbc insertable relation and in jdbcUtils [1] it does the >> following >> >> df.foreachPartition { iterator => >> savePartition(getConnection, table, iterator, rddSchema, nullTypes, >> batchSize, dialect) >> } >> >> So, my understanding is, all of my 4 executors will parallely run the >> savePartition function (or closure) where they do not know which one should >> write data before the other! >> >> In the savePartition method, in the comment, it says >> "Saves a partition of a DataFrame to the JDBC database. This is done in >> * a single database transaction in order to avoid repeatedly inserting >> * data as much as possible." >> >> I want to understand, how these parallel executors save the partition >> without harming the order of the results? Is it by locking the database >> resource, from each executor (i.e. ex0 would first obtain a lock for the >> table and write the partition0, while ex1 ... ex3 would wait till the lock >> is released )? >> >> In my experience, there is no harm done to the order of the results at >> the end of the day! >> >> Would like to hear from you guys! :-) >> >> [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src >> /main/scala/org/apache/spark/sql/execution/datasources/ >> jdbc/JdbcUtils.scala#L277 >> >> -- >> Niranda Perera >> @n1r44 <https://twitter.com/N1R44> >> <a href="tel:%2B94%2071%20554%208430" value="+94715548430" >> target="_blank">+94 71 554 8430 >> https://www.linkedin.com/in/niranda >> https://pythagoreanscript.wordpress.com/ >> >> >> -- >> Best regards, >> Maciej Szymkiewicz >> >> >> >> ------------------------------ >> If you reply to this email, your message will be added to the discussion >> below: >> http://apache-spark-developers-list.1001551.n3.nabble.com/ >> How-is-the-order-ensured-in-the-jdbc-relation-provider- >> when-inserting-data-from-multiple-executors-tp19970p19971.html >> To start a new topic under Apache Spark Developers List, email [hidden >> email] <http:///user/SendEmail.jtp?type=node&node=19985&i=1> >> To unsubscribe from Apache Spark Developers List, click here. >> NAML >> <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> >> > > > > -- > Niranda Perera > @n1r44 <https://twitter.com/N1R44> > +94 71 554 8430 > https://www.linkedin.com/in/niranda > https://pythagoreanscript.wordpress.com/ > > ------------------------------ > View this message in context: Re: How is the order ensured in the jdbc > relation provider when inserting data from multiple executors > <http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p19985.html> > Sent from the Apache Spark Developers List mailing list archive > <http://apache-spark-developers-list.1001551.n3.nabble.com/> at > Nabble.com. > > > -- > Maciej Szymkiewicz > > > > ------------------------------ > If you reply to this email, your message will be added to the discussion > below: > http://apache-spark-developers-list.1001551.n3. > nabble.com/How-is-the-order-ensured-in-the-jdbc-relation- > provider-when-inserting-data-from-multiple-executors-tp19970p19986.html > To start a new topic under Apache Spark Developers List, email > ml-node+s1001551n1...@n3.nabble.com > To unsubscribe from Apache Spark Developers List, click here > <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=bmlyYW5kYS5wZXJlcmFAZ21haWwuY29tfDF8NjAxMDUyMzU5> > . > NAML > <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> > -- Niranda Perera @n1r44 <https://twitter.com/N1R44> +94 71 554 8430 https://www.linkedin.com/in/niranda https://pythagoreanscript.wordpress.com/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p20016.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.