Hi all, To explain the scenario a bit more.
We need to retain the order when writing to the RDBMS tables. The way we found was to execute the DB Write *job* for each partition which is really costly. One reason being that the partition count is really high( 200) and it seems we cannot control the count( due to the count being inferred from the parent RDD). When we execute the insert job, the executors are run in parallel to execute the writing tasks which jumbles up the order. Is there anyway we can execute the tasks sequentially? or any other way of doing this? We have noticed that you handle this from inside Spark itself, to retain the order when writing to RDBMS from Spark. Thanks, Sachith On Fri, Nov 25, 2016 at 8:05 AM, nirandap <niranda.per...@gmail.com> wrote: > Hi Maciej, > > Thanks again for the reply. Once small clarification about the answer > about my #1 point. > I put local[4] and shouldn't this be forcing spark to read from 4 > partitions in parallel and write in parallel (by parallel I mean, the order > from which partition, the data is read from a set of 4 partitions, is > non-deterministic)? That was the reason why I was surprised to see that the > final results are in the same order. > > On Tue, Nov 22, 2016 at 5:24 PM, Maciej Szymkiewicz [via Apache Spark > Developers List] <[hidden email] > <http:///user/SendEmail.jtp?type=node&node=20016&i=0>> wrote: > >> On 11/22/2016 12:11 PM, nirandap wrote: >> >> Hi Maciej, >> >> Thank you for your reply. >> >> I have 2 queries. >> 1. I can understand your explanation. But in my experience, when I check >> the final RDBMS table, I see that the results follow the expected order, >> without an issue. Is this just a coincidence? >> >> Not exactly a coincidence. This is typically a result of a physical >> location on the disk. If writes and reads are sequential, (this is usually >> the case) you'll see things in the expected order, but you have to remember >> that location on disk is not stable. For example if you perform some >> updates, deletes and VACUM ALL (PostgreSQL) physical location on disk will >> change and with it things you see. >> >> There of course more advanced mechanisms out there. For example modern >> columnar RDBMS like HANA use techniques like dimensions sorting and >> differential stores so even the initial order may differ. There probably >> some other solutions which choose different strategies (maybe some times >> series oriented projects?) I am not aware of. >> >> >> 2. I was further looking into this. So, say I run this query >> "select value, count(*) from table1 group by value order by value" >> >> and I call df.collect() in the resultant dataframe. From my experience, I >> see that the given values follow the expected order. May I know how spark >> manages to retain the order of the results in a collect operation? >> >> Once you execute ordered operation each partition is sorted and the order >> of partitions defines the global ordering. All what collect does is just >> preserving this order by creating an array of results for each partition >> and flattening it. >> >> >> Best >> >> >> On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark >> Developers List] <[hidden email] >> <http:///user/SendEmail.jtp?type=node&node=19985&i=0>> wrote: >> >>> In commonly used RDBM systems relations have no fixed order and physical >>> location of the records can change during routine maintenance operations. >>> Unless you explicitly order data during retrieval order you see is >>> incidental and not guaranteed. >>> >>> Conclusion: order of inserts just doesn't matter. >>> On 11/21/2016 10:03 AM, Niranda Perera wrote: >>> >>> Hi, >>> >>> Say, I have a table with 1 column and 1000 rows. I want to save the >>> result in a RDBMS table using the jdbc relation provider. So I run the >>> following query, >>> >>> "insert into table table2 select value, count(*) from table1 group by >>> value order by value" >>> >>> While debugging, I found that the resultant df from select value, >>> count(*) from table1 group by value order by value would have around 200+ >>> partitions and say I have 4 executors attached to my driver. So, I would >>> have 200+ writing tasks assigned to 4 executors. I want to understand, how >>> these executors are able to write the data to the underlying RDBMS table of >>> table2 without messing up the order. >>> >>> I checked the jdbc insertable relation and in jdbcUtils [1] it does the >>> following >>> >>> df.foreachPartition { iterator => >>> savePartition(getConnection, table, iterator, rddSchema, >>> nullTypes, batchSize, dialect) >>> } >>> >>> So, my understanding is, all of my 4 executors will parallely run the >>> savePartition function (or closure) where they do not know which one should >>> write data before the other! >>> >>> In the savePartition method, in the comment, it says >>> "Saves a partition of a DataFrame to the JDBC database. This is done in >>> * a single database transaction in order to avoid repeatedly inserting >>> * data as much as possible." >>> >>> I want to understand, how these parallel executors save the partition >>> without harming the order of the results? Is it by locking the database >>> resource, from each executor (i.e. ex0 would first obtain a lock for the >>> table and write the partition0, while ex1 ... ex3 would wait till the lock >>> is released )? >>> >>> In my experience, there is no harm done to the order of the results at >>> the end of the day! >>> >>> Would like to hear from you guys! :-) >>> >>> [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src >>> /main/scala/org/apache/spark/sql/execution/datasources/jdbc/ >>> JdbcUtils.scala#L277 >>> >>> -- >>> Niranda Perera >>> @n1r44 <https://twitter.com/N1R44> >>> <a href="tel:%2B94%2071%20554%208430" value="<a >>> href="tel:%2B94715548430" value="+94715548430 <071%20554%208430>" >>> target="_blank">+94715548430 <071%20554%208430>" target="_blank">+94 71 >>> 554 8430 >>> https://www.linkedin.com/in/niranda >>> https://pythagoreanscript.wordpress.com/ >>> >>> >>> -- >>> Best regards, >>> Maciej Szymkiewicz >>> >>> >>> >>> ------------------------------ >>> If you reply to this email, your message will be added to the discussion >>> below: >>> http://apache-spark-developers-list.1001551.n3.nabble.com/Ho >>> w-is-the-order-ensured-in-the-jdbc-relation-provider-when- >>> inserting-data-from-multiple-executors-tp19970p19971.html >>> To start a new topic under Apache Spark Developers List, email [hidden >>> email] <http:///user/SendEmail.jtp?type=node&node=19985&i=1> >>> To unsubscribe from Apache Spark Developers List, click here. >>> NAML >>> <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> >>> >> >> >> >> -- >> Niranda Perera >> @n1r44 <https://twitter.com/N1R44> >> <a href="tel:%2B94%2071%20554%208430" value="+94715548430 >> <071%20554%208430>" target="_blank">+94 71 554 8430 <071%20554%208430> >> https://www.linkedin.com/in/niranda >> https://pythagoreanscript.wordpress.com/ >> >> ------------------------------ >> View this message in context: Re: How is the order ensured in the jdbc >> relation provider when inserting data from multiple executors >> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p19985.html> >> Sent from the Apache Spark Developers List mailing list archive >> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at >> Nabble.com. >> >> >> -- >> Maciej Szymkiewicz >> >> >> >> ------------------------------ >> If you reply to this email, your message will be added to the discussion >> below: >> http://apache-spark-developers-list.1001551.n3.nabble.com/ >> How-is-the-order-ensured-in-the-jdbc-relation-provider- >> when-inserting-data-from-multiple-executors-tp19970p19986.html >> To start a new topic under Apache Spark Developers List, email [hidden >> email] <http:///user/SendEmail.jtp?type=node&node=20016&i=1> >> To unsubscribe from Apache Spark Developers List, click here. >> NAML >> <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> >> > > > > -- > Niranda Perera > @n1r44 <https://twitter.com/N1R44> > +94 71 554 8430 <071%20554%208430> > https://www.linkedin.com/in/niranda > https://pythagoreanscript.wordpress.com/ > > ------------------------------ > View this message in context: Re: How is the order ensured in the jdbc > relation provider when inserting data from multiple executors > <http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p20016.html> > Sent from the Apache Spark Developers List mailing list archive > <http://apache-spark-developers-list.1001551.n3.nabble.com/> at > Nabble.com. > -- Thanks, Sachith Withana