Thanks all for your replies! I tested both approaches: registering a temp table and then executing SQL vs. saving directly to an HDFS file path. The problem with the second approach is that I am inserting data into a Hive table, so if I create a new partition this way, the Hive metadata is not updated.
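For reference, here is a minimal sketch of that first approach as I understand it (all table, column, and staging names below are placeholders, and my_table is assumed to already exist as a Hive table partitioned by date):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TempTableInsertSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("temp-table-insert"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Placeholder data; in practice this is the DataFrame holding the new rows.
    val df = sc.parallelize(Seq(("a", 1, "2015-12-02"), ("b", 2, "2015-12-02")))
      .toDF("name", "value", "date")

    // Register the DataFrame so it can be referenced from HiveQL.
    df.registerTempTable("staging")

    // Dynamic partition insert through the HiveContext, so the metastore
    // learns about the new partition as part of the insert.
    sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    sqlContext.sql(
      """INSERT INTO TABLE my_table PARTITION (date)
        |SELECT name, value, date FROM staging""".stripMargin)
  }
}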
So I will be going with the first approach. Follow-up question in this case: what is the cost of registering a temp table? Is there a limit to the number of temp tables that can be registered by a Spark context?

Thanks again for your input.

Isabelle

On Wed, Dec 2, 2015 at 10:30 AM, Michael Armbrust <mich...@databricks.com> wrote:

> You might also coalesce to 1 (or some small number) before writing to
> avoid creating a lot of files in that partition if you know that there is
> not a ton of data.
>
> On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra <rmis...@snappydata.io> wrote:
>
>> As long as all your data is being inserted by Spark, and hence uses the
>> same hash partitioner, what Fengdong mentioned should work.
>>
>> On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu <fengdo...@everstring.com> wrote:
>>
>>> Hi,
>>> you can try the following:
>>>
>>> if your table is under location "/test/table/" on HDFS
>>> and has partitions:
>>>
>>> "/test/table/dt=2012"
>>> "/test/table/dt=2013"
>>>
>>> df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table")
>>>
>>> On Dec 2, 2015, at 10:50 AM, Isabelle Phan <nlip...@gmail.com> wrote:
>>>
>>> df.write.partitionBy("date").insertInto("my_table")
>>
>> --
>> Regards,
>> Rishitesh Mishra,
>> SnappyData (http://www.snappydata.io/)
>> https://in.linkedin.com/in/rishiteshmishra
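For anyone reading this thread later, a minimal sketch of the coalesce-then-partitionBy write suggested above, reusing the table location and partition column from Fengdong's example (the helper name is hypothetical):

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: path and partition column follow the example above.
def appendPartition(df: DataFrame): Unit = {
  df.coalesce(1)                 // keep the number of files per partition small
    .write
    .mode(SaveMode.Append)
    .partitionBy("date")         // writes /test/table/date=<value>/ directories
    .save("/test/table")
  // A directory written this way is not yet visible to Hive: the metastore
  // still has to be told about the new partition, e.g. with
  // MSCK REPAIR TABLE my_table or ALTER TABLE my_table ADD PARTITION (...).
}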