Do these files have the same schema? Probably yes. Can they each be read as an RDD, converted to a DF, registered as a temporary table, and then combined with a UNION ALL over those temporary tables? A sketch of that approach is below.
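As a rough illustration of that first approach (a sketch only, written against the Spark 1.x API used elsewhere in this thread; the parquet paths are taken from the question quoted further down, and the temporary table names p1..p4 are made up). Note that parquet files can be loaded straight into DataFrames, so no explicit RDD step is needed here:

// read each parquet file into its own DataFrame (paths are illustrative)
val df1 = sqlContext.read.parquet("/path/to/parquets/parquet1")
val df2 = sqlContext.read.parquet("/path/to/parquets/parquet2")
val df3 = sqlContext.read.parquet("/path/to/parquets/parquet3")
val df4 = sqlContext.read.parquet("/path/to/parquets/parquet4")

// register each DataFrame as a temporary table
df1.registerTempTable("p1")
df2.registerTempTable("p2")
df3.registerTempTable("p3")
df4.registerTempTable("p4")

// UNION ALL the temporary tables into a single DataFrame
val unioned = sqlContext.sql(
  "SELECT * FROM p1 UNION ALL SELECT * FROM p2 UNION ALL SELECT * FROM p3 UNION ALL SELECT * FROM p4")

This answers the "anything other than DataFrame's unionAll" part of the question through SQL rather than the DataFrame API.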
Alternatively, if these files have different names, they can be put in the same HDFS staging sub-directory and read in one go. This is an example for CSV files all in the "/data/stg/table2" directory:

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

HTH,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 1 March 2016 at 17:00, <andres.fernan...@wellsfargo.com> wrote:

> Good day colleagues. Quick question on Parquet and DataFrames. Right now I
> have the 4 parquet files stored in HDFS under the same path:
>
> /path/to/parquets/parquet1, /path/to/parquets/parquet2,
> /path/to/parquets/parquet3, /path/to/parquets/parquet4…
>
> I want to perform a union on all these parquet files. Is there any other
> way of doing this different from DataFrame's unionAll?
>
> Thank you very much in advance.
>
> Andres Fernandez
>
> From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> Sent: Tuesday, March 01, 2016 1:50 PM
> To: Jeff Zhang
> Cc: Yogesh Vyas; user@spark.apache.org
> Subject: Re: Save DataFrame to Hive Table
>
> Hi
>
> It seems that your code is not specifying in which database your table is
> created.
>
> Try this:
>
> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> scala> // Choose a database
> scala> HiveContext.sql("show databases").show
>
> scala> HiveContext.sql("use test") // I chose the test database
> scala> HiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)")
> scala> HiveContext.sql("desc TableName").show
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     key|      int|   null|
> |   value|   string|   null|
> +--------+---------+-------+
>
> // create a simple DF
>
> val a = Seq((1, "Mich"), (2, "James"))
> val b = a.toDF
>
> // Let me keep it simple. Create a temporary table and do a simple
> // insert/select. No need to convolute it.
>
> b.registerTempTable("tmp")
>
> // Remember this temporary table is created in the SQL context, NOT the
> // HiveContext, so HiveContext will NOT see that table:
>
> HiveContext.sql("INSERT INTO TableName SELECT * FROM tmp")
> org.apache.spark.sql.AnalysisException: no such table tmp; line 1 pos 36
>
> // This will work
>
> sql("INSERT INTO TableName SELECT * FROM tmp")
>
> sql("select count(1) from TableName").show
> +---+
> |_c0|
> +---+
> |  2|
> +---+
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 1 March 2016 at 06:33, Jeff Zhang <zjf...@gmail.com> wrote:
>
> The following line does not execute the sql, so the table is not created.
> Add .show() at the end to execute the sql.
>
> hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value
> STRING)")
>
> On Tue, Mar 1, 2016 at 2:22 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>
> Hi,
>
> I have created a DataFrame in Spark, now I want to save it directly
> into the hive table. How to do it?
>
> I have created the hive table using following hiveContext:
>
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
> hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)");
>
> I am using the following to save it into hive:
>
> DataFrame.write().mode(SaveMode.Append).insertInto("TableName");
>
> But it gives the error:
>
> Exception in thread "main" java.lang.RuntimeException: Table Not Found: TableName
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:266)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:264)
> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:254)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:916)
> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:916)
> at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918)
> at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:917)
> at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:921)
> at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:921)
> at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
> at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
> at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
> at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:176)
> at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)
> at com.honeywell.Track.combine.App.main(App.java:451)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> --
> Best Regards
>
> Jeff Zhang
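For the table-not-found question quoted above, here is a minimal end-to-end sketch of the kind of flow suggested in this thread: do everything through one HiveContext so the Hive table and the DataFrame share the same catalog. This is a sketch only, written in Scala against the Spark 1.x API used in the thread; the database name "test", the column names, and the sample rows are illustrative:

import org.apache.spark.sql.SaveMode

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

// pick the database that should own the table, then create it through the HiveContext
hiveContext.sql("use test")
hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)")

// build the DataFrame with the same HiveContext so its catalog can resolve TableName
val df = Seq((1, "Mich"), (2, "James")).toDF("key", "value")

// append the DataFrame's rows into the Hive table
df.write.mode(SaveMode.Append).insertInto("TableName")

If the job really must stay in Java, as in the original code, the same idea applies: create the table and build the DataFrame from the same HiveContext instance before calling insertInto.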