Do these files have the same schema? Probably yes.

Can each file be read as an RDD, converted to a DataFrame, and registered as a
temporary table, with a UNION ALL over those temporary tables? A sketch of that
approach is below.
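
A minimal sketch of that approach (here each Parquet file is read straight into a
DataFrame rather than via an explicit RDD; the temp table names t1 and t2 are
illustrative):

val df1 = sqlContext.read.parquet("/path/to/parquets/parquet1")
val df2 = sqlContext.read.parquet("/path/to/parquets/parquet2")
df1.registerTempTable("t1")
df2.registerTempTable("t2")
val combined = sqlContext.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2")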

Alternatively, if these files have different names, they can be put in the
same HDFS staging sub-directory and read in one go. This is an example for
CSV files, all in the "/data/stg/table2" directory:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/data/stg/table2")


HTH,


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 1 March 2016 at 17:00, <andres.fernan...@wellsfargo.com> wrote:

> Good day colleagues. Quick question on Parquet and DataFrames. Right now I
> have 4 Parquet files stored in HDFS under the same path:
>
> /path/to/parquets/parquet1, /path/to/parquets/parquet2,
> /path/to/parquets/parquet3, /path/to/parquets/parquet4…
>
> I want to perform a union on all these Parquet files. Is there any other
> way of doing this besides DataFrame’s unionAll?
>
>
>
> Thank you very much in advance.
>
>
>
> Andres Fernandez
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, March 01, 2016 1:50 PM
> *To:* Jeff Zhang
> *Cc:* Yogesh Vyas; user@spark.apache.org
> *Subject:* Re: Save DataFrame to Hive Table
>
>
>
> Hi
>
>
>
> It seems that your code does not specify in which database your table is
> created.
>
>
>
> Try this
>
>
>
> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> scala> // Choose a database
>
> scala> HiveContext.sql("show databases").show
>
>
>
> scala> HiveContext.sql("use test")  // I chose test database
>
> scala> HiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)")
>
> scala> HiveContext.sql("desc TableName").show
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     key|      int|   null|
> |   value|   string|   null|
> +--------+---------+-------+
>
>
>
> // create a simple DF
>
>
>
> val a = Seq((1, "Mich"), (2, "James"))
>
> val b = a.toDF
>
>
>
> // Let me keep it simple. Create a temporary table and do a simple
> insert/select. No need to complicate it.
>
>
>
> b.registerTempTable("tmp")
>
>
>
> // Remember this temporary table is created in the SQL context, NOT the
> HiveContext, so HiveContext will NOT see that table
>
> //
>
> HiveContext.sql("INSERT INTO TableName SELECT * FROM tmp")
> org.apache.spark.sql.AnalysisException: no such table tmp; line 1 pos 36
>
>
>
> // This will work
>
>
>
> sql("INSERT INTO TableName SELECT * FROM tmp")
>
>
>
> sql("select count(1) from TableName").show
> +---+
> |_c0|
> +---+
> |  2|
> +---+
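>
> As a variant (a hedged sketch, not from the original exchange; the temp table
> name tmp2 is illustrative): build the DataFrame through the HiveContext
> itself, so its temporary table is registered where HiveContext can see it:
>
> val c = HiveContext.createDataFrame(Seq((1, "Mich"), (2, "James")))
>
> c.registerTempTable("tmp2")
>
> HiveContext.sql("INSERT INTO TableName SELECT * FROM tmp2")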
>
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 1 March 2016 at 06:33, Jeff Zhang <zjf...@gmail.com> wrote:
>
> The following line does not execute the SQL, so the table is not created.
> Add .show() at the end to execute the SQL.
>
> hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)")
>
>
>
> On Tue, Mar 1, 2016 at 2:22 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>
> Hi,
>
> I have created a DataFrame in Spark, and now I want to save it directly
> into a Hive table. How do I do that?
>
> I have created the hive table using following hiveContext:
>
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
>         hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)");
>
> I am using the following to save it into hive:
> DataFrame.write().mode(SaveMode.Append).insertInto("TableName");
>
> But it gives the error:
> Exception in thread "main" java.lang.RuntimeException: Table Not
> Found: TableName
>         at scala.sys.package$.error(package.scala:27)
>         at
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
>         at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257)
>         at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:266)
>         at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264)
>         at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
>         at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
>         at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>         at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
>         at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:264)
>         at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:254)
>         at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
>         at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
>         at
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>         at scala.collection.immutable.List.foldLeft(List.scala:84)
>         at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
>         at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:916)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:916)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:917)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:921)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:921)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>         at
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:176)
>         at
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)
>         at com.honeywell.Track.combine.App.main(App.java:451)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>         at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>         at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
>
>
>
> --
>
> Best Regards
>
> Jeff Zhang
>
>
>
