You might first construct the query statement with "union all", like below:
scala> val query = "select * from dfv1 union all select * from dfv2 union all select * from dfv3"
query: String = select * from dfv1 union all select * from dfv2 union all select * from dfv3

then write a loop to register each partition as a view, like below:

// register one temp view per partition DataFrame
// (the same df is reused here only for illustration)
for (i <- 1 to 3) {
  df.createOrReplaceTempView("dfv" + i)
}

scala> spark.sql(query).explain
== Physical Plan ==
Union
:- LocalTableScan [_1#0, _2#1, _3#2]
:- LocalTableScan [_1#0, _2#1, _3#2]
+- LocalTableScan [_1#0, _2#1, _3#2]

You can also use ROLLUP or GROUPING SETS over multiple dimensions to replace "union" or "union all" (see the sketches after the quoted message below).

On Tue, Nov 20, 2018 at 8:34 PM onmstester onmstester <onmstes...@zoho.com.invalid> wrote:

> I'm using Spark SQL to query Cassandra tables. In Cassandra, I've
> partitioned my data by a time bucket and one id, so based on the queries I
> need to union multiple partitions with Spark SQL and do the
> aggregations/group-by on the union result, something like this:
>
> for (all cassandra partitions) {
>     Dataset<Row> currentPartition = sqlContext.sql(....);
>     unionResult = unionResult.union(currentPartition);
> }
>
> Increasing the input (the number of loaded partitions) increases response
> time more than linearly, because the unions are done sequentially.
>
> Because there is no harm in doing the unions in parallel, and I don't know
> how to force Spark to do them in parallel, right now I'm using a ThreadPool
> to asynchronously load all partitions in my application (which may cause
> OOM), and somehow do the sort or simple group-by in Java (which makes me
> wonder why I'm using Spark at all?)
>
> The short question is: how do I force Spark SQL to load Cassandra
> partitions in parallel while doing a union on them? Also, I don't want too
> many tasks in Spark; with my home-made async solution, I use coalesce(1) so
> one task is very fast (only wait time on Cassandra).
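To make the "union all" approach above concrete for an arbitrary number of partitions, here is a minimal, self-contained sketch that registers one view per per-partition DataFrame, builds the union-all statement programmatically, and aggregates over the result in a single query. The DataFrames, the column names (id, bucket, value), and the aggregation are placeholders for illustration, not from the original thread; in the real case each DataFrame would be loaded from a different Cassandra partition.

import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionAllSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-all-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder per-partition DataFrames (one per time bucket / id in practice).
    val partitionDfs: Seq[DataFrame] = Seq(
      Seq((1, "a", 10), (2, "b", 20)).toDF("id", "bucket", "value"),
      Seq((3, "c", 30)).toDF("id", "bucket", "value"),
      Seq((4, "d", 40)).toDF("id", "bucket", "value")
    )

    // Register one temp view per partition, then build a single "union all" query
    // so Spark plans all the scans together instead of unioning them one by one.
    partitionDfs.zipWithIndex.foreach { case (df, i) =>
      df.createOrReplaceTempView("dfv" + (i + 1))
    }
    val query = (1 to partitionDfs.size)
      .map(i => s"select * from dfv$i")
      .mkString(" union all ")

    // Aggregate over the union result in the same statement.
    spark.sql(s"select id, sum(value) as total from ($query) u group by id").show()

    spark.stop()
  }
}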
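For the ROLLUP / GROUPING SETS suggestion, here is a minimal sketch of how several per-dimension aggregations that would otherwise be written as separate queries glued together with "union all" can be expressed as one GROUP BY. The view name (events) and the columns (bucket, id, value) are assumptions for illustration only.

import org.apache.spark.sql.SparkSession

object GroupingSetsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("grouping-sets-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder data; columns (bucket, id, value) are assumptions.
    Seq(("2018-11-01", 1, 10), ("2018-11-01", 2, 20), ("2018-11-02", 1, 30))
      .toDF("bucket", "id", "value")
      .createOrReplaceTempView("events")

    // One GROUPING SETS query replaces a "union all" of three separate
    // aggregations: group by (bucket, id), group by (bucket), and a grand total.
    spark.sql("""
      select bucket, id, sum(value) as total
      from events
      group by bucket, id
      grouping sets ((bucket, id), (bucket), ())
    """).show()

    spark.stop()
  }
}

The result contains, in one job, the rows that would otherwise come from unioning the three aggregations, with the finer-grained columns left null in the coarser rows.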