You might first write code to construct a query statement with "union
all", like below:

scala> val query = "select * from dfv1 union all select * from dfv2 union all select * from dfv3"
query: String = select * from dfv1 union all select * from dfv2 union all select * from dfv3

then write a loop to register each partition as a temp view, like below:
for (i <- 1 to 3) {
  df.createOrReplaceTempView("dfv" + i)   // registers df under the view names dfv1, dfv2, dfv3
}
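
If the number of partitions is not fixed at three, the same idea can be
generalized. A minimal sketch, assuming the per-partition DataFrames are
already collected in a Seq (the names partitionDfs, df1, df2, df3 are
hypothetical):

// Hypothetical: the DataFrames loaded for each Cassandra partition.
val partitionDfs: Seq[org.apache.spark.sql.DataFrame] = Seq(df1, df2, df3)

// Register each partition as a temp view dfv1 .. dfvN.
partitionDfs.zipWithIndex.foreach { case (partDf, i) =>
  partDf.createOrReplaceTempView("dfv" + (i + 1))
}

// Build "select * from dfv1 union all select * from dfv2 union all ..." for N views.
val query = (1 to partitionDfs.size)
  .map(i => s"select * from dfv$i")
  .mkString(" union all ")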

scala> spark.sql(query).explain
== Physical Plan ==
Union
:- LocalTableScan [_1#0, _2#1, _3#2]
:- LocalTableScan [_1#0, _2#1, _3#2]
+- LocalTableScan [_1#0, _2#1, _3#2]
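
The same plan can be produced without building a SQL string, by folding the
DataFrames with the DataFrame union API (reusing the hypothetical
partitionDfs from the sketch above); Spark plans one Union over all the
scans, so they should run as tasks of a single job rather than as separate
sequential jobs:

// Fold all partition DataFrames into one logical Union node.
val unioned = partitionDfs.reduce(_ union _)
unioned.explain()   // should also show a single Union over the individual scans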


You can also use ROLLUP or GROUPING SETS over multiple dimensions to replace
"union" or "union all".

On Tue, Nov 20, 2018 at 8:34 PM onmstester onmstester
<onmstes...@zoho.com.invalid> wrote:

> I'm using Spark-Sql to query Cassandra tables. In Cassandra, I've
> partitioned my data with a time bucket and an id, so based on the queries I
> need to union multiple partitions with spark-sql and do the
> aggregations/group-by on the union result, something like this:
>
> for (all cassandra partitions) {
>     Dataset<Row> currentPartition = sqlContext.sql(....);
>     unionResult = unionResult.union(currentPartition);
> }
>
> Increasing the input (number of loaded partitions) increases the response
> time more than linearly because the unions are done sequentially.
>
> Because there is no harm in doing the unions in parallel, and I don't know
> how to force Spark to do them in parallel, right now I'm using a ThreadPool
> to asynchronously load all partitions in my application (which may cause
> OOM), and somehow do the sort or simple group-by in Java (which makes me
> wonder why I'm using Spark at all?).
>
> The short question is: how do I force spark-sql to load Cassandra partitions
> in parallel while doing a union on them? Also, I don't want too many tasks
> in Spark; with my home-made async solution I use coalesce(1), so each task
> is very fast (only the wait time on Cassandra).
>
