Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-11 Thread fightf...@163.com
Hi,

We still have no adequate solution for this issue. Any analysis hints or
pointers would be appreciated.

Thanks,
Sun.



fightf...@163.com
 
From: fightf...@163.com
Date: 2015-02-09 11:56
To: user; dev
Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for 
large data sets
Hi,
The problem still exists. Could any experts take a look at this? 

Thanks,
Sun.



fightf...@163.com
 
From: fightf...@163.com
Date: 2015-02-06 17:54
To: user; dev
Subject: Sort Shuffle performance issues about using AppendOnlyMap for large 
data sets
Hi, all
Recently we hit performance issues when using Spark 1.2.0 to read data from 
HBase and do some summary work.
Our scenario is: read large data sets from HBase (maybe 100G+), form an 
HBaseRDD, transform it to a SchemaRDD, group by and aggregate the data to 
produce much smaller summary data sets, and load the results into HBase 
(Phoenix).
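
For reference, a rough sketch of this kind of pipeline on Spark 1.2 era APIs. 
The table, column family and column names below are made up for illustration 
(they are not our real schema), and the final Phoenix write step is omitted:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(key: String, metric: Long)

object SummaryJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-summary"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD          // RDD[Record] -> SchemaRDD (Spark 1.2)

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "big_table")   // illustrative table name

    // 1. Read the large HBase table as an RDD of (rowkey, Result).
    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // 2. Project the needed columns into a case-class RDD and register it as a table.
    val records = hbaseRDD.map { case (_, result) =>
      Record(
        Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("k"))),
        Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("m"))))
    }
    records.registerTempTable("records")

    // 3. Group by / aggregate into a much smaller summary set; this shuffle
    //    stage is where the AppendOnlyMap.growTable hotspot shows up.
    val summary = sqlContext.sql(
      "SELECT key, SUM(metric) AS total FROM records GROUP BY key")

    // 4. Load the summary back into HBase via Phoenix (writer omitted here).
    summary.collect().foreach(println)
  }
}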

Our major issue: aggregating the large data sets into the summary data sets 
takes too long (1 hour+), which seems much worse than it should be. We have 
attached the dump file, and the jstack stack traces look like the following:

From the stack trace and dump file we can see that processing large data sets 
causes the AppendOnlyMap to grow frequently, leading to a huge number of map 
entries. We looked at the source code of 
org.apache.spark.util.collection.AppendOnlyMap and found that the map is 
initialized with a capacity of 64, which seems too small for our use case.
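
To make the growth cost concrete, a rough back-of-the-envelope sketch; the load 
factor and the per-task record count here are assumptions, not measurements 
from our job:

// Not Spark's actual code: an AppendOnlyMap-style open hash table doubles its
// capacity when it passes a load-factor threshold and rehashes every existing
// entry on each growTable() call. Assuming a 0.7 load factor and ~100 million
// distinct keys kept by one task, growing from the hard-coded 64 slots takes
// roughly twenty grow-and-rehash passes, and the cumulative rehash work is a
// small constant multiple of the final entry count.
object GrowthCost {
  def main(args: Array[String]): Unit = {
    val records    = 100000000L   // assumed distinct keys per task
    val loadFactor = 0.7          // assumed threshold; Spark's may differ
    var capacity   = 64L
    var grows      = 0
    while (capacity * loadFactor < records) { capacity *= 2; grows += 1 }
    println(s"final capacity = $capacity after $grows growTable() calls")
  }
}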

So the question is: has anyone encountered such issues before, and how were 
they resolved? I cannot find any JIRA issues for this problem; if someone has 
seen one, please kindly let us know.

More specifically: is there any way for a user to configure the map's initial 
capacity in Spark? If so, please tell us how to achieve that.

Best Thanks,
Sun.

   Thread 22432: (state = IN_JAVA)
- org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame)
- org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame)
- org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame)
- org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame)
- org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame)
- org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
- org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)


Thread 22431: (state = IN_JAVA)
- [identical stack trace to Thread 22432 above; truncated in the archive]

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread fightf...@163.com
Hi, Patrick

Really glad to get your reply.
Yes, we are doing group by operations in our job. We understand that growTable 
is common when processing large data sets.

The real question is: is there any way for us to set the initialCapacity 
ourselves, specifically for our application? Does Spark provide a config for 
achieving that?

We know this may be tricky to get working. We just want to know how it could 
be resolved, or whether there is some other channel we have not covered.

Looking forward to your advice.

Thanks,
Sun.



fightf...@163.com
 
From: Patrick Wendell
Date: 2015-02-12 16:12
To: fightf...@163.com
CC: user; dev
Subject: Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for 
large data sets
The map will start with a capacity of 64, but will grow to accommodate
new data. Are you using the groupBy operator in Spark or are you using
Spark SQL's group by? This usually happens if you are grouping or
aggregating in a way that doesn't sufficiently condense the data
created from each input partition.
 
- Patrick
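
To illustrate the distinction concretely, a minimal sketch with made-up data 
(this is not code from the thread; the tiny pair RDD simply stands in for the 
real HBase-derived records):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.2

object CondenseExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("condense-example"))
    // Tiny made-up pair data standing in for the real records.
    val pairs = sc.parallelize(Seq(("a", 1L), ("a", 2L), ("b", 3L)))

    // groupByKey shuffles every individual value: nothing is condensed on the
    // map side, so the shuffle buffers the raw data for every key.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side, so each map task keeps only one running
    // value per distinct key before the shuffle.
    val viaReduce = pairs.reduceByKey(_ + _)

    viaGroup.collect().foreach(println)
    viaReduce.collect().foreach(println)
  }
}

With reduceByKey the map-side combine is what keeps each task's in-memory map 
at one entry per distinct key, which is the condensation described above.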
 
On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com  wrote:
> [quoted copy of the original message and jstack traces omitted; see the
> thread above. The quote was truncated in the archive.]

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-12 Thread fightf...@163.com
Hi there,
Which version are you using? Actually, the problem seems to have gone away 
after we changed our Spark version from 1.2.0 to 1.3.0.

Not sure which internal changes made the difference.

Best,
Sun.



fightf...@163.com
 
From: Night Wolf
Date: 2015-05-12 22:05
To: fightf...@163.com
CC: Patrick Wendell; user; dev
Subject: Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for 
large data sets
Seeing similar issues here. Did you find a solution? One option would be to 
increase the number of partitions if you're doing lots of object creation.
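
Purely as an illustration of that suggestion; the partition counts below are 
placeholders, not tuned recommendations:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext

object MorePartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("more-partitions"))
    val sqlContext = new SQLContext(sc)
    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L)))   // placeholder data

    // RDD level: pass an explicit partition count so each shuffle task
    // handles a smaller slice of the keys.
    val summary = pairs.reduceByKey(_ + _, 2000)            // 2000 is a placeholder

    // Spark SQL level: raise the shuffle partition count used by GROUP BY.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

    summary.collect().foreach(println)
  }
}

More partitions means fewer keys per shuffle task, so each task's in-memory 
map stays smaller.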

On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com  wrote:
[quoted copies of the earlier messages in this thread omitted; see above. The 
quote was truncated in the archive.]

OLAP query using spark dataframe with cassandra

2015-11-08 Thread fightf...@163.com
Hi, community

We are especially interested in this kind of feature integration, based on some 
slides from [1]. The SMACK stack (Spark + Mesos + Akka + Cassandra + Kafka) 
seems like a good implementation of the lambda architecture in the open-source 
world, especially for non-Hadoop cluster environments. As we see it, the 
advantages are:

1. the feasibility and scalability of the Spark DataFrame API, which also makes 
a good complement to Apache Cassandra's native CQL features;

2. availability of both streaming and batch processing on one stack, which is 
great;

3. compactness and usability of Spark with Cassandra, including fairly seamless 
integration with job scheduling and resource management.

Our only concern is the OLAP query performance, which is mainly driven by 
frequent aggregation work over large, daily-growing tables, for both Spark SQL 
and Cassandra. I can see that the use case in [1] uses FiloDB to get columnar 
storage and better query performance, but we have no further knowledge of it.
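
For concreteness, a minimal sketch of the kind of daily roll-up we have in 
mind, using the Spark Cassandra Connector's DataFrame source (Spark 1.4+ / 
connector 1.4+ style API). The host, keyspace, table and column names are 
invented for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object DailyOlapQuery {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("daily-olap")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read a Cassandra table through the connector's DataFrame source.
    val events = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "metrics", "table" -> "events"))  // invented names
      .load()

    // The kind of daily aggregation whose cost concerns us: roll the
    // ever-growing fact table up into a small summary.
    val daily = events
      .groupBy("day", "dimension")
      .agg(sum("value").as("total"))

    daily.show()
  }
}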

The question is: does anyone have such a use case today, especially in a 
production environment? We would be interested in your architecture for 
designing such an OLAP engine using Spark + Cassandra. How do you think this 
scenario compares with a traditional OLAP cube design, like Apache Kylin or 
Pentaho Mondrian?

Best Regards,

Sun.


[1]  
http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark



fightf...@163.com


Re: Re: OLAP query using spark dataframe with cassandra

2015-11-08 Thread fightf...@163.com
Hi, 

Thanks for the suggestion. We are currently evaluating and stress-testing 
Spark SQL on Cassandra while trying to define our business models. FWIW, the 
solution mentioned here is different from a traditional OLAP cube engine, 
right? So we are hesitating over which general direction to choose for our 
OLAP architecture.

We would be happy to hear about more use cases from this community.

Best,
Sun. 



fightf...@163.com
 
From: Jörn Franke
Date: 2015-11-09 14:40
To: fightf...@163.com
CC: user; dev
Subject: Re: OLAP query using spark dataframe with cassandra

Is there any distributor supporting these software components in combination? 
If no and your core business is not software then you may want to look for 
something else, because it might not make sense to build up internal know-how 
in all of these areas.

In any case - it depends all highly on your data and queries. You will have to 
do your own experiments.

On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:

[quoted copy of the original message omitted; see above]


Re: Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread fightf...@163.com
Hi,

Based on my experience, I would recommend option 3), Apache Kylin, for your 
requirements. This is a suggestion from within the open-source world.

As for Cassandra, I take your point about needing dedicated support. But the 
community is very open and responsive.



fightf...@163.com
 
From: tsh
Date: 2015-11-10 02:56
To: fightf...@163.com; user; dev
Subject: Re: OLAP query using spark dataframe with cassandra
Hi,

I'm in the same position right now: we are going to implement something like 
OLAP BI + machine learning explorations on the same cluster.
The question is quite ambivalent: on the one hand, we have terabytes of varied 
data and the need to build something like cubes (Hive and Hive on HBase are 
unsatisfactory). On the other hand, our users are accustomed to Tableau + 
Vertica. 
So, right now I am considering the following choices:
1) Platfora (not free, I don't know the price right now) + Spark
2) AtScale + Tableau (not free, I don't know the price right now) + Spark
3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some storage
4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka + Flume 
(has somebody used it in production?)
5) Spark + Tableau (cubes?)

For myself, I decided not to dive into Mesos. Cassandra is hard to configure; 
you'll have to dedicate a special employee to support it.

I'll be glad to hear other ideas & propositions as we are at the beginning of 
the process too.

Sincerely yours, Tim Shenkao

On 11/09/2015 09:46 AM, fightf...@163.com wrote:
[quoted copies of the earlier messages in this thread omitted; see above]



Re: Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread fightf...@163.com
Hi, 

Have you ever considered Cassandra as a replacement? Our usage is now almost 
the same as your engine's, e.g. using MySQL to store the initially aggregated 
data. Can you share more about your kind of cube queries? We are very 
interested in that architecture too : )

Best,
Sun.


fightf...@163.com
 
From: Andrés Ivaldi
Date: 2015-11-10 07:03
To: tsh
CC: fightf...@163.com; user; dev
Subject: Re: OLAP query using spark dataframe with cassandra
Hi,
I'm also considering something similar. Plain Spark is too slow for my case. A 
possible solution is to use Spark as a multiple-source connector and basic 
transformation layer, then persist the information (currently into an RDBMS); 
after that our engine builds a kind of cube query on top, and the result is 
processed again by Spark to add machine learning.
Our missing part is to replace the RDBMS with something more suitable and 
scalable. We don't mind pre-processing the information, as long as the queries 
are fast after pre-processing.
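
A rough sketch of that flow, assuming Spark 1.4+ DataFrame APIs; the JDBC URL, 
credentials, paths, table and column names are all made up for illustration:

import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object SparkToRdbms {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-to-rdbms"))
    val sqlContext = new SQLContext(sc)

    // 1. Spark as the multi-source connector / transformation layer: read raw
    //    data (a JSON dump here, standing in for the real sources).
    val raw = sqlContext.read.json("hdfs:///data/raw/events.json")   // placeholder path

    // 2. Basic transformation: pre-aggregate into the shape the cube engine needs.
    val preAgg = raw.groupBy("day", "dimension").agg(sum("value").as("total"))

    // 3. Persist into the RDBMS that the cube engine queries today
    //    (requires the JDBC driver on the classpath).
    val props = new Properties()
    props.setProperty("user", "cube")        // placeholder credentials
    props.setProperty("password", "secret")
    preAgg.write.jdbc("jdbc:mysql://db-host:3306/olap", "pre_agg_daily", props)
  }
}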

Regards

On Mon, Nov 9, 2015 at 3:56 PM, tsh  wrote:
[quoted copies of the earlier messages in this thread omitted; see above]




-- 
Ing. Ivaldi Andres