Re: When will Spark Streaming support Kafka's simple consumer API?

2015-02-05 Thread Xuelin Cao
…rpose. JIRA - https://issues.apache.org/jira/browse/SPARK-4964 > Can you elaborate on why you have to use SimpleConsumer in your environment? > TD > On Wed, Feb 4, 2015 at 7:44 PM, Xuelin Cao wrote:

When will Spark Streaming support Kafka's simple consumer API?

2015-02-04 Thread Xuelin Cao
Hi, In our environment, Kafka can only be used with the simple consumer API, like the Storm spout does. Also, I found suggestions that the "Kafka connector of Spark should not be used in production because it is based on the high-level

Can Spark provide an option to start the reduce stage early?

2015-02-02 Thread Xuelin Cao
In Hadoop MR, there is an option *mapred.reduce.slowstart.completed.maps* which can be used to start the reduce stage when X% of mappers are completed. By doing this, the data shuffling process can run in parallel with the map process. In a large multi-tenancy cluster, this option is usually turned off.
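As a hedged illustration of the knob being discussed (the property name comes from the message above; the 0.80 value is invented for the example), this is roughly how it would appear in a Hadoop `mapred-site.xml`:

```xml
<!-- mapred-site.xml: launch reducers once 80% of map tasks have finished.
     Setting this to 1.00 effectively turns slowstart off (reducers wait
     for all mappers), which is the multi-tenant tuning described above. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```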

Re: Will Spark-SQL support a vectorized query engine someday?

2015-01-20 Thread Xuelin Cao
…compression encoding. For example, one can turn string comparisons into integer comparisons. These will probably give much larger performance improvements in common queries. > On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao wrote: >> Hi, >> Cor

Will Spark-SQL support a vectorized query engine someday?

2015-01-19 Thread Xuelin Cao
Hi, Correct me if I'm wrong. It looks like the current version of Spark-SQL uses a *tuple-at-a-time* model. Basically, each time the physical operator produces a tuple by recursively calling child->execute(). There are papers that illustrate the benefits of a vectorized query engine. And Hiv
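To make the contrast concrete, here is a minimal, self-contained Scala sketch of the two execution models over a plain array. All names (`ExecModels`, `tupleAtATime`, `vectorized`) are invented for this illustration; neither function is Spark SQL's actual operator API.

```scala
// Illustrative only: contrasts tuple-at-a-time evaluation with a
// batched ("vectorized") evaluation. Query: filter even values,
// double them, and sum the result.
object ExecModels {
  val data: Array[Int] = Array.tabulate(1000)(identity)

  // Tuple-at-a-time: each loop iteration pulls one value through the
  // whole operator chain (filter, then project), Volcano-iterator style.
  def tupleAtATime(rows: Array[Int]): Int = {
    var sum = 0
    var i = 0
    while (i < rows.length) {
      val row = rows(i)          // "next()" on the child operator
      if (row % 2 == 0)          // filter operator
        sum += row * 2           // project + aggregate
      i += 1
    }
    sum
  }

  // Vectorized: each operator processes a whole batch of values before
  // handing it to its parent, which improves cache locality and lets
  // the compiler emit tight, branch-light inner loops.
  def vectorized(rows: Array[Int], batchSize: Int = 256): Int = {
    rows.grouped(batchSize).map { batch =>
      val filtered = batch.filter(_ % 2 == 0)   // filter on the batch
      filtered.map(_ * 2).sum                   // project + partial sum
    }.sum
  }
}
```

Both functions compute the same answer; the point of vectorization is that the batched inner loops are much cheaper per tuple.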

Re: When will Spark support "push"-style shuffle?

2015-01-07 Thread Xuelin Cao
Got it. The explanation makes sense. Thank you. On Thu, Jan 8, 2015 at 1:06 PM, Patrick Wendell [via Apache Spark Developers List] wrote: > This question is conflating a few different concepts. I think the main question is whether Spark will have a shuffle implementation that streams data rathe

When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao
Hi, In the Spark SQL help document, it says "Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL. - Block level bitmap indexes and virtual columns (used to build indexes)" For our
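To illustrate the idea behind the block-level bitmap indexes mentioned in that quote, here is a hypothetical, self-contained Scala sketch: for each distinct column value, record which *blocks* (not which rows) contain it, so a point query can skip whole blocks. This is not Spark SQL code; all names here are invented for the illustration.

```scala
import scala.collection.immutable.BitSet

object BlockBitmapIndex {
  val blockSize = 4
  // A column of values, conceptually split into fixed-size blocks.
  val column = Vector("a", "a", "b", "b",   // block 0
                      "c", "c", "c", "c",   // block 1
                      "a", "d", "d", "d")   // block 2

  // value -> bitset of block ids containing at least one match
  val index: Map[String, BitSet] =
    column.zipWithIndex
      .groupBy(_._1)
      .map { case (v, hits) =>
        v -> BitSet(hits.map(_._2 / blockSize): _*)
      }

  // Scan only the blocks whose bit is set; all other blocks are skipped.
  def lookup(v: String): Seq[Int] =
    index.getOrElse(v, BitSet.empty).toSeq.flatMap { blk =>
      (blk * blockSize until math.min((blk + 1) * blockSize, column.size))
        .filter(column(_) == v)
    }
}
```

For example, `lookup("c")` touches only block 1 and never reads blocks 0 or 2, which is the I/O-skipping benefit the index exists to provide.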

Re: Why does Executor Deserialize Time take more than 300ms?

2014-11-22 Thread Xuelin Cao
Thanks Imran, The problem is, *every time* I run the same task, the deserialization time is around 300~500ms. I don't know if this is a normal case.

Why does Executor Deserialize Time take more than 300ms?

2014-11-22 Thread Xuelin Cao
In our experimental cluster (1 driver, 5 workers), we tried the simplest example: sc.parallelize(Range(0, 100), 2).count. In the event log, we found the executor takes too much time on deserialization, about 300~500ms, while the execution time is only 1ms. Our servers have 2.3G Hz CPU *
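As a rough, self-contained way to sanity-check deserialization cost outside Spark, one can time plain JDK serialization of a tiny task-like object. This is only in the spirit of the "Executor Deserialize Time" metric, not Spark's actual task deserialization path; `FakeTask` and `roundTrip` are invented for the sketch, and absolute numbers will vary by machine (JIT warmup and classloading typically dominate the first measurement).

```scala
import java.io._

// A tiny stand-in for a serialized task description.
case class FakeTask(partition: Int, data: Array[Int])

object DeserTiming {
  // Serialize the task, then time only the deserialization step,
  // returning the restored object and the elapsed nanoseconds.
  def roundTrip(task: FakeTask): (FakeTask, Long) = {
    val bytes = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(task)
      oos.close()
      bos.toByteArray
    }
    val start = System.nanoTime()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val restored = in.readObject().asInstanceOf[FakeTask]
    (restored, System.nanoTime() - start)
  }
}
```

Running `roundTrip` repeatedly in one JVM shows whether a few hundred milliseconds is intrinsic to the payload or a per-JVM warmup effect.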

Why does Executor Deserialize Time take more than 300ms?

2014-11-21 Thread Xuelin Cao
In our experimental cluster (1 driver, 5 workers), we tried the simplest example: sc.parallelize(Range(0, 100), 2).count. In the event log, we found the executor takes too much time on deserialization, about 300~500ms, while the execution time is only 1ms. Our servers have 2.3G Hz CPU * 24