Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
Sometimes, the BUF for the aggregator may depend on the actual input, and while this passes the responsibility to handle null in merge/reduce to the developer, it sounds fine to me if he is the one who put null in zero() anyway. Now, it seems that the aggregation is skipped entirely when zero() =
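A minimal sketch of the pattern being described, with hypothetical names and the 1.6-style API (the encoder overrides required by the final 2.0 Aggregator API are omitted): an aggregator whose zero() returns null because there is no natural empty buffer before the first input.

import org.apache.spark.sql.expressions.Aggregator

// Hypothetical aggregator: keeps the lexicographically smallest string.
// zero() returns null because there is no natural "empty" string to start from.
object MinString extends Aggregator[String, String, String] {
  def zero: String = null
  def reduce(b: String, a: String): String =
    if (b == null) a else if (a < b) a else b
  def merge(b1: String, b2: String): String =
    if (b1 == null) b2 else if (b2 == null) b1 else if (b1 < b2) b1 else b2
  def finish(b: String): String = b
}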

Running of Continuous Aggregation example

2016-06-26 Thread Chang Lim
Has anyone been able to run the code in "The Future of Real-Time in Spark", slide 24: "Continuous Aggregation"? Specifically, the line: stream("jdbc:mysql//..."). Using the Spark 2.0 preview build, I am getting the error when writi

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
Hi, This behaviour seems to be expected because you must ensure `b + zero() = b`. In your case, `b + null = null` breaks this rule. This is the same with v1.6.1. See: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57 // maropu
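For illustration only, one way to satisfy the identity law `b + zero() = b` for the same kind of aggregator is to use an Option buffer instead of null (a sketch, hypothetical names):

// Using None as zero preserves the identity law: merge(b, zero) == b.
object MinStringSafe extends Aggregator[String, Option[String], String] {
  def zero: Option[String] = None
  def reduce(b: Option[String], a: String): Option[String] =
    Some(b.fold(a)(x => if (a < x) a else x))
  def merge(b1: Option[String], b2: Option[String]): Option[String] = (b1, b2) match {
    case (Some(x), Some(y)) => Some(if (x < y) x else y)
    case _                  => b1.orElse(b2)
  }
  def finish(b: Option[String]): String = b.orNull
}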

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
Not sure what the rule is in the case of `b + null = null`, but the same code works perfectly in 1.6.1, just tried it. On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro wrote: > Hi, > > This behaviour seems to be expected because you must ensure `b + zero() = > b`. > In your case, `b + null = nul

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
Whatever it is, this is expected; if an initial value is null, Spark codegen removes all the aggregates. See: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199 // maropu On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
This "if (value == null)" condition you point to exists in 1.6 branch as well, so that's probably not the reason. On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro wrote: > Whatever it is, this is expected; if an initial value is null, spark > codegen removes all the aggregates. > See: > https://

RE: Logging trait in Spark 2.0

2016-06-26 Thread Paolo Patierno
Yes ... the same here ... I'd like to know the best way to add logging in a custom receiver for Spark Streaming 2.0. Paolo Patierno, Senior Software Engineer (IoT) @ Red Hat, Microsoft MVP on Windows Embedded & IoT, Microsoft Azure Advisor. Twitter: @ppatierno Linkedin: paolopatierno Blog: DevEx
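One common workaround, sketched here rather than quoted from the thread: since org.apache.spark.Logging is no longer public in 2.0, log through slf4j directly inside the receiver (class and messages below are hypothetical):

import org.slf4j.LoggerFactory
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical custom receiver that logs via slf4j instead of the removed Logging trait.
class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  def onStart(): Unit = {
    log.info("Receiver starting")
    // start a thread here that calls store(...) with received records
  }

  def onStop(): Unit = log.info("Receiver stopping")
}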

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
No, TypedAggregateExpression that uses Aggregator#zero is different between v2.0 and v1.6. v2.0: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L91 v1.6: https://github.com/apache/spark/blob/branch-1.6/sql/

add multiple columns

2016-06-26 Thread pseudo oduesp
Hi, how can I add multiple columns to a data frame? withColumn allows adding one column, but when I have multiple, do I have to loop on each column? Thanks

alter table with hive context

2016-06-26 Thread pseudo oduesp
Hi, how can I alter a table by adding new columns to it in HiveContext?

Re: add multiple columns

2016-06-26 Thread ndjido
Hi guy! I'm afraid you have to loop... The update of the Logical Plan is getting faster on Spark. Cheers, Ardo. Sent from my iPhone > On 26 Jun 2016, at 14:20, pseudo oduesp wrote: > > Hi, how can I add multiple columns to a data frame? > > withColumn allows adding one column, but when you h
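A sketch of the loop Ardo describes, folding withColumn over a list of new columns (the column names and expressions here are hypothetical):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Hypothetical new columns: name -> expression
val newCols: Seq[(String, Column)] = Seq(
  "amount_usd" -> col("amount") * 1.1,
  "year"       -> year(col("date"))
)

// Fold withColumn over the list instead of writing one call per column.
def addColumns(df: DataFrame): DataFrame =
  newCols.foldLeft(df) { case (acc, (name, expr)) => acc.withColumn(name, expr) }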

Re: add multiple columns

2016-06-26 Thread ayan guha
Can you share an example? You may want to write a SQL statement to add the columns? On Sun, Jun 26, 2016 at 11:02 PM, wrote: > Hi guy! > > I'm afraid you have to loop... The update of the Logical Plan is getting > faster on Spark. > > Cheers, > Ardo. > > Sent from my iPhone > > > On 26 Jun 2016, at 1

Re: alter table with hive context

2016-06-26 Thread Mich Talebzadeh
-- create the HiveContext scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) HiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6387fb09 -- use the test database scala> HiveContext.sql("use test") res8: org.apache.spark.sql.DataFrame
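Continuing that session, the alter itself can be issued as a Hive SQL statement through the same context (table and column names below are hypothetical):

// Add new columns to an existing Hive table.
HiveContext.sql("ALTER TABLE test.mytable ADD COLUMNS (new_col1 STRING, new_col2 INT)")
// Verify the new schema.
HiveContext.sql("DESCRIBE test.mytable").show()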

Running Java-Based Implementation of StreamingKMeans in Spark

2016-06-26 Thread Biplob Biswas
Hi, Something is wrong with my Spark subscription so I can't see the responses properly on Nabble, so I subscribed from a different id; hopefully it is solved and I am putting my question again here. I implemented the StreamingKMeans example provided on the Spark website, but in Java. The full imp
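For reference, the Scala shape of the streaming k-means setup the example is based on (the poster's version is in Java; directories and parameters below are illustrative):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Training and test streams of Vectors parsed from text files (hypothetical directories).
val trainingData = ssc.textFileStream("/tmp/training").map(Vectors.parse)
val testData     = ssc.textFileStream("/tmp/test").map(Vectors.parse)

val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)   // data dimension, initial center weight

model.trainOn(trainingData)
model.predictOn(testData).print()

ssc.start()
ssc.awaitTermination()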

What is the explanation of "ConvertToUnsafe" in "Physical Plan"

2016-06-26 Thread Mich Talebzadeh
Hi, In Spark's Physical Plan, what is the explanation for ConvertToUnsafe? Example: scala> sorted.filter($"prod_id" === 13).explain == Physical Plan == Filter (prod_id#10L = 13) +- Sort [prod_id#10L ASC,cust_id#11L ASC,time_id#12 ASC,channel_id#13L ASC,promo_id#14L ASC], true, 0 +- ConvertToUns

Spark 1.6.1: Unexpected partition behavior?

2016-06-26 Thread Randy Gelhausen
val enriched_web_logs = sqlContext.sql(""" select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as source_host, log from web_logs left outer join (select distinct node, address from nodes) b on source_ip = address """) enriched_web_logs.coalesce(1).write.format("parquet").mode("o

Re: Spark 1.6.1: Unexpected partition behavior?

2016-06-26 Thread Randy Gelhausen
Sorry, please ignore the above. I now see I called coalesce on a different reference than the one I used to register the table. On Sun, Jun 26, 2016 at 6:34 PM, Randy Gelhausen wrote: > > val enriched_web_logs = sqlContext.sql(""" > select web_logs.datetime, web_logs.node as app_host, source_ip, b.no
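For readers who hit the same thing: coalesce returns a new DataFrame rather than changing the original, so the coalesced reference is the one that must be written or registered (a sketch, hypothetical table and path):

// coalesce returns a new DataFrame; the original keeps its partitioning.
val enriched  = sqlContext.sql("select * from web_logs")
val coalesced = enriched.coalesce(1)

// Write (or register) the coalesced reference, not the original one.
coalesced.write.format("parquet").mode("overwrite").save("/tmp/enriched_web_logs")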

Re: Spark 2.0 Streaming and Event Time

2016-06-26 Thread Chang Lim
Here is an update to my question: Tathagata Das (Jun 9, to me): Event time is part of the windowed aggregation API. See my slides: https://www.slideshare.net/mobile/databricks/a-deep-dive-into-structured-streaming Let me know if it helps you to find it. Keeping it short as I am o
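The windowed-aggregation API being referred to looks roughly like this in 2.0 structured streaming (a sketch under the assumption of a socket source with timestamps; names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("windowed-counts").getOrCreate()
import spark.implicits._

// Socket source that attaches an ingest timestamp to each line.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Event-time aggregation: count records per 10-minute window of the timestamp column.
val windowedCounts = lines
  .groupBy(window($"timestamp", "10 minutes"))
  .count()

val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()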

Re: problem running spark with yarn-client not using spark-submit

2016-06-26 Thread sychungd
Hi, Thanks for the reply. It's a Java web service that resides in a JBoss container. HY Chung Best regards, S.Y. Chung 鍾學毅 F14MITD Taiwan Semiconductor Manufacturing Company, Ltd. Tel: 06-5056688 Ext: 734-6325

Re: problem running spark with yarn-client not using spark-submit

2016-06-26 Thread Saisai Shao
It means several jars are missing in the YARN container environment. If you want to submit your application some other way besides spark-submit, you have to take care of all the environment setup yourself. Since we don't know the implementation of your Java web service, it is hard to provide

How to convert a Random Forest model built in R to a similar model in Spark

2016-06-26 Thread Neha Mehta
Hi All, Requesting help with the problem mentioned in the mail below. I have an existing random forest model in R which needs to be deployed on Spark. I am trying to recreate the model in Spark but am facing the problem mentioned below. Thanks, Neha On Jun 24, 2016 5:10 PM, wrote: > > Hi Sun, > > I am try
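Without the details of the R model, a generic spark.ml random-forest pipeline sketch that porting efforts typically start from (the input DataFrame, column names, and parameters are hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Hypothetical input: a DataFrame `df` with numeric feature columns f1..f3 and a string label.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(100)

val model = new Pipeline().setStages(Array(assembler, labelIndexer, rf)).fit(df)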

Unsubscribe

2016-06-26 Thread kchen
Unsubscribe

Difference between Dataframe and RDD Persisting

2016-06-26 Thread Brandon White
What is the difference between persisting a DataFrame and an RDD? When I persist my RDD, the UI says it takes 50G or more of memory. When I persist my DataFrame, the UI says it takes 9G or less of memory. Does the DataFrame not persist the actual content? Is it better / faster to persist an RDD when

Re: Spark Thrift Server Concurrency

2016-06-26 Thread Prabhu Joseph
Spark Thrift Server is started with ./sbin/start-thriftserver.sh --master yarn-client --hiveconf hive.server2.thrift.port=10001 --num-executors 4 --executor-cores 2 --executor-memory 4G --conf spark.scheduler.mode=FAIR 20 parallel queries like the one below are executed: select distinct val2 from philips1 whe

Re: Difference between Dataframe and RDD Persisting

2016-06-26 Thread Jörn Franke
DataFrame uses a more efficient binary representation to store and persist data. You should go for that one in most cases. RDD is slower. > On 27 Jun 2016, at 07:54, Brandon White wrote: > > What is the difference between persisting a DataFrame and an RDD? When I > persist my RDD, the UI
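A small sketch of the comparison being discussed: persist the same data once as an RDD and once as a DataFrame, then compare the reported sizes in the UI's Storage tab (the path is hypothetical):

import org.apache.spark.storage.StorageLevel

// RDD persist: rows stored as deserialized Java objects.
val rdd = sc.textFile("/tmp/data.csv")
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()

// DataFrame persist: rows stored in Spark SQL's compact columnar format.
val df = sqlContext.read.text("/tmp/data.csv")
df.persist(StorageLevel.MEMORY_ONLY)
df.count()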