Re: Potential bug in broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Hi, Thanks a lot. The problem is not doing a non-equal join on large tables; in fact, one table is really small and the other is huge. The problem is that Spark can only get the correct size for a DataFrame created directly from a Hive table. Even if we create a DataFrame from a local collection, it uses d
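
A minimal sketch of how those size estimates can be inspected. The DataFrames, the Hive table name, and the local[*] master are hypothetical, and the 1.4/1.5-era queryExecution.analyzed.statistics developer API is assumed:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object SizeEstimateSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("size-estimates").setMaster("local[*]"))
        val sqlContext = new HiveContext(sc)
        import sqlContext.implicits._

        // df1 is built from a local collection: the planner only has a default size estimate.
        val df1 = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "tag")
        // df2 comes from a Hive table (hypothetical name): its estimate comes from the table's metadata.
        val df2 = sqlContext.table("some_hive_table")

        println(df1.queryExecution.analyzed.statistics.sizeInBytes)
        println(df2.queryExecution.analyzed.statistics.sizeInBytes)
      }
    }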

Reply: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
We are using MR-based bulk loading on Spark. For filter pushdown, Astro does partition pruning and scan-range pruning, and uses Gets as much as possible. Thanks, From: Ted Malaska [mailto:ted.mala...@cloudera.com] Sent: August 12, 2015 9:14 To: Yan Zhou.sc Cc: dev@spark.apache.org; Bing Xiao (Bing); Te
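
For reference, an MR-style bulk load is commonly driven from Spark by writing sorted HFiles and then handing them to the bulk loader. The sketch below is not Astro's code; the column family, qualifier, and staging path are made up:

    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    object HFileBulkLoadSketch {
      def main(args: Array[String]): Unit = {
        val sc   = new SparkContext(new SparkConf().setAppName("hbase-bulk-load"))
        val conf = HBaseConfiguration.create()

        // Hypothetical input: (rowKey, value) pairs for column family "cf", qualifier "q".
        val data = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

        // HFileOutputFormat2 expects cells in row-key order, so sort before writing.
        val cells = data.sortByKey().map { case (row, value) =>
          val kv = new KeyValue(Bytes.toBytes(row), Bytes.toBytes("cf"),
                                Bytes.toBytes("q"), Bytes.toBytes(value))
          (new ImmutableBytesWritable(Bytes.toBytes(row)), kv)
        }

        // Write HFiles to a staging directory; LoadIncrementalHFiles (completebulkload)
        // then moves them into the target table's regions.
        cells.saveAsNewAPIHadoopFile("/tmp/hbase-staging", classOf[ImmutableBytesWritable],
          classOf[KeyValue], classOf[HFileOutputFormat2], conf)
      }
    }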

RE: Potential bug in broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread Cheng, Hao
Firstly, spark.sql.autoBroadcastJoinThreshold only works for the EQUAL JOIN. Currently, for a non-equal join, if the join type is INNER it will be done as a CartesianProduct join, and BroadcastNestedLoopJoin handles the outer joins. In the BroadcastNestedLoopJoin, the table with
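
A minimal sketch of the planning behavior described above, using toy DataFrames and a local[*] master; on a 1.4/1.5-era build, explain() should show CartesianProduct for the inner case and BroadcastNestedLoopJoin for the outer one:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object NonEquiJoinPlans {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("non-equi-join").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df1 = sc.parallelize(1 to 5).toDF("a")
        val df2 = sc.parallelize(1 to 5).toDF("b")

        // Non-equi INNER join: planned as a CartesianProduct plus a filter.
        df1.join(df2, $"a" < $"b").explain()

        // Non-equi OUTER join: planned as a BroadcastNestedLoopJoin.
        df1.join(df2, $"a" < $"b", "left_outer").explain()
      }
    }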

RE: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
There are a number of ways to bulk load. There is bulk put, partition bulk put, MR bulk load, and now HBASE-14150, which is Spark-shuffle bulk load. Let me know if I have missed a bulk-loading option. All of these are possible with the new hbase-spark module. As for the filter pushdown discussion in the

RE: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
No, Astro's bulk loader does not use its own shuffle. But the map/reduce-side processing is somewhat different from HBase's bulk loader that is used by many HBase apps, I believe. From: Ted Malaska [mailto:ted.mala...@cloudera.com] Sent: Wednesday, August 12, 2015 8:56 AM To: Yan Zhou.sc Cc: dev@spark.

RE: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
The bulk load code is 14150 if you are interested. Let me know how it can be made faster. It's just a Spark shuffle and writing HFiles. Unless Astro wrote its own shuffle, the times should be very close. On Aug 11, 2015 8:49 PM, "Yan Zhou.sc" wrote: > Ted, > > > > Thanks for pointing out more det

RE: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
Ted, Thanks for pointing out more details of HBASE-14181. I am afraid I may still need to learn more before I can make very accurate and pointed comments. As for filter pushdown, Astro has a powerful approach that basically breaks down arbitrarily complex logical expressions comprising AND/OR/IN
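
Not Astro's actual code, but a toy illustration of the general idea of turning a disjunction of row-key predicates into HBase scan ranges and point lookups; the KeyPred classes and the key encoding are made up:

    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.util.Bytes

    sealed trait KeyPred
    case class KeyRange(start: String, stop: String) extends KeyPred   // start <= key < stop
    case class KeyIn(keys: Seq[String])              extends KeyPred   // key IN (...)

    object RangePruningSketch {
      // Each predicate in the OR becomes its own scan range; IN lists become
      // narrow per-key ranges (or, as Astro does, plain HBase Gets).
      def toScans(preds: Seq[KeyPred]): Seq[Scan] = preds.flatMap {
        case KeyRange(start, stop) =>
          Seq(new Scan(Bytes.toBytes(start), Bytes.toBytes(stop)))
        case KeyIn(keys) =>
          keys.map(k => new Scan(Bytes.toBytes(k), Bytes.toBytes(k + "\u0000")))
      }
    }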

Re: Sources/pom for org.spark-project.hive

2015-08-11 Thread Pala M Muthaia
Thanks for the pointers. Yes, I started with changing the hive.group property in the pom and started seeing various dependency issues. Initially I thought spark-project.hive was just a pom for uber jars that pull in Hive classes without transitive dependencies like Kryo, but it looks like a lot more change

Re: Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
Hi Josh, I mean on the driver side. OutputCommitCoordinator.startStage is called in DAGScheduler#submitMissingTasks for all the stages (which costs some memory). Although it is fine that, as long as the executor side doesn't make the RPC, there's not much performance penalty. On Wed, Aug 12, 2015 at 12:17 AM, Josh

RE: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
Hey Yan, I've been the one building out this Spark functionality in HBase, so maybe I can help clarify. The hbase-spark module is just focused on making Spark integration with HBase easy and out of the box, for both Spark and Spark Streaming. I, and I believe the HBase team, have no desire to build a

RE: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
We have not "formally" published any numbers yet. A good reference is the slide deck we posted for the meetup in March, or better yet, interested parties can run performance comparisons themselves for now. As for the status quo of Astro, we have been focusing on fixing bugs (a UDF-related bug in

Re: Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Yu
Yan: Where can I find performance numbers for Astro (it's close to the middle of August)? Cheers On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc wrote: > Finally I can take a look at HBASE-14181 now. Unfortunately there is no > design doc mentioned. Superficially it is very similar to Astro with a >

Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design doc mentioned. Superficially it is very similar to Astro, the difference being that this is part of the HBase client library, while Astro works as a Spark package and so will evolve and function more closely with Spark SQL/Datafr

Re: PySpark on PyPi

2015-08-11 Thread westurner
westurner wrote > > Matt Goodman wrote >> I would tentatively suggest also conda packaging. >> >> http://conda.pydata.org/docs/ > $ conda skeleton pypi pyspark > # update git_tag and git_uri > # add test commands (import pyspark; import pyspark.[...]) > > Docs for building conda packages for mul

Re: Sources/pom for org.spark-project.hive

2015-08-11 Thread Steve Loughran
On 11 Aug 2015, at 12:25, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote: Hi, I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We have some customizations in Hive (custom authorization, various hooks, etc.) that are all part of hive-exec. Given Spark's Hive dep

Re: PySpark on PyPi

2015-08-11 Thread westurner
Matt Goodman wrote > I would tentatively suggest also conda packaging. > > http://conda.pydata.org/docs/ $ conda skeleton pypi pyspark # update git_tag and git_uri # add test commands (import pyspark; import pyspark.[...]) Docs for building conda packages for multiple operating systems and inter

Re: Sources/pom for org.spark-project.hive

2015-08-11 Thread Ted Yu
Have you looked at https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf ? Cheers On Tue, Aug 11, 2015 at 12:25 PM, Pala M Muthaia < mchett...@rocketfuelinc.com> wrote: > Hi, > > I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We > have some customizations in Hive (

Sources/pom for org.spark-project.hive

2015-08-11 Thread Pala M Muthaia
Hi, I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We have some customizations in Hive (custom authorization, various hooks, etc.) that are all part of hive-exec. Given that Spark's Hive dependency is through the org.spark-project.hive groupId, it looks like I need to modify the definit

Re: Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Josh Rosen
Can you clarify what you mean by "used for all stages"? OutputCommitCoordinator RPCs should only be initiated through SparkHadoopMapRedUtil.commitTask(), so while the OutputCommitCoordinator doesn't make a distinction between ShuffleMapStages and ResultStages, there still should not be a perform

Re: Master JIRA ticket for tracking Spark 1.5.0 configuration renames, defaults changes, and configuration deprecation

2015-08-11 Thread Tom Graves
Is there a JIRA for incompatibilities? I was just trying Spark 1.5, and it appears that DataFrame aggregates (like sum) now return columns named sum(columnname), whereas in Spark 1.4 it was SUM(columnname); note the capital vs. lower case. I wanted to check and make sure this was a known change.
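
A minimal sketch (hypothetical data, local[*] master) of sidestepping the rename: giving the aggregate an explicit alias keeps downstream code independent of whether the engine names the column SUM(amount) or sum(amount):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.sum

    object AggAliasSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("agg-alias").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df = sc.parallelize(Seq(("eng", 10), ("eng", 5), ("ops", 7))).toDF("dept", "amount")

        // Without an alias the result column is named by the engine ("SUM(amount)" in 1.4,
        // "sum(amount)" in 1.5); an explicit alias keeps the code version-independent.
        val totals = df.groupBy($"dept").agg(sum($"amount").as("total_amount"))
        totals.show()
      }
    }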

Potential bug in broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Hi, Recently I used Spark SQL to do a join on a non-equality condition, for example condition1 or condition2. Spark will use broadcastNestedLoopJoin to do this. Assume that one of the DataFrames (df1) is created neither from a Hive table nor from a local collection, and the other one is created from a Hive table (df2). For
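
A minimal sketch, not a fix for this report, of the usual workarounds when the size estimate for the non-Hive side is unreliable: hint the small side explicitly (the broadcast() function assumes a 1.5-era build) or switch the size-based broadcast off. The table and column names are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions.broadcast

    object BroadcastWorkaroundSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-workaround"))
        val sqlContext = new HiveContext(sc)
        import sqlContext.implicits._

        val df1 = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("k", "tag")  // small, non-Hive side
        val df2 = sqlContext.table("some_hive_table")                       // huge, Hive-backed side

        // Explicitly mark the small side for broadcast instead of relying on its size estimate.
        df2.join(broadcast(df1), df2("k") === df1("k") || df2("k") < df1("k")).explain()

        // Or disable the size-based broadcast decision entirely.
        sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
      }
    }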

Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-11 Thread Akhil Das
Hi, My Spark job (running in local[*] with Spark 1.4.1) reads data from a Thrift server (I created an RDD; it computes the partitions in the getPartitions() call, and in compute() the iterator's hasNext returns records from these partitions). count() and foreach() are working fine and return the correct number of reco
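
Not the poster's code, but a minimal skeleton of the kind of custom RDD being described, with partitions built in getPartitions and records produced in compute; the Thrift access itself is stubbed out:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class ThriftSourceRDD(sc: SparkContext, numPartitions: Int) extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        (0 until numPartitions).map(i => new Partition { override def index: Int = i }).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        // In the real job this iterator would pull records for `split` from the Thrift server;
        // here we just emit a placeholder record per partition.
        Iterator(s"record-from-partition-${split.index}")
      }
    }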

Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
In my understanding, OutputCommitCoordinator should only be necessary for a ResultStage (especially a ResultStage with an HDFS write), but currently it is used for all the stages. Is there any reason for that? -- Best Regards Jeff Zhang

Re: Inquiry about contributing code

2015-08-11 Thread Akhil Das
You can create a new issue and send a pull request for it, I think. + dev list Thanks Best Regards On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon wrote: > Dear Sir / Madam, > > I have a plan to contribute some code for passing filters to a > datasource during physical planning. > > In more
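
For context, the 1.4/1.5-era Data Sources API already exposes a hook for this: a relation can mix in PrunedFilteredScan and receive the pruned columns and pushed-down filters at planning time. A minimal sketch with a made-up relation (a real one would also ship a RelationProvider so it can be registered by name):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    class ToyRelation(override val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // `filters` carries the pushed-down predicates (EqualTo, GreaterThan, In, ...);
        // a real source would translate them into its own scan or pruning logic.
        val data: Seq[Map[String, Any]] = Seq(Map("a" -> 1, "b" -> 2), Map("a" -> 3, "b" -> 4))
        // Return only the requested columns, in the requested order.
        val rows = data.map(m => Row.fromSeq(requiredColumns.map(m).toSeq))
        sqlContext.sparkContext.parallelize(rows)
      }
    }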

Re: Pushing Spark to 10Gb/s

2015-08-11 Thread Akhil Das
Hi Starch, It also depends on the application's behavior; some might not be able to properly utilize the network. If you are using, say, Kafka, then one thing you should keep in mind is the size of the individual messages and the number of partitions you have. The higher the message si

Re: [discuss] Removing individual commit messages from the squash commit message

2015-08-11 Thread Reynold Xin
This is now done with this pull request: https://github.com/apache/spark/pull/8091 Committers, please update the script to get this "feature". On Mon, Jul 20, 2015 at 12:28 AM, Manoj Kumar <manojkumarsivaraj...@gmail.com> wrote: > +1 > > Sounds like a great idea. > > On Sun, Jul 19, 2015 at 10

Reply: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
OK. Then a question will be how to define the boundary between a query engine and built-in processing. If, for instance, the Spark DataFrame functionality involving shuffling is to be supported inside HBase, in my opinion it'd be hard not to tag it as a query engine. If, on the other hand, only

Re: Reply: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Yu
HBase will not have a query engine. It will provide better support to query engines. Cheers > On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc wrote: > > Ted, > > I'm in China now, and seem to be having difficulty accessing the Apache JIRA. > Anyway, it appears to me that HBASE-14181 attempts to