Hi,
Thanks a lot.
The problem is not doing a non-equal join between two large tables; in fact, one
table is really small and the other one is huge.
The problem is that Spark can only get the correct size for a dataframe
created directly from a Hive table. Even if we create a dataframe from a local
collection, it uses d
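A minimal, hedged sketch of the situation (bigDf/smallDf and the columns are made-up names; the statistics accessor and the broadcast() hint are assumed to be available in this Spark release):

import org.apache.spark.sql.functions.broadcast

// Inspect what the planner thinks each side weighs. For a Hive-backed
// dataframe this reflects table statistics; for other dataframes it may
// fall back to a large default estimate.
println(smallDf.queryExecution.analyzed.statistics.sizeInBytes)
println(bigDf.queryExecution.analyzed.statistics.sizeInBytes)

// Possible workaround: mark the small side explicitly. Whether the planner
// honors this hint for non-equi joins depends on the Spark version.
val joined = bigDf.join(broadcast(smallDf), bigDf("value") > smallDf("lo"), "left_outer")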
We are using MR-based bulk loading on Spark.
For filter pushdown, Astro does partition pruning and scan range pruning, and uses
Gets as much as possible.
Thanks,
From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: August 12, 2015 9:14
To: Yan Zhou.sc
Cc: dev@spark.apache.org; Bing Xiao (Bing); Te
Firstly, spark.sql.autoBroadcastJoinThreshold only works for the EQUAL JOIN.
Currently, for a non-equal join, if the join type is an INNER join it
will be done by a CartesianProduct join, while BroadcastNestedLoopJoin is used for the
outer joins.
In the BroadcastNestedLoopJoin, the table with
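A minimal sketch to see this plan selection yourself (the table and column names are made up; assumes a Spark 1.4/1.5-era SQLContext):

val big   = sqlContext.table("some_hive_table")   // hypothetical Hive-backed table with a "value" column
val small = sqlContext.createDataFrame(Seq((1, 10), (2, 20))).toDF("lo", "hi")

// Inner non-equi join: typically planned as a CartesianProduct plus a filter.
big.join(small, big("value") > small("lo")).explain()

// Outer non-equi join: typically planned as a BroadcastNestedLoopJoin.
big.join(small, big("value") > small("lo"), "left_outer").explain()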
There are a number of ways to bulk load.
There is bulk put, partition bulk put, MR bulk load, and now HBASE-14150,
which is Spark shuffle bulk load.
Let me know if I have missed a bulk loading option. All of these are possible
with the new hbase-spark module.
As for the filter push down discussion in the
No, the Astro bulk loader does not use its own shuffle. But its map/reduce-side
processing is somewhat different from HBase’s bulk loader that is used by many
HBase apps, I believe.
From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: Wednesday, August 12, 2015 8:56 AM
To: Yan Zhou.sc
Cc: dev@spark.
The bulk load code is 14150 if you are interested. Let me know how it can be
made faster.
It's just a Spark shuffle plus writing HFiles. Unless Astro wrote its own
shuffle, the times should be very close.
On Aug 11, 2015 8:49 PM, "Yan Zhou.sc" wrote:
> Ted,
>
>
>
> Thanks for pointing out more det
Ted,
Thanks for pointing out more details of HBASE-14181. I am afraid I may still
need to learn more before I can make very accurate and pointed comments.
As for filter push down, Astro has a powerful approach to basically break down
arbitrarily complex logic expressions composed of AND/OR/IN
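Not Astro's actual code, but a toy sketch of the general idea: recursively turning AND/OR/IN predicates over a row key into a small set of point Gets and range Scans (all types and names below are made up for illustration):

sealed trait Pred
case class In(values: Set[Int]) extends Pred        // key IN (v1, v2, ...)
case class Between(lo: Int, hi: Int) extends Pred   // lo <= key <= hi
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

// A scan is a contiguous key range; a single-point range maps to a Get.
case class Scan(lo: Int, hi: Int) { def isGet: Boolean = lo == hi }

def toScans(p: Pred): Seq[Scan] = p match {
  case In(vs)          => merge(vs.toSeq.sorted.map(v => Scan(v, v)))
  case Between(lo, hi) => if (lo <= hi) Seq(Scan(lo, hi)) else Seq.empty
  case Or(l, r)        => merge(toScans(l) ++ toScans(r))   // union of ranges
  case And(l, r)       =>                                   // pairwise intersection
    merge(for {
      a <- toScans(l); b <- toScans(r)
      lo = math.max(a.lo, b.lo); hi = math.min(a.hi, b.hi)
      if lo <= hi
    } yield Scan(lo, hi))
}

// Collapse overlapping/adjacent ranges so the final scan list stays small.
def merge(ss: Seq[Scan]): Seq[Scan] =
  ss.sortBy(_.lo).foldLeft(List.empty[Scan]) {
    case (h :: t, s) if s.lo <= h.hi + 1 => Scan(h.lo, math.max(h.hi, s.hi)) :: t
    case (acc, s)                        => s :: acc
  }.reverse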
Thanks for the pointers. Yes, I started with changing the hive.group
property in the pom and started seeing various dependency issues.
Initially I thought spark-project.hive was just a pom for uber jars that
pull in Hive classes without transitive dependencies like Kryo, but it looks
like a lot more change
Hi Josh,
I mean on the driver side. OutputCommitCoordinator.startStage is called in
DAGScheduler#submitMissingTasks for all the stages (costing some memory).
Although it is fine that, as long as the executor side doesn't make RPC calls,
there is not much of a performance penalty.
On Wed, Aug 12, 2015 at 12:17 AM, Josh
Hey Yan,
I've been the one building out this Spark functionality in HBase, so maybe I
can help clarify.
The hbase-spark module is just focused on making Spark integration with
HBase easy and out of the box, for both Spark and Spark Streaming.
I, and I believe the HBase team, have no desire to build a
We have not “formally” published any numbers yet. A good reference is a slide
deck we posted for the meetup in March, or better yet, interested parties can
run performance comparisons by themselves for now.
As for the status quo of Astro, we have been focusing on fixing bugs (a UDF-related
bug in
Yan:
Where can I find performance numbers for Astro (it's close to the middle of
August)?
Cheers
On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc wrote:
> Finally I can take a look at HBASE-14181 now. Unfortunately there is no
> design doc mentioned. Superficially it is very similar to Astro with a
>
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design
doc mentioned. Superficially it is very similar to Astro, with the difference that
this is part of the HBase client library, while Astro works as a Spark package
and so will evolve and function more closely with Spark SQL/Datafr
westurner wrote
>
> Matt Goodman wrote
>> I would tentatively suggest also conda packaging.
>>
>> http://conda.pydata.org/docs/
> $ conda skeleton pypi pyspark
> # update git_tag and git_uri
> # add test commands (import pyspark; import pyspark.[...])
>
> Docs for building conda packages for mul
On 11 Aug 2015, at 12:25, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
Hi,
I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We have
some customizations in Hive (custom authorization, various hooks etc) that are
all part of hive-exec.
Given Spark's hive dep
Matt Goodman wrote
> I would tentatively suggest also conda packaging.
>
> http://conda.pydata.org/docs/
$ conda skeleton pypi pyspark
# update git_tag and git_uri
# add test commands (import pyspark; import pyspark.[...])
Docs for building conda packages for multiple operating systems and
inter
Have you looked at
https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf ?
Cheers
On Tue, Aug 11, 2015 at 12:25 PM, Pala M Muthaia <
mchett...@rocketfuelinc.com> wrote:
> Hi,
>
> I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We
> have some customizations in Hive (
Hi,
I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We
have some customizations in Hive (custom authorization, various hooks etc)
that are all part of hive-exec.
Given Spark's Hive dependency is through the org.spark-project.hive groupId,
it looks like I need to modify the definit
Can you clarify what you mean by "used for all stages"?
OutputCommitCoordinator RPCs should only be initiated through
SparkHadoopMapRedUtil.commitTask(), so while the OutputCommitCoordinator
doesn't make a distinction between ShuffleMapStages and ResultStages
there still should not be a perform
Is there a JIRA for incompatibilities? I was just trying Spark 1.5, and it
appears that dataframe aggregates (like sum) now return columns named
sum(columnname), whereas in Spark 1.4 it was SUM(columnname); note the capital
vs. lowercase.
I wanted to check and make sure this was a known change.
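One way to keep code insulated from this, as a small sketch (df, dept, and amount are made-up names): give each aggregate an explicit alias instead of relying on the generated column name.

import org.apache.spark.sql.functions.sum

// Explicit alias, so downstream code does not care whether Spark generates
// "SUM(amount)" or "sum(amount)".
val totals = df.groupBy("dept").agg(sum("amount").as("total_amount"))
totals.select("total_amount").show()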
Hi,
Recently, I used Spark SQL to do a join on a non-equality condition, for example
condition1 or condition2.
Spark will use BroadcastNestedLoopJoin to do this. Assume that one of the
dataframes (df1) is created neither from a Hive table nor from a local collection,
and the other one is created from a Hive table (df2). For
Hi
My Spark job (running in local[*] with Spark 1.4.1) reads data from a
Thrift server (I created an RDD that computes the partitions in the
getPartitions() call, and in compute() the iterator's hasNext returns records
from these partitions). count() and foreach() are working fine and return the
correct number of reco
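For reference, a minimal custom-RDD skeleton of the kind being described (purely illustrative; the partition contents are faked with an integer range instead of actual Thrift reads):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition descriptor; a real one would carry whatever is
// needed to re-read its slice from the Thrift server.
case class SlicePartition(index: Int, start: Int, count: Int) extends Partition

class ThriftSliceRDD(sc: SparkContext, numSlices: Int, rowsPerSlice: Int)
    extends RDD[Int](sc, Nil) {

  // Partitions are computed once, on the driver.
  override def getPartitions: Array[Partition] =
    (0 until numSlices)
      .map(i => SlicePartition(i, i * rowsPerSlice, rowsPerSlice): Partition)
      .toArray

  // Each action (count, foreach, collect, ...) calls compute() again from
  // scratch, so the returned iterator must not share state across calls.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[SlicePartition]
    Iterator.range(p.start, p.start + p.count) // stand-in for the Thrift reads
  }
}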
As I understand it, OutputCommitCoordinator should only be necessary for a
ResultStage (especially a ResultStage with an HDFS write), but currently it
is used for all the stages. Is there any reason for that?
--
Best Regards
Jeff Zhang
You can create a new issue and send a pull request for it, I think.
+ dev list
Thanks
Best Regards
On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon wrote:
> Dear Sir / Madam,
>
> I have a plan to contribute some code for passing filters to a
> data source as part of physical planning.
>
> In more
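For context, a minimal sketch of the existing filter-pushdown hook in the data source API that such a contribution would relate to (the relation, its columns, and the backing Seq are all made up for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Toy relation backed by an in-memory Seq, showing where pruned columns and
// pushed-down filters arrive.
class DemoRelation(val sqlContext: SQLContext, data: Seq[(Int, String)])
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Apply only the filters this source understands; Spark re-checks the rest.
    val kept = data.filter { case (id, _) =>
      filters.forall {
        case EqualTo("id", v: Int)     => id == v
        case GreaterThan("id", v: Int) => id > v
        case _                         => true
      }
    }
    val rows = kept.map { case (id, name) =>
      Row.fromSeq(requiredColumns.map {
        case "id"   => id
        case "name" => name
      })
    }
    sqlContext.sparkContext.parallelize(rows)
  }
}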
Hi Starch,
It also depends on the application's behavior; some might not be able to
utilize the network properly. If you are using, say, Kafka, then one thing
you should keep in mind is the size of the individual messages and the
number of partitions you have. The higher the message si
This is now done with this pull request:
https://github.com/apache/spark/pull/8091
Committers, please update the script to get this "feature".
On Mon, Jul 20, 2015 at 12:28 AM, Manoj Kumar <
manojkumarsivaraj...@gmail.com> wrote:
> +1
>
> Sounds like a great idea.
>
> On Sun, Jul 19, 2015 at 10
Ok. Then a question will be how to define the boundary between a query engine and
built-in processing. If, for instance, the Spark DataFrame functionalities
involving shuffling are to be supported inside HBase,
in my opinion it’d be hard not to tag it as a query engine. If, on the other
hand, only
HBase will not have a query engine.
It will provide better support for query engines.
Cheers
> On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc wrote:
>
> Ted,
>
> I’m in China now, and seem to have difficulty accessing Apache JIRA.
> Anyway, it appears to me that HBASE-14181 attempts to