Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Yu
HBase will not have a query engine. 

It will provide better support for query engines. 

Cheers



> On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc  wrote:
> 
> Ted,
>  
> I’m in China now, and seem to experience difficulty to access Apache Jira. 
> Anyways, it appears to me  that HBASE-14181 attempts to support Spark 
> DataFrame inside HBase.
> If true, one question to me is whether HBase is intended to have a built-in 
> query engine or not. Or it will stick with the current way as
> a k-v store with some built-in processing capabilities in the forms of 
> coprocessor, custom filter, …, etc., which allows for loosely-coupled query 
> engines
> built on top of it.
>  
> Thanks,
>  
> From: Ted Yu [mailto:yuzhih...@gmail.com] 
> Sent: August 11, 2015 8:54
> To: Bing Xiao (Bing)
> Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
> Subject: Re: Package Release Announcement: Spark SQL on HBase "Astro"
>  
> Yan / Bing:
> Mind taking a look at HBASE-14181 'Add Spark DataFrame DataSource to 
> HBase-Spark Module' ?
>  
> Thanks
>  
> On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing)  
> wrote:
> We are happy to announce the availability of the Spark SQL on HBase 1.0.0 
> release.  http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
> The main features in this package, dubbed “Astro”, include:
> · Systematic and powerful handling of data pruning and intelligent 
> scan, based on partial evaluation technique
> 
> · HBase pushdown capabilities like custom filters and coprocessor to 
> support ultra low latency processing
> 
> · SQL, Data Frame support
> 
> · More SQL capabilities made possible (Secondary index, bloom filter, 
> Primary Key, Bulk load, Update)
> 
> · Joins with data from other sources
> 
> · Python/Java/Scala support
> 
> · Support latest Spark 1.4.0 release
> 
>  
> 
> The tests by the Huawei team and community contributors covered the areas: bulk 
> load; projection pruning; partition pruning; partial evaluation; code 
> generation; coprocessor; custom filtering; DML; complex filtering on keys 
> and non-keys; join/union with non-HBase data; Data Frame; multi-column family 
> test.  We will post the test results, including performance tests, in the 
> middle of August.
> You are very welcome to try out or deploy the package, and help improve the 
> integration tests with various combinations of the settings, extensive Data 
> Frame tests, complex join/union tests and extensive performance tests.  Please 
> use the “Issues” and “Pull Requests” links at this package homepage if you want 
> to report bugs, improvements or feature requests.
> Special thanks to project owner and technical leader Yan Zhou, Huawei global 
> team, community contributors and Databricks.   Databricks has been providing 
> great assistance from the design to the release.
> “Astro”, the Spark SQL on HBase package will be useful for ultra low latency 
> query and analytics of large scale data sets in vertical enterprises. We will 
> continue to work with the community to develop new features and improve code 
> base.  Your comments and suggestions are greatly appreciated.
>  
> Yan Zhou / Bing Xiao
> Huawei Big Data team
>  
>  


Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
OK. Then a question will be to define the boundary between a query engine and 
built-in processing. If, for instance, the Spark DataFrame functionalities 
involving shuffling are to be supported inside HBase,
in my opinion it’d be hard not to tag it as a query engine. If, on the other 
hand, only map-side ops from DataFrame are to be supported inside HBase, then 
Astro’s coprocessor already has those capabilities.

Again, I still have no full knowledge of HBASE-14181 beyond your description 
in the email, so my opinion above might be skewed as a result.

Regards,

Yan

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: August 11, 2015 15:28
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
Subject: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

HBase will not have a query engine.

It will provide better support for query engines.

Cheers


On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
Ted,

I’m in China now, and seem to experience difficulty to access Apache Jira. 
Anyways, it appears to me  that 
HBASE-14181 attempts to 
support Spark DataFrame inside HBase.
If true, one question to me is whether HBase is intended to have a built-in 
query engine or not. Or it will stick with the current way as
a k-v store with some built-in processing capabilities in the forms of 
coprocessor, custom filter, …, etc., which allows for loosely-coupled query 
engines
built on top of it.

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: August 11, 2015 8:54
To: Bing Xiao (Bing)
Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
Subject: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan / Bing:
Mind taking a look at 
HBASE-14181 'Add Spark 
DataFrame DataSource to HBase-Spark Module' ?

Thanks

On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) <bing.x...@huawei.com> wrote:
We are happy to announce the availability of the Spark SQL on HBase 1.0.0 
release.  http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this package, dubbed “Astro”, include:

• Systematic and powerful handling of data pruning and intelligent 
scan, based on partial evaluation technique

• HBase pushdown capabilities like custom filters and coprocessor to 
support ultra low latency processing

• SQL, Data Frame support

• More SQL capabilities made possible (Secondary index, bloom filter, 
Primary Key, Bulk load, Update)

• Joins with data from other sources

• Python/Java/Scala support

• Support latest Spark 1.4.0 release


The tests by the Huawei team and community contributors covered the areas: bulk 
load; projection pruning; partition pruning; partial evaluation; code 
generation; coprocessor; custom filtering; DML; complex filtering on keys and 
non-keys; join/union with non-HBase data; Data Frame; multi-column family test. 
We will post the test results, including performance tests, in the middle of August.
You are very welcome to try out or deploy the package, and help improve the 
integration tests with various combinations of the settings, extensive Data 
Frame tests, complex join/union tests and extensive performance tests.  Please 
use the “Issues” and “Pull Requests” links at this package homepage if you want 
to report bugs, improvements or feature requests.
Special thanks to project owner and technical leader Yan Zhou, Huawei global 
team, community contributors and Databricks.   Databricks has been providing 
great assistance from the design to the release.
“Astro”, the Spark SQL on HBase package will be useful for ultra low latency 
query and analytics of large scale data sets in vertical enterprises. We will 
continue to work with the community to develop new features and improve code 
base.  Your comments and suggestions are greatly appreciated.

Yan Zhou / Bing Xiao
Huawei Big Data team




Re: [discuss] Removing individual commit messages from the squash commit message

2015-08-11 Thread Reynold Xin
This is now done with this pull request:
https://github.com/apache/spark/pull/8091


Committers please update the script to get this "feature".


On Mon, Jul 20, 2015 at 12:28 AM, Manoj Kumar <
manojkumarsivaraj...@gmail.com> wrote:

> +1
>
> Sounds like a great idea.
>
> On Sun, Jul 19, 2015 at 10:54 PM, Sandy Ryza 
> wrote:
>
>> +1
>>
>> On Sat, Jul 18, 2015 at 4:00 PM, Mridul Muralidharan 
>> wrote:
>>
>>> Thanks for detailing, definitely sounds better.
>>> +1
>>>
>>> Regards
>>> Mridul
>>>
>>> On Saturday, July 18, 2015, Reynold Xin  wrote:
>>>
 A single commit message consisting of:

 1. Pull request title (which includes JIRA number and component, e.g.
 [SPARK-1234][MLlib])

 2. Pull request description

 3. List of authors contributing to the patch

 The main thing that changes is 3: we used to also include the
 individual commits to the pull request branch that are squashed.


 On Sat, Jul 18, 2015 at 3:45 PM, Mridul Muralidharan 
 wrote:

> Just to clarify, the proposal is to have a single commit msg giving
> the jira and pr id?
> That sounds like a good change to have.
>
> Regards
> Mridul
>
>
> On Saturday, July 18, 2015, Reynold Xin  wrote:
>
>> I took a look at the commit messages in git log -- it looks like the
>> individual commit messages are not that useful to include, but do make 
>> the
>> commit messages more verbose. They are usually just a bunch of extremely
>> concise descriptions of "bug fixes", "merges", etc:
>>
>> cb3f12d [xxx] add whitespace
>> 6d874a6 [xxx] support pyspark for yarn-client
>>
>> 89b01f5 [yyy] Update the unit test to add more cases
>> 275d252 [yyy] Address the comments
>> 7cc146d [yyy] Address the comments
>> 2624723 [yyy] Fix rebase conflict
>> 45befaa [yyy] Update the unit test
>> bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
>>
>>
>> Anybody against removing those from the merge script so the log looks
>> cleaner? If nobody feels strongly about this, we can just create a JIRA 
>> to
>> remove them, and only keep the author names.
>>
>>

>>
>
>
> --
> Godspeed,
> Manoj Kumar,
> http://manojbits.wordpress.com
> 
> http://github.com/MechCoder
>


Re: Pushing Spark to 10Gb/s

2015-08-11 Thread Akhil Das
Hi Starch,

It also depends on the application's behavior; some applications might not be
able to utilize the network properly. If you are using, say, Kafka, then one
thing that you should keep in mind is the size of the individual messages and
the number of partitions that you have. Larger messages and a higher number of
partitions (in Kafka) will utilize the network better. With this combination,
we have operated a few pipelines running at 10Gb/s (~1GB/s).

Thanks
Best Regards

On Tue, Aug 11, 2015 at 12:24 AM, Starch, Michael D (398M) <
michael.d.sta...@jpl.nasa.gov> wrote:

> All,
>
> I am trying to get data moving in and out of Spark at 10Gb/s. I currently
> have a very powerful cluster to work on, offering 40Gb/s InfiniBand links,
> so I believe the network pipe should be fast enough.
>
> Has anyone gotten Spark operating at high data rates before? Any advice
> would be appreciated.
>
> -Michael Starch
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
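
To make the partition-count point concrete, here is a minimal core-Spark sketch 
(the input path and partition numbers below are made up): the number of 
partitions bounds how many tasks, and therefore how many concurrent readers, 
pull data over the network at once.

    import org.apache.spark.{SparkConf, SparkContext}

    object ThroughputSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("throughput-sketch"))

        // Hypothetical input path; minPartitions controls how many tasks
        // (and therefore how many concurrent readers) perform the ingest.
        val input = sc.textFile("hdfs:///data/ingest", minPartitions = 256)

        // Repartitioning spreads downstream work over more tasks,
        // at the cost of an extra shuffle.
        val spread = input.repartition(512)

        println(s"records: ${spread.count()}")
        sc.stop()
      }
    }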


Re: Inquiry about contributing code

2015-08-11 Thread Akhil Das
You can create a new issue and send a pull request for it, I think.

+ dev list

Thanks
Best Regards

On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon  wrote:

> Dear Sir / Madam,
>
> I plan to contribute some code for passing filters to a
> data source during physical planning.
>
> In more detail, I understand that when we want to push filter operations
> down to a data source like Parquet (actually filtering while reading HDFS
> blocks, rather than filtering in memory with Spark operations), we need to implement
>
> PrunedFilteredScan, PrunedScan or CatalystScan in package
> org.apache.spark.sql.sources.
>
>
>
> For PrunedFilteredScan and PrunedScan, Spark passes filter objects from the
> package org.apache.spark.sql.sources; these do not access the query parser
> directly but are built by selectFilters() in
> org.apache.spark.sql.sources.DataSourceStrategy.
>
> It looks like not all the filters (i.e. the raw expressions) are passed to
> the function below in PrunedFilteredScan and PrunedScan.
>
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): 
> RDD[Row]
>
> The filters passed here are defined in the package
> org.apache.spark.sql.sources.
>
> On the other hand, it does not pass the EqualNullSafe filter from the package
> org.apache.spark.sql.catalyst.expressions, even though passing it looks
> possible for other data sources such as Parquet and JSON.
>
>
>
> I understand that CatalystScan can take all the raw expressions coming from
> the query planner. However, it is experimental and also needs
> different interfaces (as well as being unstable for reasons such as binary
> capability).
>
> As far as I know, Parquet also does not use this.
>
>
>
> In general, this can be an issue when a user sends a query to the data, such as
>
> 1.
>
> SELECT *
> FROM table
> WHERE field = 1;
>
>
> 2.
>
> SELECT *
> FROM table
> WHERE field <=> 1;
>
>
> The second query can be hugely slow because of the large network traffic
> caused by unfiltered data from the source RDD.
>
>
>
> Also, I could not find a proper issue for this (except for
> https://issues.apache.org/jira/browse/SPARK-8747, which says it now
> supports binary capability).
>
> Accordingly, I want to add this issue and make a pull request with my
> code.
>
>
> Could you please make any comments for this?
>
> Thanks.
>
>
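
For reference, here is a minimal sketch of the interface discussed above, written 
against the Spark 1.4-era data sources API (the relation, column name and data are 
invented). A PrunedFilteredScan only ever receives the filters that selectFilters() 
could translate into org.apache.spark.sql.sources.Filter objects, which is why a 
catalyst-only predicate such as EqualNullSafe (<=>) does not show up in buildScan.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, GreaterThan, PrunedFilteredScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // Toy relation over the integers 1..100 with a single column "field".
    class DemoRelation(override val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(StructField("field", IntegerType) :: Nil)

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // Only source-level filters arrive here; a real implementation would
        // also honor requiredColumns for column pruning.
        val predicates = filters.collect {
          case EqualTo("field", v: Int)     => (x: Int) => x == v
          case GreaterThan("field", v: Int) => (x: Int) => x > v
        }
        sqlContext.sparkContext
          .parallelize(1 to 100)
          .filter(x => predicates.forall(p => p(x)))
          .map(x => Row(x))
      }
    }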


Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
In my understanding, OutputCommitCoordinator should only be necessary for a
ResultStage (especially a ResultStage with an HDFS write), but currently it
is used for all stages. Is there any reason for that?

-- 
Best Regards

Jeff Zhang


Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-11 Thread Akhil Das
Hi

My Spark job (running in local[*] with Spark 1.4.1) reads data from a
Thrift server (I created an RDD that computes the partitions in its
getPartitions() call, and whose compute() returns records from those
partitions through hasNext/next). count() and foreach() work fine and
return the correct number of records. But whenever there is a ShuffleMap
stage (like reduceByKey etc.), all the tasks execute properly, yet the job
enters an infinite loop saying:


   1. 15/08/11 13:05:54 INFO DAGScheduler: Resubmitting ShuffleMapStage 1 (map
   at FilterMain.scala:59) because some of its tasks had failed: 0, 3


Here's the complete stack-trace http://pastebin.com/hyK7cG8S

What could be the root cause of this problem? I looked up and bumped into
this closed JIRA  (which
is very very old)




Thanks
Best Regards
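
For context, here is a bare-bones skeleton of the kind of custom RDD described 
above (the Thrift details are omitted; the partition count and record type are 
invented): getPartitions() defines the splits, and compute() returns the iterator 
whose hasNext/next serve the records for one partition.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // A partition that just remembers which slice of the remote data it owns.
    class RemotePartition(val index: Int) extends Partition

    class RemoteSourceRDD(sc: SparkContext, numParts: Int) extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        (0 until numParts).map(i => new RemotePartition(i): Partition).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        val part = split.asInstanceOf[RemotePartition]
        // Placeholder: a real implementation would open a connection to the
        // Thrift server here and stream this partition's records lazily.
        Iterator(s"record-from-partition-${part.index}")
      }
    }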


Potential bug broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Hi,

Recently, I used Spark SQL to do a join on a non-equality condition, for
example condition1 OR condition2.

Spark will use BroadcastNestedLoopJoin to do this. Assume that one of the
DataFrames (df1) is created neither from a Hive table nor from a local
collection, and the other one (df2) is created from a Hive table. For df1,
Spark will use defaultSizeInBytes * length of df1 to estimate the size of
df1, and it will use the correct size for df2.

As a result, in most cases Spark will think df1 is bigger than df2 even
when df2 is really huge. Spark will then do df2.collect(), which causes
errors or slows the program down.

Maybe we should just use defaultSizeInBytes for logicalRDD, not
defaultSizeInBytes * length?

Hope this could be helpful
Thanks a lot in advance for your help and input.

Cheers
Gen
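
Here is a hypothetical sketch of the reported scenario ("some_hive_table" and the 
column names are made up), so the chosen plan can be inspected with explain(). 
spark.sql.autoBroadcastJoinThreshold is the usual knob to experiment with here, 
although whether it changes the nested-loop build side depends on the planner 
version.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.col

    object NonEquiJoinSketch {
      def inspect(sqlContext: SQLContext): Unit = {
        // df1 is not built from a Hive table, so its size is estimated as
        // defaultSizeInBytes * row count; df2 comes from Hive with real stats.
        val df1 = sqlContext.range(0L, 1000000L).toDF("a")
        val df2 = sqlContext.table("some_hive_table")

        // Non-equality condition (condition1 OR condition2), which Spark
        // plans with a nested-loop style join.
        val joined = df1.join(df2, col("a") > col("b") || col("a") === col("c"))

        // Check which side the planner decided to broadcast before running it.
        joined.explain(true)

        // Knob to experiment with; -1 disables size-based auto-broadcasting.
        sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
      }
    }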


Re: Master JIRA ticket for tracking Spark 1.5.0 configuration renames, defaults changes, and configuration deprecation

2015-08-11 Thread Tom Graves
Is there a JIRA for incompatibilities?  I was just trying Spark 1.5, and it 
appears that DataFrame aggregates (like sum) now return columns named 
sum(columnname), whereas in Spark 1.4 it was SUM(columnname) (note the capital 
vs. lower case).
I wanted to check and make sure this was a known change.
Tom 


 On Monday, August 3, 2015 12:07 AM, Josh Rosen  
wrote:
   

 To help us track planned / finished configuration renames, defaults changes, 
and configuration deprecation for the upcoming 1.5.0 release, I have created 
https://issues.apache.org/jira/browse/SPARK-9550.
As you make configuration changes or think of configurations that need to be 
audited, please update that ticket by editing it or posting a comment.
This ticket will help us later when it comes time to draft release notes.
Thanks,
Josh
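
One version-proof way around the renamed aggregate columns Tom describes is to 
alias the aggregate explicitly instead of relying on the generated sum(...)/SUM(...) 
name. A small sketch ("key" and "amount" are made-up column names):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.sum

    object StableAggregateNames {
      // The explicit alias keeps the output column name stable across
      // Spark 1.4 and 1.5, whatever casing the planner generates.
      def totalByKey(df: DataFrame): DataFrame =
        df.groupBy("key").agg(sum("amount").alias("sum_amount"))
    }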

  

Re: Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Josh Rosen
Can you clarify what you mean by "used for all stages"? 
OutputCommitCoordinator RPCs should only be initiated through 
SparkHadoopMapRedUtil.commitTask(), so while the OutputCommitCoordinator 
doesn't make a distinction between ShuffleMapStages and ResultStages 
there still should not be a performance penalty for this because the 
extra rounds of RPCs should only be performed when necessary.


On 8/11/15 2:25 AM, Jeff Zhang wrote:
In my understanding, OutputCommitCoordinator should only be necessary 
for a ResultStage (especially a ResultStage with an HDFS write), but 
currently it is used for all stages. Is there any reason for that?


--
Best Regards

Jeff Zhang



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
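
To illustrate the first-committer-wins bookkeeping described above, here is a toy 
sketch only (not Spark's actual OutputCommitCoordinator): the driver-side state is 
just a map from (stage, partition) to the task attempt that was authorized to 
commit, and it is consulted only when a task actually asks to commit its output.

    import scala.collection.mutable

    class ToyCommitCoordinator {
      // (stageId, partitionId) -> attempt number authorized to commit
      private val authorized = mutable.Map.empty[(Int, Int), Int]

      def canCommit(stageId: Int, partitionId: Int, attempt: Int): Boolean =
        synchronized {
          authorized.get((stageId, partitionId)) match {
            case Some(winner) => winner == attempt   // only the first winner may commit
            case None =>
              authorized((stageId, partitionId)) = attempt  // first request wins
              true
          }
        }
    }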



Sources/pom for org.spark-project.hive

2015-08-11 Thread Pala M Muthaia
Hi,

I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We
have some customizations in Hive (custom authorization, various hooks etc)
that are all part of hive-exec.

Given Spark's hive dependency is through org.spark-project.hive groupId,
looks like i need to modify the definition of hive-exec artifact there to
take dependency on our internal hive (vs org.apache.hive), and then
everything else would flow through.

However, i am unable to find sources for org.spark-project.hive to make
this change. Is it available? Otherwise, how can i proceed in this
situation?


Thanks,
pala


Re: Sources/pom for org.spark-project.hive

2015-08-11 Thread Ted Yu
Have you looked at
https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf ?

Cheers

On Tue, Aug 11, 2015 at 12:25 PM, Pala M Muthaia <
mchett...@rocketfuelinc.com> wrote:

> Hi,
>
> I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We
> have some customizations in Hive (custom authorization, various hooks etc)
> that are all part of hive-exec.
>
> Given Spark's hive dependency is through org.spark-project.hive groupId,
> looks like i need to modify the definition of hive-exec artifact there to
> take dependency on our internal hive (vs org.apache.hive), and then
> everything else would flow through.
>
> However, i am unable to find sources for org.spark-project.hive to make
> this change. Is it available? Otherwise, how can i proceed in this
> situation?
>
>
> Thanks,
> pala
>


Re: PySpark on PyPi

2015-08-11 Thread westurner
Matt Goodman wrote
> I would tentatively suggest also conda packaging.
> 
> http://conda.pydata.org/docs/

$ conda skeleton pypi pyspark
# update git_tag and git_uri
# add test commands (import pyspark; import pyspark.[...])

Docs for building conda packages for multiple operating systems and
interpreters from PyPi packages:

*
http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
* https://github.com/audreyr/cookiecutter/issues/232


Matt Goodman wrote
> --Matthew Goodman
> 
> =
> Check Out My Website: http://craneium.net
> Find me on LinkedIn: http://tinyurl.com/d6wlch
> 
> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <

> davies@

> > wrote:
> 
>> I think so, any contributions on this are welcome.
>>
>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <

> ellisonbg@

> >
>> wrote:
>> > Sorry, trying to follow the context here. Does it look like there is
>> > support for the idea of creating a setup.py file and pypi package for
>> > pyspark?
>> >
>> > Cheers,
>> >
>> > Brian
>> >
>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <

> davies@

> >
>> wrote:
>> >> We could do that after 1.5 released, it will have same release cycle
>> >> as Spark in the future.
>> >>
>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>> >> <

> o.girardot@

> > wrote:
>> >>> +1 (once again :) )
>> >>>
>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <

> justin.uang@

> >:
>> 
>>  // ping
>> 
>>  do we have any signoff from the pyspark devs to submit a PR to
>> publish to
>>  PyPI?
>> 
>>  On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <
>> 

> freeman.jeremy@

>>
>>  wrote:
>> >
>> > Hey all, great discussion, just wanted to +1 that I see a lot of
>> value in
>> > steps that make it easier to use PySpark as an ordinary python
>> library.
>> >
>> > You might want to check out this
>> (https://github.com/minrk/findspark
>> ),
>> > started by Jupyter project devs, that offers one way to facilitate
>> this
>> > stuff. I’ve also cced them here to join the conversation.
>> >
>> > Also, @Jey, I can also confirm that at least in some scenarios
>> (I’ve
>> done
>> > it in an EC2 cluster in standalone mode) it’s possible to run
>> PySpark jobs
>> > just using `from pyspark import SparkContext; sc =
>> SparkContext(master=“X”)`
>> > so long as the environmental variables (PYTHONPATH and
>> PYSPARK_PYTHON) are
>> > set correctly on *both* workers and driver. That said, there’s
>> definitely
>> > additional configuration / functionality that would require going
>> through
>> > the proper submit scripts.
>> >
>> > On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <
>> 

> punya.biswal@

>>
>> > wrote:
>> >
>> > I agree with everything Justin just said. An additional advantage
>> of
>> > publishing PySpark's Python code in a standards-compliant way is
>> the
>> fact
>> > that we'll be able to declare transitive dependencies (Pandas,
>> Py4J)
>> in a
>> > way that pip can use. Contrast this with the current situation,
>> where
>> > df.toPandas() exists in the Spark API but doesn't actually work
>> until you
>> > install Pandas.
>> >
>> > Punya
>> > On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <

> justin.uang@

> >
>> > wrote:
>> >>
>> >> // + Davies for his comments
>> >> // + Punya for SA
>> >>
>> >> For development and CI, like Olivier mentioned, I think it would
>> be
>> >> hugely beneficial to publish pyspark (only code in the python/
>> dir)
>> on PyPI.
>> >> If anyone wants to develop against PySpark APIs, they need to
>> download the
>> >> distribution and do a lot of PYTHONPATH munging for all the tools
>> (pylint,
>> >> pytest, IDE code completion). Right now that involves adding
>> python/ and
>> >> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add
>> more
>> >> dependencies, we would have to manually mirror all the PYTHONPATH
>> munging in
>> >> the ./pyspark script. With a proper pyspark setup.py which
>> declares
>> its
>> >> dependencies, and a published distribution, depending on pyspark
>> will just
>> >> be adding pyspark to my setup.py dependencies.
>> >>
>> >> Of course, if we actually want to run parts of pyspark that is
>> backed by
>> >> Py4J calls, then we need the full spark distribution with either
>> ./pyspark
>> >> or ./spark-submit, but for things like linting and development,
>> the
>> >> PYTHONPATH munging is very annoying.
>> >>
>> >> I don't think the version-mismatch issues are a compelling reason
>> to not
>> >> go ahead with PyPI publishing. At runtime, we should definitely
>> enforce that
>> >> the version has to be exact, which means there is no backcompat
>> nightmare as
>> >> suggested by Davies in
>> https://issues.apache.org/jira/browse/SPARK-1267.
>> >> This would mean that even if the us

Re: Sources/pom for org.spark-project.hive

2015-08-11 Thread Steve Loughran

On 11 Aug 2015, at 12:25, Pala M Muthaia 
mailto:mchett...@rocketfuelinc.com>> wrote:

Hi,

I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We have 
some customizations in Hive (custom authorization, various hooks etc) that are 
all part of hive-exec.

Given Spark's hive dependency is through org.spark-project.hive groupId, looks 
like i need to modify the definition of hive-exec artifact there to take 
dependency on our internal hive (vs org.apache.hive), and then everything else 
would flow through.


you can just change the hive group definition in your spark build; that's the 
easy part
org.spark-project.hive

harder is getting a consistent kryo binding and any other shading/unshading. In 
SPARK-8064 we've moved Spark 1.5 to using Hive 1.2.1, but even there we had to 
patch hive to use the same Kryo version, and shade protobuf in hive-exec for 
everything to work on Hadoop 1.x.


However, i am unable to find sources for org.spark-project.hive to make this 
change. Is it available? Otherwise, how can i proceed in this situation?

Ted's pointed to the 0.13 code; the 1.2.1 is under 
https://github.com/pwendell/hive/commits/release-1.2.1-spark

However: do not attempt to change Hive versions within a release; things are 
intertwined at the Spark SQL level and your code just won't work.



Thanks,
pala



Re: PySpark on PyPi

2015-08-11 Thread westurner
westurner wrote
> 
> Matt Goodman wrote
>> I would tentatively suggest also conda packaging.
>> 
>> http://conda.pydata.org/docs/
> $ conda skeleton pypi pyspark
> # update git_tag and git_uri
> # add test commands (import pyspark; import pyspark.[...])
> 
> Docs for building conda packages for multiple operating systems and
> interpreters from PyPi packages:
> 
> *
> http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
> * https://github.com/audreyr/cookiecutter/issues/232

* conda meta.yaml can specify e.g. a test.sh script(s) that should return 0
 
  Docs: http://conda.pydata.org/docs/building/meta-yaml.html#test-section


Wes Turner wrote
> 
> Matt Goodman wrote
>> --Matthew Goodman
>> 
>> =
>> Check Out My Website: http://craneium.net
>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>> 
>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <

>> davies@

>> > wrote:
>> 
>>> I think so, any contributions on this are welcome.
>>>
>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <

>> ellisonbg@

>> >
>>> wrote:
>>> > Sorry, trying to follow the context here. Does it look like there is
>>> > support for the idea of creating a setup.py file and pypi package for
>>> > pyspark?
>>> >
>>> > Cheers,
>>> >
>>> > Brian
>>> >
>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <

>> davies@

>> >
>>> wrote:
>>> >> We could do that after 1.5 released, it will have same release cycle
>>> >> as Spark in the future.
>>> >>
>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>>> >> <

>> o.girardot@

>> > wrote:
>>> >>> +1 (once again :) )
>>> >>>
>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <

>> justin.uang@

>> >:
>>> 
>>>  // ping
>>> 
>>>  do we have any signoff from the pyspark devs to submit a PR to
>>> publish to
>>>  PyPI?
>>> 
>>>  On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <
>>> 

>> freeman.jeremy@

>>>
>>>  wrote:
>>> >
>>> > Hey all, great discussion, just wanted to +1 that I see a lot of
>>> value in
>>> > steps that make it easier to use PySpark as an ordinary python
>>> library.
>>> >
>>> > You might want to check out this
>>> (https://github.com/minrk/findspark
>>> ),
>>> > started by Jupyter project devs, that offers one way to facilitate
>>> this
>>> > stuff. I’ve also cced them here to join the conversation.
>>> >
>>> > Also, @Jey, I can also confirm that at least in some scenarios
>>> (I’ve
>>> done
>>> > it in an EC2 cluster in standalone mode) it’s possible to run
>>> PySpark jobs
>>> > just using `from pyspark import SparkContext; sc =
>>> SparkContext(master=“X”)`
>>> > so long as the environmental variables (PYTHONPATH and
>>> PYSPARK_PYTHON) are
>>> > set correctly on *both* workers and driver. That said, there’s
>>> definitely
>>> > additional configuration / functionality that would require going
>>> through
>>> > the proper submit scripts.
>>> >
>>> > On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <
>>> 

>> punya.biswal@

>>>
>>> > wrote:
>>> >
>>> > I agree with everything Justin just said. An additional advantage
>>> of
>>> > publishing PySpark's Python code in a standards-compliant way is
>>> the
>>> fact
>>> > that we'll be able to declare transitive dependencies (Pandas,
>>> Py4J)
>>> in a
>>> > way that pip can use. Contrast this with the current situation,
>>> where
>>> > df.toPandas() exists in the Spark API but doesn't actually work
>>> until you
>>> > install Pandas.
>>> >
>>> > Punya
>>> > On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <

>> justin.uang@

>> >
>>> > wrote:
>>> >>
>>> >> // + Davies for his comments
>>> >> // + Punya for SA
>>> >>
>>> >> For development and CI, like Olivier mentioned, I think it would
>>> be
>>> >> hugely beneficial to publish pyspark (only code in the python/
>>> dir)
>>> on PyPI.
>>> >> If anyone wants to develop against PySpark APIs, they need to
>>> download the
>>> >> distribution and do a lot of PYTHONPATH munging for all the tools
>>> (pylint,
>>> >> pytest, IDE code completion). Right now that involves adding
>>> python/ and
>>> >> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to
>>> add
>>> more
>>> >> dependencies, we would have to manually mirror all the PYTHONPATH
>>> munging in
>>> >> the ./pyspark script. With a proper pyspark setup.py which
>>> declares
>>> its
>>> >> dependencies, and a published distribution, depending on pyspark
>>> will just
>>> >> be adding pyspark to my setup.py dependencies.
>>> >>
>>> >> Of course, if we actually want to run parts of pyspark that is
>>> backed by
>>> >> Py4J calls, then we need the full spark distribution with either
>>> ./pyspark
>>> >> or ./spark-submit, but for things like linting and development,
>>> the
>>> >> PYTHONPATH munging is very annoying.
>>> >>
>>> >> I don't think the version

Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design 
doc mentioned. Superficially it is very similar to Astro, with the difference
that it is part of the HBase client library, while Astro works as a Spark package 
and so will evolve and function more closely with Spark SQL/DataFrame rather than 
HBase.

In terms of architecture, my take is loosely-coupled query engines on top of a KV 
store vs. an array of query engines supported by, and packaged as part of, a KV 
store.

Functionality-wise the two could be close, but Astro also supports Python as a 
result of its tight integration with Spark.
It will be interesting to see performance comparisons when HBASE-14181 is ready.

Thanks,


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, August 11, 2015 3:28 PM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
Subject: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

HBase will not have a query engine.

It will provide better support for query engines.

Cheers


On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
Ted,

I’m in China now, and seem to experience difficulty to access Apache Jira. 
Anyways, it appears to me  that 
HBASE-14181 attempts to 
support Spark DataFrame inside HBase.
If true, one question to me is whether HBase is intended to have a built-in 
query engine or not. Or it will stick with the current way as
a k-v store with some built-in processing capabilities in the forms of 
coprocessor, custom filter, …, etc., which allows for loosely-coupled query 
engines
built on top of it.

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: August 11, 2015 8:54
To: Bing Xiao (Bing)
Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
Subject: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan / Bing:
Mind taking a look at 
HBASE-14181 'Add Spark 
DataFrame DataSource to HBase-Spark Module' ?

Thanks

On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) <bing.x...@huawei.com> wrote:
We are happy to announce the availability of the Spark SQL on HBase 1.0.0 
release.  http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this package, dubbed “Astro”, include:

• Systematic and powerful handling of data pruning and intelligent 
scan, based on partial evaluation technique

• HBase pushdown capabilities like custom filters and coprocessor to 
support ultra low latency processing

• SQL, Data Frame support

• More SQL capabilities made possible (Secondary index, bloom filter, 
Primary Key, Bulk load, Update)

• Joins with data from other sources

• Python/Java/Scala support

• Support latest Spark 1.4.0 release


The tests by the Huawei team and community contributors covered the areas: bulk 
load; projection pruning; partition pruning; partial evaluation; code 
generation; coprocessor; custom filtering; DML; complex filtering on keys and 
non-keys; join/union with non-HBase data; Data Frame; multi-column family test. 
We will post the test results, including performance tests, in the middle of August.
You are very welcome to try out or deploy the package, and help improve the 
integration tests with various combinations of the settings, extensive Data 
Frame tests, complex join/union tests and extensive performance tests.  Please 
use the “Issues” and “Pull Requests” links at this package homepage if you want 
to report bugs, improvements or feature requests.
Special thanks to project owner and technical leader Yan Zhou, Huawei global 
team, community contributors and Databricks.   Databricks has been providing 
great assistance from the design to the release.
“Astro”, the Spark SQL on HBase package will be useful for ultra low latency 
query and analytics of large scale data sets in vertical enterprises. We will 
continue to work with the community to develop new features and improve code 
base.  Your comments and suggestions are greatly appreciated.

Yan Zhou / Bing Xiao
Huawei Big Data team




Re: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Yu
Yan:
Where can I find performance numbers for Astro (it's close to the middle of
August)?

Cheers

On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc  wrote:

> Finally I can take a look at HBASE-14181 now. Unfortunately there is no
> design doc mentioned. Superficially it is very similar to Astro with a
> difference of
>
> this being part of HBase client library; while Astro works as a Spark
> package so will evolve and function more closely with Spark SQL/Dataframe
> instead of HBase.
>
>
>
> In terms of architecture, my take is loosely-coupled query engines on top
> of KV store vs. an array of query engines supported by, and packaged as
> part of, a KV store.
>
>
>
> Functionality-wise the two could be close but Astro also supports Python
> as a result of tight integration with Spark.
>
> It will be interesting to see performance comparisons when HBase-14181 is
> ready.
>
>
>
> Thanks,
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Tuesday, August 11, 2015 3:28 PM
> *To:* Yan Zhou.sc
> *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"
>
>
>
> HBase will not have a query engine.
>
>
>
> It will provide better support for query engines.
>
>
>
> Cheers
>
>
> On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc  wrote:
>
> Ted,
>
>
>
> I’m in China now, and seem to experience difficulty to access Apache Jira.
> Anyways, it appears to me  that HBASE-14181
>  attempts to support
> Spark DataFrame inside HBase.
>
> If true, one question to me is whether HBase is intended to have a
> built-in query engine or not. Or it will stick with the current way as
>
> a k-v store with some built-in processing capabilities in the forms of
> coprocessor, custom filter, …, etc., which allows for loosely-coupled query
> engines
>
> built on top of it.
>
>
>
> Thanks,
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com ]
> *Sent:* August 11, 2015 8:54
> *To:* Bing Xiao (Bing)
> *Cc:* dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
> *Subject:* Re: Package Release Announcement: Spark SQL on HBase "Astro"
>
>
>
> Yan / Bing:
>
> Mind taking a look at HBASE-14181
>  'Add Spark DataFrame
> DataSource to HBase-Spark Module' ?
>
>
>
> Thanks
>
>
>
> On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) 
> wrote:
>
> We are happy to announce the availability of the Spark SQL on HBase 1.0.0
> release.
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>
> The main features in this package, dubbed “Astro”, include:
>
> · Systematic and powerful handling of data pruning and
> intelligent scan, based on partial evaluation technique
>
> · HBase pushdown capabilities like custom filters and coprocessor
> to support ultra low latency processing
>
> · SQL, Data Frame support
>
> · More SQL capabilities made possible (Secondary index, bloom
> filter, Primary Key, Bulk load, Update)
>
> · Joins with data from other sources
>
> · Python/Java/Scala support
>
> · Support latest Spark 1.4.0 release
>
>
>
> The tests by the Huawei team and community contributors covered the areas:
> bulk load; projection pruning; partition pruning; partial evaluation; code
> generation; coprocessor; custom filtering; DML; complex filtering on keys
> and non-keys; join/union with non-HBase data; Data Frame; multi-column
> family test.  We will post the test results, including performance tests, in
> the middle of August.
>
> You are very welcome to try out or deploy the package, and help improve
> the integration tests with various combinations of the settings, extensive
> Data Frame tests, complex join/union tests and extensive performance tests.
> Please use the “Issues” and “Pull Requests” links at this package homepage if
> you want to report bugs, improvements or feature requests.
>
> Special thanks to project owner and technical leader Yan Zhou, Huawei
> global team, community contributors and Databricks.   Databricks has been
> providing great assistance from the design to the release.
>
> “Astro”, the Spark SQL on HBase package will be useful for ultra low
> latency query and analytics of large scale data sets in vertical
> enterprises. We will continue to work with the community to develop
> new features and improve code base.  Your comments and suggestions are
> greatly appreciated.
>
>
>
> Yan Zhou / Bing Xiao
>
> Huawei Big Data team
>
>
>
>
>
>


RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
We have not “formally” published any numbers yet. A good reference is the slide 
deck we posted for the meetup in March, or better yet, interested parties can 
run performance comparisons by themselves for now.

As for the status quo of Astro, we have been focusing on fixing bugs (a 
UDF-related bug in some coprocessor/custom filter combos) and adding support for 
querying string columns in HBase as integers from Astro.

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, August 12, 2015 7:02 AM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
Subject: Re: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan:
Where can I find performance numbers for Astro (it's close to middle of August) 
?

Cheers

On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design 
doc mentioned. Superficially it is very similar to Astro with a difference of
this being part of HBase client library; while Astro works as a Spark package 
so will evolve and function more closely with Spark SQL/Dataframe instead of 
HBase.

In terms of architecture, my take is loosely-coupled query engines on top of KV 
store vs. an array of query engines supported by, and packaged as part of, a KV 
store.

Functionality-wise the two could be close but Astro also supports Python as a 
result of tight integration with Spark.
It will be interesting to see performance comparisons when HBase-14181 is ready.

Thanks,


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, August 11, 2015 3:28 PM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; 
u...@spark.apache.org
Subject: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

HBase will not have a query engine.

It will provide better support for query engines.

Cheers

On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
Ted,

I’m in China now, and seem to experience difficulty to access Apache Jira. 
Anyways, it appears to me  that 
HBASE-14181 attempts to 
support Spark DataFrame inside HBase.
If true, one question to me is whether HBase is intended to have a built-in 
query engine or not. Or it will stick with the current way as
a k-v store with some built-in processing capabilities in the forms of 
coprocessor, custom filter, …, etc., which allows for loosely-coupled query 
engines
built on top of it.

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: August 11, 2015 8:54
To: Bing Xiao (Bing)
Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
Subject: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan / Bing:
Mind taking a look at 
HBASE-14181 'Add Spark 
DataFrame DataSource to HBase-Spark Module' ?

Thanks

On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) <bing.x...@huawei.com> wrote:
We are happy to announce the availability of the Spark SQL on HBase 1.0.0 
release.  http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this package, dubbed “Astro”, include:

• Systematic and powerful handling of data pruning and intelligent 
scan, based on partial evaluation technique

• HBase pushdown capabilities like custom filters and coprocessor to 
support ultra low latency processing

• SQL, Data Frame support

• More SQL capabilities made possible (Secondary index, bloom filter, 
Primary Key, Bulk load, Update)

• Joins with data from other sources

• Python/Java/Scala support

• Support latest Spark 1.4.0 release


The tests by the Huawei team and community contributors covered the areas: bulk 
load; projection pruning; partition pruning; partial evaluation; code 
generation; coprocessor; custom filtering; DML; complex filtering on keys and 
non-keys; join/union with non-HBase data; Data Frame; multi-column family test. 
We will post the test results, including performance tests, in the middle of August.
You are very welcome to try out or deploy the package, and help improve the 
integration tests with various combinations of the settings, extensive Data 
Frame tests, complex join/union tests and extensive performance tests.  Please 
use the “Issues” and “Pull Requests” links at this package homepage if you want 
to report bugs, improvements or feature requests.
Special thanks to project owner and technical leader Yan Zhou, Huawei global 
team, community contributors and Databricks.   Databricks has been providing 
great assistance from the design to the release.
“Astro”, the Spark SQL on HBase package will be useful for ultra low latency 
query and analytics of large scale data sets in vertical enterprises. We will 
continue to work with th

RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
Hey Yan,

I've been the one building out this Spark functionality in HBase, so maybe I
can help clarify.

The hbase-spark module is just focused on making Spark integration with
HBase easy and out of the box for both Spark and Spark Streaming.

I, and I believe the HBase team, have no desire to build a SQL engine in
HBase.  This JIRA comes the closest to that line.  The main thing here is
filter push-down logic for basic SQL operations like =, >,
and <.  User-defined functions and secondary indexes are not in my scope.

Another main goal of the hbase-spark module is to allow a user to do
anything they did with MR/HBase now with Spark/HBase.  Things like bulk
load.

Let me know if you have any questions.

Ted Malaska
On Aug 11, 2015 7:13 PM, "Yan Zhou.sc"  wrote:

> We have not “formally” published any numbers yet. A good reference is a
> slide deck we posted for the meetup in March.
>
> , or better yet for interested parties to run performance comparisons by
> themselves for now.
>
>
>
> As for status quo of Astro, we have been focusing on fixing bugs
> (UDF-related bug in some coprocessor/custom filter combos), and add support
> of querying string columns in HBase as integers from Astro.
>
>
>
> Thanks,
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Wednesday, August 12, 2015 7:02 AM
> *To:* Yan Zhou.sc
> *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: Re: Re: Package Release Announcement: Spark SQL on HBase
> "Astro"
>
>
>
> Yan:
>
> Where can I find performance numbers for Astro (it's close to middle of
> August) ?
>
>
>
> Cheers
>
>
>
> On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc 
> wrote:
>
> Finally I can take a look at HBASE-14181 now. Unfortunately there is no
> design doc mentioned. Superficially it is very similar to Astro with a
> difference of
>
> this being part of HBase client library; while Astro works as a Spark
> package so will evolve and function more closely with Spark SQL/Dataframe
> instead of HBase.
>
>
>
> In terms of architecture, my take is loosely-coupled query engines on top
> of KV store vs. an array of query engines supported by, and packaged as
> part of, a KV store.
>
>
>
> Functionality-wise the two could be close but Astro also supports Python
> as a result of tight integration with Spark.
>
> It will be interesting to see performance comparisons when HBase-14181 is
> ready.
>
>
>
> Thanks,
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Tuesday, August 11, 2015 3:28 PM
> *To:* Yan Zhou.sc
> *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"
>
>
>
> HBase will not have a query engine.
>
>
>
> It will provide better support for query engines.
>
>
>
> Cheers
>
>
> On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc  wrote:
>
> Ted,
>
>
>
> I’m in China now, and seem to experience difficulty to access Apache Jira.
> Anyways, it appears to me  that HBASE-14181
>  attempts to support
> Spark DataFrame inside HBase.
>
> If true, one question to me is whether HBase is intended to have a
> built-in query engine or not. Or it will stick with the current way as
>
> a k-v store with some built-in processing capabilities in the forms of
> coprocessor, custom filter, …, etc., which allows for loosely-coupled query
> engines
>
> built on top of it.
>
>
>
> Thanks,
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com ]
> *Sent:* August 11, 2015 8:54
> *To:* Bing Xiao (Bing)
> *Cc:* dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
> *Subject:* Re: Package Release Announcement: Spark SQL on HBase "Astro"
>
>
>
> Yan / Bing:
>
> Mind taking a look at HBASE-14181
>  'Add Spark DataFrame
> DataSource to HBase-Spark Module' ?
>
>
>
> Thanks
>
>
>
> On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) 
> wrote:
>
> We are happy to announce the availability of the Spark SQL on HBase 1.0.0
> release.
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>
> The main features in this package, dubbed “Astro”, include:
>
> · Systematic and powerful handling of data pruning and
> intelligent scan, based on partial evaluation technique
>
> · HBase pushdown capabilities like custom filters and coprocessor
> to support ultra low latency processing
>
> · SQL, Data Frame support
>
> · More SQL capabilities made possible (Secondary index, bloom
> filter, Primary Key, Bulk load, Update)
>
> · Joins with data from other sources
>
> · Python/Java/Scala support
>
> · Support latest Spark 1.4.0 release
>
>
>
> The tests by the Huawei team and community contributors covered the areas:
> bulk load; projection pruning; partition pruning; partial evaluation; code
> generation; coprocessor; custom filtering; DML; complex filtering on keys
> and non-keys; join/union with non-HBase data; Data Frame; m

Re: Is OutputCommitCoordinator necessary for all the stages ?

2015-08-11 Thread Jeff Zhang
Hi Josh,

I mean on the driver side. OutputCommitCoordinator.startStage is called in
DAGScheduler#submitMissingTasks for all the stages (which costs some memory).
Although it is fine in that, as long as the executor side doesn't make the RPC,
there's not much of a performance penalty.

On Wed, Aug 12, 2015 at 12:17 AM, Josh Rosen  wrote:

> Can you clarify what you mean by "used for all stages"?
> OutputCommitCoordinator RPCs should only be initiated through
> SparkHadoopMapRedUtil.commitTask(), so while the OutputCommitCoordinator
> doesn't make a distinction between ShuffleMapStages and ResultStages there
> still should not be a performance penalty for this because the extra rounds
> of RPCs should only be performed when necessary.
>
>
> On 8/11/15 2:25 AM, Jeff Zhang wrote:
>
>> In my understanding, OutputCommitCoordinator should only be necessary for a
>> ResultStage (especially a ResultStage with an HDFS write), but currently it
>> is used for all stages. Is there any reason for that?
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Re: Sources/pom for org.spark-project.hive

2015-08-11 Thread Pala M Muthaia
Thanks for the pointers. Yes, I started with changing the hive.group
property in the pom and started seeing various dependency issues.

Initially I thought spark-project.hive was just a pom for uber jars that
pull in Hive classes without transitive dependencies like Kryo, but it looks
like a lot more changes are needed, including editing the sources.

We are looking at alternative approaches, since our customizations to Hive
are pretty limited and may not warrant the effort required here.

Thanks.

On Tue, Aug 11, 2015 at 2:29 PM, Steve Loughran 
wrote:

>
> On 11 Aug 2015, at 12:25, Pala M Muthaia 
> wrote:
>
> Hi,
>
> I am trying to make Spark SQL 1.4 work with our internal fork of Hive. We
> have some customizations in Hive (custom authorization, various hooks etc)
> that are all part of hive-exec.
>
> Given Spark's hive dependency is through org.spark-project.hive groupId,
> looks like i need to modify the definition of hive-exec artifact there to
> take dependency on our internal hive (vs org.apache.hive), and then
> everything else would flow through.
>
>
> you can just change the hive group definition in your spark build; that's
> the easy part
> org.spark-project.hive
>
> harder is getting a consistent kryo binding and any other
> shading/unshading. In SPARK-8064 we've moved Spark 1.5 to using Hive 1.2.1,
> but even there we had to patch hive to use the same Kryo version, and shade
> protobuf in hive-exec for everything to work on Hadoop 1.x.
>
>
> However, i am unable to find sources for org.spark-project.hive to make
> this change. Is it available? Otherwise, how can i proceed in this
> situation?
>
>
> Ted's pointed to the 0.13 code; the 1.2.1 is under
> https://github.com/pwendell/hive/commits/release-1.2.1-spark
>
> However: do not attempt to change Hive versions within a release; things are
> intertwined at the Spark SQL level and your code just won't work.
>
>
>
> Thanks,
> pala
>
>
>


RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
Ted,

Thanks for pointing out more details of HBASE-14181. I am afraid I may still 
need to learn more before I can make very accurate and pointed comments.

As for filter push down, Astro has a powerful approach to basically break down 
arbitrarily complex logic expressions comprising AND/OR/IN/NOT
to generate partition-specific predicates to be pushed down to HBase. This may 
not be a significant performance improvement if the filter logic is simple 
and/or the processing is IO-bound,
but could be so for online ad-hoc analysis.

For UDFs, Astro supports them both inside and outside of the HBase custom filter.

As for secondary indexes, Astro does not support them now. With probable support 
by HBase in the future (thanks to Ted Yu’s comments a while ago), we could add 
this support along with its specific optimizations.

For bulk load, Astro has a much faster way to load the tabular data, we believe.

Right now, Astro’s filter pushdown is through HBase built-in filters and the 
custom filter.

As for HBASE-14181, I see some overlaps with Astro. Both have dependences on 
Spark SQL, both support Spark DataFrame as an access interface, and both 
support predicate pushdown.
Astro is not designed for MR (or Spark’s equivalent) access, though.

If HBASE-14181 is shooting for access to HBase data through a subset of 
DataFrame functionalities like filter, projection, and other map-side ops, 
would it be feasible to decouple it from Spark?
My understanding is that HBASE-14181 does not run the Spark execution engine at 
all, but will make use of Spark DataFrame semantics and/or logical planning to 
pass a logical (sub-)plan to HBase. If true, it might
be desirable to directly support DataFrame in HBase.

Thanks,
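
To make the partial-evaluation idea above concrete, here is a toy sketch 
(illustrative only, not Astro's implementation; NOT and UDF handling are omitted): 
a predicate over an integer row key is simplified against each partition's key 
range, so every region only receives the residual filter it can still satisfy.

    object PredicatePruningSketch {
      sealed trait Pred
      case class Eq(v: Int) extends Pred
      case class Gt(v: Int) extends Pred
      case class In(vs: Set[Int]) extends Pred
      case class And(l: Pred, r: Pred) extends Pred
      case class Or(l: Pred, r: Pred) extends Pred
      case object True extends Pred
      case object False extends Pred

      // Simplify predicate `p` for a partition whose keys lie in [lo, hi].
      // e.g. prune(And(Gt(10), Or(Eq(3), Eq(50))), 40, 60) == Eq(50)
      def prune(p: Pred, lo: Int, hi: Int): Pred = p match {
        case Eq(v)  => if (v >= lo && v <= hi) Eq(v) else False
        case Gt(v)  => if (v >= hi) False else if (v < lo) True else Gt(v)
        case In(vs) =>
          val kept = vs.filter(v => v >= lo && v <= hi)
          if (kept.isEmpty) False else In(kept)
        case And(l, r) => (prune(l, lo, hi), prune(r, lo, hi)) match {
          case (False, _) | (_, False) => False
          case (True, x)               => x
          case (x, True)               => x
          case (a, b)                  => And(a, b)
        }
        case Or(l, r) => (prune(l, lo, hi), prune(r, lo, hi)) match {
          case (True, _) | (_, True) => True
          case (False, x)            => x
          case (x, False)            => x
          case (a, b)                => Or(a, b)
        }
        case leaf => leaf
      }
    }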


From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: Wednesday, August 12, 2015 7:28 AM
To: Yan Zhou.sc
Cc: user; dev@spark.apache.org; Bing Xiao (Bing); Ted Yu
Subject: RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"


Hey Yan,

I've been the one building out this spark functionality in hbase so maybe I can 
help clarify.

The hbase-spark module is just focused on making spark integration with hbase 
easy and out of the box for both spark and spark streaming.

I and I believe the hbase team has no desire to build a sql engine in hbase.  
This jira comes the closest to that line.  The main thing here is filter push 
down logic for basic sql operation like =, >
, and <.  User define functions and secondary indexes are not in my scope.

Another main goal of hbase-spark module is to be able to allow a user to do  
anything they did with MR/HBase now with Spark/Hbase.  Things like bulk load.

Let me know if u have any questions

Ted Malaska
On Aug 11, 2015 7:13 PM, "Yan Zhou.sc" <yan.zhou...@huawei.com> wrote:
We have not “formally” published any numbers yet. A good reference is a slide 
deck we posted for the meetup in March.
, or better yet for interested parties to run performance comparisons by 
themselves for now.

As for status quo of Astro, we have been focusing on fixing bugs (UDF-related 
bug in some coprocessor/custom filter combos), and add support of querying 
string columns in HBase as integers from Astro.

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, August 12, 2015 7:02 AM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; 
u...@spark.apache.org
Subject: Re: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan:
Where can I find performance numbers for Astro (it's close to middle of August) 
?

Cheers

On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design 
doc mentioned. Superficially it is very similar to Astro with a difference of
this being part of HBase client library; while Astro works as a Spark package 
so will evolve and function more closely with Spark SQL/Dataframe instead of 
HBase.

In terms of architecture, my take is loosely-coupled query engines on top of KV 
store vs. an array of query engines supported by, and packaged as part of, a KV 
store.

Functionality-wise the two could be close but Astro also supports Python as a 
result of tight integration with Spark.
It will be interesting to see performance comparisons when HBase-14181 is ready.

Thanks,


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, August 11, 2015 3:28 PM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; 
u...@spark.apache.org
Subject: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

HBase will not have a query engine.

It will provide better support for query engines.

Cheers

On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
Ted,

I’m in China now, and seem to experience difficulty to access Apache Jira. 
Anyways, it 

RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
The bulk load code is HBASE-14150 if you are interested.  Let me know how it
can be made faster.

It's just a Spark shuffle plus writing HFiles.   Unless Astro wrote its own
shuffle, the times should be very close.
On Aug 11, 2015 8:49 PM, "Yan Zhou.sc"  wrote:

> Ted,
>
>
>
> Thanks for pointing out more details of HBase-14181. I am afraid I may
> still need to learn more before I can make very accurate and pointed
> comments.
>
>
>
> As for filter pushdown, Astro has a powerful approach to break down
> arbitrarily complex logic expressions comprising AND/OR/IN/NOT
> to generate partition-specific predicates to be pushed down to HBase. This
> may not be a significant performance improvement if the filter logic is
> simple and/or the processing is IO-bound,
> but could be so for online ad-hoc analysis.
>
>
>
> For UDFs, Astro supports them both inside and outside the HBase custom filter.
>
>
>
> For secondary indexes, Astro does not support them now. With the probable
> support by HBase in the future (thanks to Ted Yu’s comments a while ago), we
> could add this support along with its specific optimizations.
>
>
>
> For bulk load, Astro has a much faster way to load the tabular data, we
> believe.
>
>
>
> Right now, Astro’s filter pushdown is through HBase built-in filters and a
> custom filter.
>
>
>
> As for HBase-14181, I see some overlaps with Astro. Both have dependencies
> on Spark SQL, both support Spark DataFrame as an access interface, and
> both support predicate pushdown.
>
> Astro is not designed for MR (or Spark’s equivalent) access though.
>
>
>
> If HBase-14181 is shooting for access to HBase data through a subset of
> DataFrame functionalities like filter, projection, and other map-side ops,
> would it be feasible to decouple it from Spark?
>
> My understanding is that 14181 does not run the Spark execution engine at all,
> but will make use of Spark DataFrame semantics and/or logical planning to pass
> a logical (sub-)plan to HBase. If true, it might
> be desirable to directly support DataFrame in HBase.
>
>
>
> Thanks,

RE: 答复: 答复: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
No, the Astro bulk loader does not use its own shuffle, but its map/reduce-side
processing is somewhat different from HBase’s bulk loader that is used by many
HBase apps, I believe.

From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: Wednesday, August 12, 2015 8:56 AM
To: Yan Zhou.sc
Cc: dev@spark.apache.org; Ted Yu; Bing Xiao (Bing); user
Subject: RE: 答复: 答复: Package Release Annoucement: Spark SQL on HBase "Astro"


The bulk load code is 14150 if you are interested. Let me know how it can be
made faster.

It's just a spark shuffle and writing HFiles. Unless Astro wrote its own
shuffle, the times should be very close.

RE: 答复: 答复: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
There are a number of ways to bulk load.

There is bulk put, partitioned bulk put, MR bulk load, and now hbase-14150,
which is spark shuffle bulk load.

Let me know if I have missed a bulk loading option. All of these are possible
with the new hbase-spark module.

As for the filter pushdown discussion in the previous email: you will note in
14181 that the filter pushdown will also limit the scan range, or drop the scan
altogether in favor of Gets.

Ted Malaska
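
To illustrate the "limit the scan range or drop the scan in favor of Gets" idea
in isolation, here is a toy sketch using only the standard HBase client classes.
The KeyPredicate type is made up for this example; it is not the hbase-spark
(or Astro) API:

import org.apache.hadoop.hbase.client.{Get, Scan}
import org.apache.hadoop.hbase.util.Bytes

sealed trait KeyPredicate
case class KeyEquals(key: String) extends KeyPredicate
case class KeyBetween(start: String, stop: String) extends KeyPredicate

object PushdownToScan {
  // Map a simple row-key predicate to the cheapest HBase read operation.
  def toOperation(p: KeyPredicate): Either[Get, Scan] = p match {
    case KeyEquals(k) =>
      // Equality on the full row key needs no scan at all: a point Get.
      Left(new Get(Bytes.toBytes(k)))
    case KeyBetween(start, stop) =>
      // A range predicate bounds the scan instead of touching every region.
      Right(new Scan(Bytes.toBytes(start), Bytes.toBytes(stop)))
  }

  def main(args: Array[String]): Unit = {
    println(toOperation(KeyEquals("row42")))
    println(toOperation(KeyBetween("row100", "row200")))
  }
}
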
On Aug 11, 2015 9:06 PM, "Yan Zhou.sc"  wrote:

> No, Astro bulkloader does not use its own shuffle. But map/reduce-side
> processing is somewhat different from HBase’s bulk loader that are used by
> many HBase apps I believe.
>

RE: Potential bug broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread Cheng, Hao
Firstly, spark.sql.autoBroadcastJoinThreshold only works for equi-joins.

Currently, for non-equi joins, an INNER join will be done by a CartesianProduct
join, and BroadcastNestedLoopJoin handles the outer joins.

In BroadcastNestedLoopJoin, the table with the smaller estimated size will be
broadcast, but if the smaller table is also huge, I don’t think Spark SQL can
handle that right now (OOM).

So, I am not sure how you created the df1 instance, but we had better reflect
its real size in its statistics and let the framework decide what to do.
Hopefully Spark SQL can support non-equi joins on large tables in the next
release.

Hao
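
As a concrete way to see what the planner is working with, the sketch below
(Spark 1.4-era API; the data and column names are made up) prints the size
estimate that gets compared against spark.sql.autoBroadcastJoinThreshold and
shows the physical plan chosen for a non-equi outer join:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinSizeEstimates {
  case class Rec(id: Int, v: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-size-estimates"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // DataFrames backed by existing RDDs: their size is estimated as
    // defaultSizeInBytes * number of rows, not from any real statistics.
    val df1 = sc.parallelize(1 to 10).map(i => Rec(i, s"v$i")).toDF()
    val df2 = sc.parallelize(1 to 10).map(i => Rec(i, s"w$i")).toDF()

    // The estimate the planner compares against autoBroadcastJoinThreshold.
    println(df1.queryExecution.analyzed.statistics.sizeInBytes)

    // A non-equi outer join is planned as BroadcastNestedLoopJoin; the side
    // with the smaller estimate is the one that gets broadcast (collected).
    val joined = df1.join(df2, df1("id") < df2("id"), "left_outer")
    joined.explain(true)
  }
}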

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Tuesday, August 11, 2015 10:12 PM
To: dev@spark.apache.org
Subject: Potential bug broadcastNestedLoopJoin or default value of 
spark.sql.autoBroadcastJoinThreshold

Hi,

Recently, I used Spark SQL to do a join on a non-equality condition, for
example condition1 OR condition2.

Spark will use BroadcastNestedLoopJoin to do this. Assume that one of the
dataframes (df1) is created neither from a hive table nor from a local
collection, and the other one (df2) is created from a hive table. For df1,
spark will use defaultSizeInBytes * length of df1 to estimate its size, and
will use the correct size for df2.

As a result, in most cases spark will think df1 is bigger than df2 even when
df2 is really huge, and spark will do df2.collect(), which will cause errors or
slow the program down.

Maybe we should just use defaultSizeInBytes for LogicalRDD, not
defaultSizeInBytes * length?

Hope this could be helpful
Thanks a lot in advance for your help and input.

Cheers
Gen



答复: 答复: 答复: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Yan Zhou.sc
We are using MR-based bulk loading on Spark.

For filter pushdown, Astro does partition pruning and scan range pruning, and
uses Gets as much as possible.

Thanks,
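
As an aside for readers, the partition pruning / scan range pruning idea can be
illustrated with a self-contained toy. This is only an illustration of partial
evaluation of a predicate over a key range, not Astro's code; the predicate type
and the ranges are made up:

sealed trait Pred
case class Lt(v: Int) extends Pred            // key < v
case class Gt(v: Int) extends Pred            // key > v
case class Eq(v: Int) extends Pred            // key = v
case class In(vs: Set[Int]) extends Pred      // key IN (...)
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred
case class Not(p: Pred) extends Pred

object PartitionPruner {
  // Conservative test: can any key in [lo, hi] possibly satisfy the predicate?
  // If this returns false, the partition can be skipped entirely.
  def mayMatch(p: Pred, lo: Int, hi: Int): Boolean = p match {
    case Lt(v)     => lo < v
    case Gt(v)     => hi > v
    case Eq(v)     => lo <= v && v <= hi
    case In(vs)    => vs.exists(v => lo <= v && v <= hi)
    case And(l, r) => mayMatch(l, lo, hi) && mayMatch(r, lo, hi)
    case Or(l, r)  => mayMatch(l, lo, hi) || mayMatch(r, lo, hi)
    case Not(q)    => !mustMatchAll(q, lo, hi)
  }

  // Conservative test: does every key in [lo, hi] satisfy the predicate?
  def mustMatchAll(p: Pred, lo: Int, hi: Int): Boolean = p match {
    case Lt(v)     => hi < v
    case Gt(v)     => lo > v
    case Eq(v)     => lo == v && hi == v
    case In(vs)    => (lo to hi).forall(vs.contains)
    case And(l, r) => mustMatchAll(l, lo, hi) && mustMatchAll(r, lo, hi)
    case Or(l, r)  => mustMatchAll(l, lo, hi) || mustMatchAll(r, lo, hi)
    case Not(q)    => !mayMatch(q, lo, hi)
  }

  def main(args: Array[String]): Unit = {
    // Three partitions by key range, and the predicate
    // key > 150 AND (key < 180 OR key = 250).
    val partitions = Seq((0, 99), (100, 199), (200, 299))
    val pred = And(Gt(150), Or(Lt(180), Eq(250)))
    partitions.foreach { case (lo, hi) =>
      println(s"partition [$lo,$hi] needs a scan: ${mayMatch(pred, lo, hi)}")
    }
  }
}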


From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: August 12, 2015 9:14
To: Yan Zhou.sc
Cc: dev@spark.apache.org; Bing Xiao (Bing); Ted Yu; user
Subject: RE: 答复: 答复: Package Release Annoucement: Spark SQL on HBase "Astro"


There are a number of ways to bulk load.

There is bulk put, partitioned bulk put, MR bulk load, and now hbase-14150,
which is spark shuffle bulk load.

Let me know if I have missed a bulk loading option. All of these are possible
with the new hbase-spark module.

As for the filter pushdown discussion in the previous email: you will note in
14181 that the filter pushdown will also limit the scan range, or drop the scan
altogether in favor of Gets.

Ted Malaska

Re: Potential bug broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread gen tang
Hi,

Thanks a lot.

The problem is not doing a non-equi join on large tables; in fact, one table
is really small and the other one is huge.

The problem is that spark can only get the correct size for a dataframe
created directly from a hive table. Even if we create a dataframe from a local
collection, it uses defaultSizeInBytes as its size. (Here I am really
confused: why don't we use LogicalLocalTable in ExistingRDD.scala to
estimate its size? As I understand it, this case class was created for this
purpose.)

Then if we do some join or unionAll operation on this dataframe, the
estimated size will explode.

For instance, if we do a join, val df = df1.join(df2, condition), then
df.queryExecution.analyzed.statistics.sizeInBytes is the product of the two
input size estimates.

In my case, I created the df1 instance from an existing RDD.

I found a workaround for this problem:
1. save df1 as a hive table
2. read this hive table and create a new dataframe
3. do the outer join with this new dataframe

Cheers
Gen
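
A minimal sketch of this workaround (Spark 1.4-era API; the table and column
names are hypothetical, and "big_table" is assumed to already exist in the Hive
metastore):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SizeEstimateWorkaround {
  case class Small(id: Int, label: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("size-workaround"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // The RDD-backed DataFrame whose size estimate would otherwise explode.
    val df1 = sc.parallelize(1 to 100).map(i => Small(i, s"label$i")).toDF()

    // 1. Save df1 as a table so its statistics come from the files on disk.
    df1.write.saveAsTable("tmp_small_table")

    // 2. Re-read it as a new DataFrame.
    val small = hiveContext.table("tmp_small_table")

    // 3. Do the non-equi outer join against the huge table with the re-read side.
    val big = hiveContext.table("big_table")
    val joined = big.join(small, big("key") >= small("id"), "left_outer")
    joined.explain(true)
  }
}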

On Wed, Aug 12, 2015 at 10:12 AM, Cheng, Hao  wrote:

> Firstly, spark.sql.autoBroadcastJoinThreshold only works for equi-joins.
>
> Currently, for non-equi joins, an INNER join will be done by a CartesianProduct
> join, and BroadcastNestedLoopJoin handles the outer joins.
>
> In BroadcastNestedLoopJoin, the table with the smaller estimated size will
> be broadcast, but if the smaller table is also huge, I don’t
> think Spark SQL can handle that right now (OOM).
>
> So, I am not sure how you created the df1 instance, but we had better
> reflect its real size in its statistics and let the framework
> decide what to do. Hopefully Spark SQL can support non-equi joins on
> large tables in the next release.
>
>
>
> Hao