Re: Scaling partitioned Hive table support

2016-08-09 Thread Michael Allman
Hi Eric,

I've rebased my first patch to master and created a Jira issue for tracking: 
https://issues.apache.org/jira/browse/SPARK-16980. As mentioned in the issue,
I will open a PR for discussion and design review, and include you in the 
conversation.

Cheers,

Michael


> On Aug 8, 2016, at 12:51 PM, Eric Liang wrote:
> 
> I like the former approach -- it seems more generally applicable to other 
> catalogs and IIUC would let you defer pruning until execution time. Pruning 
> is work that should be done by the catalog anyways, as is the case when 
> querying over an (unconverted) hive table.
> 
> You might also want to look at https://github.com/apache/spark/pull/14241,
> which refactors some of the file scan execution to defer pruning.
> 
> 
> On Mon, Aug 8, 2016, 11:53 AM Michael Allman wrote:
> Hello,
> 
> I'd like to propose a modification in the way Hive table partition metadata 
> are loaded and cached. Currently, when a user reads from a partitioned Hive 
> table whose metadata are not cached (and for which Hive table conversion is 
> enabled and supported), all partition metadata is fetched from the metastore:
> 
> https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260
> 
> This is highly inefficient in some scenarios. In the most extreme case, a 
> user starts a new Spark app, runs a query which reads from a single partition 
> in a table with a large number of partitions and terminates their app. All 
> partition metadata are loaded and their file schemas are merged, but only a
> single partition is read. Instead, I propose we load and cache partition 
> metadata on-demand, as needed to build query plans.
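> 
> A minimal sketch of the on-demand idea (the names below are hypothetical
> stand-ins, not the actual Spark internals): partition metadata is fetched
> from the metastore only for the partitions a plan needs, keyed and cached
> by the pruning predicates rather than loaded wholesale.
> 
>     import scala.collection.mutable
> 
>     // Hypothetical stand-ins for the real catalog/metastore types.
>     case class PartitionMetadata(spec: Map[String, String], location: String)
> 
>     trait MetastoreSource {
>       // Ask the metastore for only the partitions matching the predicates.
>       def listPartitionsByFilter(
>           table: String,
>           predicates: Seq[String]): Seq[PartitionMetadata]
>     }
> 
>     class OnDemandPartitionCache(source: MetastoreSource) {
>       private val cache =
>         mutable.Map.empty[(String, Seq[String]), Seq[PartitionMetadata]]
> 
>       // Load and cache only the partition metadata a query plan actually
>       // needs, instead of all partitions at relation-conversion time.
>       def partitionsFor(
>           table: String,
>           predicates: Seq[String]): Seq[PartitionMetadata] =
>         cache.getOrElseUpdate(
>           (table, predicates),
>           source.listPartitionsByFilter(table, predicates))
>     }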
> 
> We've long encountered this performance problem at VideoAmp and have taken 
> different approaches to address it. In addition to the load time, we've found 
> that loading all of a table's partition metadata can require a significant 
> amount of JVM heap space. Our largest tables OOM our Spark drivers unless we 
> allocate several GB of heap space.
> 
> Certainly one could argue that our situation is pathological and rare, and 
> that the problem in our scenario is with the design of our tables—not Spark. 
> However, even in tables with more modest numbers of partitions, loading only 
> the necessary partition metadata and file schema can significantly reduce the 
> query planning time, and is definitely more memory efficient.
> 
> I've written POCs for a couple of different implementation approaches. Though 
> incomplete, both have been successful in their basic goal. The first extends 
> `org.apache.spark.sql.catalyst.catalog.ExternalCatalog` and as such is more 
> general. It requires some new abstractions and refactoring of 
> `HadoopFsRelation` and `FileCatalog`, among others. It places a greater 
> burden on other implementations of `ExternalCatalog`. Currently the only 
> other implementation of `ExternalCatalog` is `InMemoryCatalog`, and my code 
> throws an `UnsupportedOperationException` on that implementation.
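> 
> For concreteness, a hedged sketch of the kind of hook this first approach
> implies (the trait and method names here are illustrative, not the POC
> code): the catalog gains a predicate-aware partition listing, and an
> implementation that cannot prune on the metastore side declines it.
> 
>     import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
>     import org.apache.spark.sql.catalyst.expressions.Expression
> 
>     // Illustrative only: a predicate-aware listing alongside ExternalCatalog.
>     trait PartitionPruningCatalog {
>       // Return only the partitions of db.table satisfying the pruning predicates.
>       def listPartitionsByFilter(
>           db: String,
>           table: String,
>           predicates: Seq[Expression]): Seq[CatalogTablePartition]
>     }
> 
>     // A catalog without metastore-side pruning (as with `InMemoryCatalog` in
>     // the POC) can simply refuse, as described above.
>     class UnsupportedPruningCatalog extends PartitionPruningCatalog {
>       override def listPartitionsByFilter(
>           db: String,
>           table: String,
>           predicates: Seq[Expression]): Seq[CatalogTablePartition] =
>         throw new UnsupportedOperationException("listPartitionsByFilter")
>     }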
> 
> The other approach is simpler and only touches code in the codebase's `hive` 
> project. Basically, conversion of `MetastoreRelation` to `HadoopFsRelation` 
> is deferred to physical planning when the metastore relation is partitioned. 
> During physical planning, the partition pruning filters in a logical query 
> plan are used to select the required partition metadata and a 
> `HadoopFsRelation` is built from those. The new logical plan is then 
> re-injected into the planner.
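> 
> A rough sketch of the filter-splitting step that approach relies on (again
> hypothetical, not the POC itself): conjuncts referencing only partition
> columns can be sent to the metastore to select partition metadata, while
> the remaining predicates stay in the plan.
> 
>     import org.apache.spark.sql.catalyst.expressions.{Expression, PredicateHelper}
> 
>     // Hypothetical helper: split a filter condition into predicates usable
>     // for metastore-side partition pruning and predicates that must remain
>     // in the query plan.
>     object PartitionFilterSplitter extends PredicateHelper {
>       def split(
>           condition: Expression,
>           partitionColumns: Set[String]): (Seq[Expression], Seq[Expression]) =
>         splitConjunctivePredicates(condition).partition { pred =>
>           pred.references.nonEmpty &&
>             pred.references.forall(a => partitionColumns.contains(a.name))
>         }
>     }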
> 
> I'd like to get the community's thoughts on my proposal and implementation 
> approaches.
> 
> Thanks!
> 
> Michael



Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
alrighty then!

bcc'ing user list.  cc'ing dev list.

@user list people:  do not read any further or you will be in violation of
ASF policies!

On Tue, Aug 9, 2016 at 11:50 AM, Mark Hamstra wrote:

> That's not going to happen on the user list, since that is against ASF
> policy (http://www.apache.org/dev/release.html):
>
> During the process of developing software and preparing a release, various
>> packages are made available to the developer community for testing
>> purposes. Do not include any links on the project website that might
>> encourage non-developers to download and use nightly builds, snapshots,
>> release candidates, or any other similar package. The only people who
>> are supposed to know about such packages are the people following the dev
>> list (or searching its archives) and thus aware of the conditions placed on
>> the package. If you find that the general public are downloading such test
>> packages, then remove them.
>>
>
> On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly wrote:
>
>> this is a valid question.  there are many people building products and
>> tooling on top of spark and would like access to the latest snapshots and
>> such.  today's ink is yesterday's news to these people - including myself.
>>
>> what is the best way to get snapshot releases including nightly and
>> specially-blessed "preview" releases so that we, too, can say "try the
>> latest release in our product"?
>>
>> there was a lot of chatter during the 2.0.0/2.0.1 release that i largely
>> ignored because of conflicting/confusing/changing responses.  and i'd
>> rather not dig through jenkins builds to figure this out as i'll likely get
>> it wrong.
>>
>> please provide the relevant snapshot/preview/nightly/whatever repos (or
>> equivalent) that we need to include in our builds to have access to the
>> absolute latest build assets for every major and minor release.
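>>
>> for what it's worth, a hedged sketch of wiring that into an sbt build,
>> assuming the usual ASF Nexus snapshot repository and a -SNAPSHOT version
>> string -- both the URL and the version here are assumptions, not an
>> official pointer:
>>
>>     // build.sbt -- illustrative only; repository URL and version are assumptions.
>>     resolvers +=
>>       "ASF Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"
>>
>>     libraryDependencies +=
>>       "org.apache.spark" %% "spark-sql" % "2.1.0-SNAPSHOT" % "provided"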
>>
>> thanks!
>>
>> -chris
>>
>>
>> On Tue, Aug 9, 2016 at 10:00 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> LOL
>>>
>>> Ink has not dried on Spark 2 yet so to speak :)
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 9 August 2016 at 17:56, Mark Hamstra wrote:
>>>
 What are you expecting to find?  There currently are no releases beyond
 Spark 2.0.0.

 On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote:

> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I can't
> find the newer versions on Maven central.
>
> Thank you!
> Jestin
>


>>>
>>
>>
>> --
>> *Chris Fregly*
>> Research Scientist @ PipelineIO
>> San Francisco, CA
>> pipeline.io
>> advancedspark.com
>>
>>
>


-- 
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com