Hi again,

I finally got around to creating a Wiki page: <
https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables
>

@Sergey: What are "SH tables"?

Any more additions to the content are welcome. I'd be happy to put them in
the Wiki too. All I need are a few cues/keywords.

Thanks!
Lars

On Thu, Aug 23, 2018 at 12:22 AM Lars Francke <lars.fran...@gmail.com>
wrote:

> Thanks for the input here and on the issue itself.
>
> @Sergey: That is a good summary. I would try to distill that into a Wiki
> page.
>
>  I still think that it'd be good to list the actual technical differences.
>
> 1) Because this is not actually written down in a concise way so lots of
> our users don't know this distinction. I for example have heard all kinds
> of "rumors"
>
> 2) I've been using Hive for 8+ years or so and I use it almost weekly and
> follow the dev list and Jira issues and I still missed about 80% of the
> features that have been implemented in the last few years affecting this.
> And these things can change a decision. Back when the only difference was
> whether a DROP would delete data it was much simpler. Now you might opt for
> an internal table even if it does not match your ETL flows etc. because it
> gives you better performance or some other feature.
>
> 3) It's a pity to have all these features not being used because no one
> outside of a small circle knows about them :)
>
> So I'd still appreciate any and all input you might have for me.
>
> Cheers,
> Lars
>
>
> On Wed, Aug 22, 2018 at 9:45 PM, Sergey Shelukhin <ser...@hortonworks.com>
> wrote:
>
>> I think it’s very simple conceptually (well the way I’ve been thinking
>> about it), and any limitations can be derived from this using logic. If
>> someone has specific questions we can clarify; if some limitations are
>> missing (e.g. when Hive was using stats for hbase tables) it is a bug.
>>
>> For managed tables, Hive owns and manages the data:
>>
>> * Data, properties, etc. can only be changed via Hive commands or APIs;
>> Hive assumes it’s true and any other modification on the underlying
>> storage is undefined behavior (not just results - basically all bets are
>> off ;)).
>> * Data is attached to the Hive entity (partition, table). Whatever happens
>> to the entity happens to the data (e.g. the behavior mentioned in the
>> wiki, where  dropping the table/partition deletes the data).
>>
>>
>> For external tables (also SH tables before Hive made those explicitly
>> external), Hive doesn’t manage or own the data:
>>
>> * Hive assumes the data can be modified externally at any time. Any
>> feature that requires assumptions about data being unchanged (queries from
>> stats, ACID, some MV, query cache) won’t work. If it works, it’s a bug.
>> * Data is detached from Hive entities, e.g. dropping the table by default
>> doesn’t delete the data - we assume someone else owns it and we are just
>> pointing at it.
>> * Additionally, since Hive cannot track modifications, external
>> products/users/etc. are fully responsible for any concurrency control. In
>> particular, modifying data under the running query may produce
>> inconsistent results (no atomicity between tasks), undefined results due
>> to retries, etc.
>>
>>
>>
>>
>>
>> On 18/8/21, 10:31, "Mithun RK" <mythro...@gmail.com> wrote:
>>
>> >+1. I'm out of the loop as well. I'd be keen on grokking what's not
>> >available with EXTERNAL tables, given that these are popular on our
>> >clusters as well. It would be good to have that documented in one place.
>> >
>> >On Tue, Aug 21, 2018 at 8:19 AM Lars Francke <lars.fran...@gmail.com>
>> >wrote:
>> >
>> >> As this came up with yet another customer and I've heard more things
>> >>that I
>> >> believe to be false:
>> >>
>> >> Please, anyone have any kind of feedback for me here? Is my list
>> >>correct,
>> >> can you add stuff to it, have more details to certain points etc.?
>> >>
>> >> Thank you!
>> >>
>> >> On Wed, Aug 1, 2018 at 12:17 PM, Lars Francke <lars.fran...@gmail.com>
>> >> wrote:
>> >>
>> >> > I have opened https://issues.apache.org/jira/browse/HIVE-20287 and
>> >>would
>> >> > love to get feedback
>> >> >
>> >> > On Wed, Aug 1, 2018 at 11:14 AM, Lars Francke <
>> lars.fran...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Thanks Lefty! I haven't created a JIRA but it's on my list.
>> >> >>
>> >> >> Still interested in any input anyone might have?
>> >> >>
>> >> >> Some of you guys _must_ have some knowledge on this stuff.
>> >> >>
>> >> >> You're correct that the DROP stuff is documented. What I meant is
>> >>that I
>> >> >> couldn't find the relevant parts in the code.
>> >> >>
>> >> >> On Sat, Jul 21, 2018 at 6:15 AM, Lefty Leverenz <
>> >> leftylever...@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >>> Agreed, the Hive wiki is woefully incomplete.  But it does mention
>> >>DROP
>> >> >>> for
>> >> >>> external tables in two sections of the DDL doc:
>> >> >>>
>> >> >>>    - External Tables
>> >> >>>    <https://cwiki.apache.org/confluence/display/Hive/LanguageM
>> >> >>> anual+DDL#LanguageManualDDL-ExternalTables>:
>> >> >>>     "When dropping an EXTERNAL table, data in the table is NOT
>> >>deleted
>> >> >>> from
>> >> >>>    the file system."
>> >> >>>    - Drop Table
>> >> >>>    <https://cwiki.apache.org/confluence/display/Hive/LanguageM
>> >> >>> anual+DDL#LanguageManualDDL-DropTable>:
>> >> >>>     "When dropping an EXTERNAL table, data in the table will NOT be
>> >> >>> deleted
>> >> >>>    from the file system."
>> >> >>>
>> >> >>> A wiki page comparing managed and external tables would be very
>> >> helpful,
>> >> >>> and the list is a good start.  I suggest you open a JIRA issue.
>> >> >>>
>> >> >>> Who can help Lars with this task?
>> >> >>>
>> >> >>> -- Lefty
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Jul 19, 2018 at 3:04 PM Lars Francke
>> >><lars.fran...@gmail.com>
>> >> >>> wrote:
>> >> >>>
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > I've been wondering if anyone can tell me what the differences
>> and
>> >> >>> > limitations of managed vs. external tables are these days.
>> >> >>> >
>> >> >>> > I've seen the docs from Hortonworks[1] that list these features
>> as
>> >> not
>> >> >>> > supported by External tables:
>> >> >>> > * Query cache
>> >> >>> > * Materialized views, except in a limited way
>> >> >>> > * Default statistics gathering
>> >> >>> > * Compute queries using statistics
>> >> >>> > * Automatic runtime filtering
>> >> >>> > * File merging after insert
>> >> >>> >
>> >> >>> > But there are no details. For some of them I'm not even sure what
>> >> they
>> >> >>> are
>> >> >>> > supposed to mean because I couldn't find any documentation in our
>> >> Wiki
>> >> >>> :(
>> >> >>> >
>> >> >>> > So I'd love to create a Wiki page that describes the differences
>> >> >>> between
>> >> >>> > managed & external tables and list relevant JIRAs but I need your
>> >> help
>> >> >>> for
>> >> >>> > that.
>> >> >>> >
>> >> >>> > I scanned the code for all references to TableType and this is
>> the
>> >> >>> list I
>> >> >>> > found but I'm almost certainly missing something:
>> >> >>> >
>> >> >>> > * ARCHIVE/UNARCHIVE - DDLTask - Only works for managed tables
>> >> >>> > * TRUNCATE - DDLSemanticAnalyzer - Only works for managed tables
>> >> >>> > * MERGE/CONCATENATE - HiveRelOpMaterializationValidator - Only
>> >>works
>> >> >>> for
>> >> >>> > managed tables
>> >> >>> > * Constraints - DDLSemanticAnalyzer -  (NOT NULL, DEFAULT, CHECK,
>> >> only
>> >> >>> RELY
>> >> >>> > ist allowed)
>> >> >>> > * IMPORT - ImportSemanticAnalyzer - This has some wild
>> >>restrictions I
>> >> >>> > didn't follow for external tables
>> >> >>> > * Query Results Caching - https://issues.apache.org/jira
>> >> >>> /browse/HIVE-18513
>> >> >>> > SemanticAnalyzer - Documentation missing for Results Cache
>> >> >>> >
>> >> >>> > So there's a bunch of those from the Hortonworks list missing. If
>> >> >>> anyone
>> >> >>> > could point me to documentation or code for those that'd be
>> great.
>> >> >>> >
>> >> >>> > I also didn't find the most obvious/oldest one: Data is not
>> >>deleted
>> >> for
>> >> >>> > EXTERNAL tables on DROP, I'm sure I just missed it.
>> >> >>> >
>> >> >>> > And last but not least: I have not looked at the grammar to see
>> >> what's
>> >> >>> > already forbidden at that level.
>> >> >>> >
>> >> >>> > As I said: Any hints would be greatly appreciated.
>> >> >>> >
>> >> >>> > Thank you!
>> >> >>> >
>> >> >>> > Lars
>> >> >>> >
>> >> >>> > A side note: There are so many fantastic features in Hive but
>> >>because
>> >> >>> lots
>> >> >>> > of them are not or under-documented most people I know just stick
>> >>to
>> >> >>> the
>> >> >>> > basics. That's a pity :(
>> >> >>> >
>> >> >>> > [1] <
>> >> >>> >
>> >> >>> > https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/usi
>> >> >>> ng-hiveql/content/hive_hive_3_tables.html
>> >> >>> > >
>> >> >>> >
>> >> >>>
>> >> >>
>> >> >>
>> >> >
>> >>
>>
>>
>

Reply via email to