Hi again, I finally got around to creating a Wiki page: < https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables >
@Sergey: What are "SH tables"? Any more additions to the content are welcome. I'd be happy to put them in the Wiki too. All I need are a few cues/keywords. Thanks! Lars On Thu, Aug 23, 2018 at 12:22 AM Lars Francke <lars.fran...@gmail.com> wrote: > Thanks for the input here and on the issue itself. > > @Sergey: That is a good summary. I would try to distill that into a Wiki > page. > > I still think that it'd be good to list the actual technical differences. > > 1) Because this is not actually written down in a concise way so lots of > our users don't know this distinction. I for example have heard all kinds > of "rumors" > > 2) I've been using Hive for 8+ years or so and I use it almost weekly and > follow the dev list and Jira issues and I still missed about 80% of the > features that have been implemented in the last few years affecting this. > And these things can change a decision. Back when the only difference was > whether a DROP would delete data it was much simpler. Now you might opt for > an internal table even if it does not match your ETL flows etc. because it > gives you better performance or some other feature. > > 3) It's a pity to have all these features not being used because no one > outside of a small circle knows about them :) > > So I'd still appreciate any and all input you might have for me. > > Cheers, > Lars > > > On Wed, Aug 22, 2018 at 9:45 PM, Sergey Shelukhin <ser...@hortonworks.com> > wrote: > >> I think it’s very simple conceptually (well the way I’ve been thinking >> about it), and any limitations can be derived from this using logic. If >> someone has specific questions we can clarify; if some limitations are >> missing (e.g. when Hive was using stats for hbase tables) it is a bug. >> >> For managed tables, Hive owns and manages the data: >> >> * Data, properties, etc. can only be changed via Hive commands or APIs; >> Hive assumes it’s true and any other modification on the underlying >> storage is undefined behavior (not just results - basically all bets are >> off ;)). >> * Data is attached to the Hive entity (partition, table). Whatever happens >> to the entity happens to the data (e.g. the behavior mentioned in the >> wiki, where dropping the table/partition deletes the data). >> >> >> For external tables (also SH tables before Hive made those explicitly >> external), Hive doesn’t manage or own the data: >> >> * Hive assumes the data can be modified externally at any time. Any >> feature that requires assumptions about data being unchanged (queries from >> stats, ACID, some MV, query cache) won’t work. If it works, it’s a bug. >> * Data is detached from Hive entities, e.g. dropping the table by default >> doesn’t delete the data - we assume someone else owns it and we are just >> pointing at it. >> * Additionally, since Hive cannot track modifications, external >> products/users/etc. are fully responsible for any concurrency control. In >> particular, modifying data under the running query may produce >> inconsistent results (no atomicity between tasks), undefined results due >> to retries, etc. >> >> >> >> >> >> On 18/8/21, 10:31, "Mithun RK" <mythro...@gmail.com> wrote: >> >> >+1. I'm out of the loop as well. I'd be keen on grokking what's not >> >available with EXTERNAL tables, given that these are popular on our >> >clusters as well. It would be good to have that documented in one place. >> > >> >On Tue, Aug 21, 2018 at 8:19 AM Lars Francke <lars.fran...@gmail.com> >> >wrote: >> > >> >> As this came up with yet another customer and I've heard more things >> >>that I >> >> believe to be false: >> >> >> >> Please, anyone have any kind of feedback for me here? Is my list >> >>correct, >> >> can you add stuff to it, have more details to certain points etc.? >> >> >> >> Thank you! >> >> >> >> On Wed, Aug 1, 2018 at 12:17 PM, Lars Francke <lars.fran...@gmail.com> >> >> wrote: >> >> >> >> > I have opened https://issues.apache.org/jira/browse/HIVE-20287 and >> >>would >> >> > love to get feedback >> >> > >> >> > On Wed, Aug 1, 2018 at 11:14 AM, Lars Francke < >> lars.fran...@gmail.com> >> >> > wrote: >> >> > >> >> >> Thanks Lefty! I haven't created a JIRA but it's on my list. >> >> >> >> >> >> Still interested in any input anyone might have? >> >> >> >> >> >> Some of you guys _must_ have some knowledge on this stuff. >> >> >> >> >> >> You're correct that the DROP stuff is documented. What I meant is >> >>that I >> >> >> couldn't find the relevant parts in the code. >> >> >> >> >> >> On Sat, Jul 21, 2018 at 6:15 AM, Lefty Leverenz < >> >> leftylever...@gmail.com> >> >> >> wrote: >> >> >> >> >> >>> Agreed, the Hive wiki is woefully incomplete. But it does mention >> >>DROP >> >> >>> for >> >> >>> external tables in two sections of the DDL doc: >> >> >>> >> >> >>> - External Tables >> >> >>> <https://cwiki.apache.org/confluence/display/Hive/LanguageM >> >> >>> anual+DDL#LanguageManualDDL-ExternalTables>: >> >> >>> "When dropping an EXTERNAL table, data in the table is NOT >> >>deleted >> >> >>> from >> >> >>> the file system." >> >> >>> - Drop Table >> >> >>> <https://cwiki.apache.org/confluence/display/Hive/LanguageM >> >> >>> anual+DDL#LanguageManualDDL-DropTable>: >> >> >>> "When dropping an EXTERNAL table, data in the table will NOT be >> >> >>> deleted >> >> >>> from the file system." >> >> >>> >> >> >>> A wiki page comparing managed and external tables would be very >> >> helpful, >> >> >>> and the list is a good start. I suggest you open a JIRA issue. >> >> >>> >> >> >>> Who can help Lars with this task? >> >> >>> >> >> >>> -- Lefty >> >> >>> >> >> >>> >> >> >>> On Thu, Jul 19, 2018 at 3:04 PM Lars Francke >> >><lars.fran...@gmail.com> >> >> >>> wrote: >> >> >>> >> >> >>> > Hi, >> >> >>> > >> >> >>> > I've been wondering if anyone can tell me what the differences >> and >> >> >>> > limitations of managed vs. external tables are these days. >> >> >>> > >> >> >>> > I've seen the docs from Hortonworks[1] that list these features >> as >> >> not >> >> >>> > supported by External tables: >> >> >>> > * Query cache >> >> >>> > * Materialized views, except in a limited way >> >> >>> > * Default statistics gathering >> >> >>> > * Compute queries using statistics >> >> >>> > * Automatic runtime filtering >> >> >>> > * File merging after insert >> >> >>> > >> >> >>> > But there are no details. For some of them I'm not even sure what >> >> they >> >> >>> are >> >> >>> > supposed to mean because I couldn't find any documentation in our >> >> Wiki >> >> >>> :( >> >> >>> > >> >> >>> > So I'd love to create a Wiki page that describes the differences >> >> >>> between >> >> >>> > managed & external tables and list relevant JIRAs but I need your >> >> help >> >> >>> for >> >> >>> > that. >> >> >>> > >> >> >>> > I scanned the code for all references to TableType and this is >> the >> >> >>> list I >> >> >>> > found but I'm almost certainly missing something: >> >> >>> > >> >> >>> > * ARCHIVE/UNARCHIVE - DDLTask - Only works for managed tables >> >> >>> > * TRUNCATE - DDLSemanticAnalyzer - Only works for managed tables >> >> >>> > * MERGE/CONCATENATE - HiveRelOpMaterializationValidator - Only >> >>works >> >> >>> for >> >> >>> > managed tables >> >> >>> > * Constraints - DDLSemanticAnalyzer - (NOT NULL, DEFAULT, CHECK, >> >> only >> >> >>> RELY >> >> >>> > ist allowed) >> >> >>> > * IMPORT - ImportSemanticAnalyzer - This has some wild >> >>restrictions I >> >> >>> > didn't follow for external tables >> >> >>> > * Query Results Caching - https://issues.apache.org/jira >> >> >>> /browse/HIVE-18513 >> >> >>> > SemanticAnalyzer - Documentation missing for Results Cache >> >> >>> > >> >> >>> > So there's a bunch of those from the Hortonworks list missing. If >> >> >>> anyone >> >> >>> > could point me to documentation or code for those that'd be >> great. >> >> >>> > >> >> >>> > I also didn't find the most obvious/oldest one: Data is not >> >>deleted >> >> for >> >> >>> > EXTERNAL tables on DROP, I'm sure I just missed it. >> >> >>> > >> >> >>> > And last but not least: I have not looked at the grammar to see >> >> what's >> >> >>> > already forbidden at that level. >> >> >>> > >> >> >>> > As I said: Any hints would be greatly appreciated. >> >> >>> > >> >> >>> > Thank you! >> >> >>> > >> >> >>> > Lars >> >> >>> > >> >> >>> > A side note: There are so many fantastic features in Hive but >> >>because >> >> >>> lots >> >> >>> > of them are not or under-documented most people I know just stick >> >>to >> >> >>> the >> >> >>> > basics. That's a pity :( >> >> >>> > >> >> >>> > [1] < >> >> >>> > >> >> >>> > https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/usi >> >> >>> ng-hiveql/content/hive_hive_3_tables.html >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> >> >> >> >> >> > >> >> >> >> >