Hi Marko, you're on the right track, but a few things got mixed up.

1) Managed tables are purely a Hive feature; they have nothing to do with the underlying storage. Hive _assumes_ that it has full (and sole) control over the data when it's a managed table. But nothing stops the data from being owned by someone other than Hive, as long as Hive can access it. This happens quite frequently when other tools ingest data into a directory that Hive uses (whether this would be better as an external table is another discussion).

2) When you access data through Hive, it can consult an authorization plugin (technically not 100% correct but good enough for now) and ask "Hey, is this user allowed to do that action?". The plugin can then ask Sentry or Ranger or something else for a decision.

In Ranger you specify ACLs for users but, in this case, you'd specify them for Hive objects, e.g. user Marko is allowed to look at the "customers" table. That does not give you _any_ automatic permissions on HDFS, so you're safe there.

Sentry works slightly differently: here you grant access to Hive objects as well, but Sentry can also automatically grant HDFS access. If you have permission to SELECT * from a table, then you can also read the data straight from HDFS.

So the scenario you outlined shouldn't happen when you use Ranger but can happen when you use Sentry. These two projects might merge in the future now that Cloudera and Hortonworks have merged. We'll see.

You are correct, though, that an external user shouldn't meddle with "internals", but I see no harm in getting read-only access.

Hope that helps.

Cheers,
Lars

On Fri, Jun 21, 2019 at 5:00 PM Marko Bauhardt <m...@datameer.com> wrote:

> Hi all,
> I have a question about Hive3 Managed Tables and how they should be used
> in a production environment, let's say in an enterprise environment.
>
> As far as I understand, managed tables have a helpful set of features.
> See
> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_hive_3_tables.html
> So I see many reasons to use managed tables instead of external tables.
>
> The Hive documentation says that the data of managed tables is completely
> managed by Hive. That means the managed table space (HDFS path) is owned
> by the user `hive`. And only the owner has `rwx` on this path. No one
> else. So using `beeline` with a user other than `hive`, or even with
> `hive` but with impersonation/proxy-user, does not give me access to
> the data via a select statement.
>
> In an enterprise environment impersonation plays an important role. To
> allow access to the data, `ranger` (in HDP) comes into play.
> Is my assumption correct that `ranger` should be used to set ACLs to
> allow a set of groups/users access to the path of specific *managed*
> tables?
>
> Second question...
> If ranger opens the door to the data, I'm able to read the data directly
> from HDFS, let's say with a third-party tool. But I believe this is
> not a good option given how Hive works with
> transactional tables. See
> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html
> What I mean is the usage of deltas/buckets etc. Do you agree that direct
> access to the HDFS files in the managed table space is not recommended?
>
> Thanks,
> Marko
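
For what it's worth, the Ranger vs. Sentry difference from point 2 of the reply can be sketched as a toy model in Python. Everything here is made up for illustration: the `ToyAuthorizer` class, its method names, the `sync_hdfs` flag, and the warehouse path are assumptions, not the Ranger or Sentry API. It only shows the idea that a Hive-level grant may or may not be mirrored onto the table's HDFS directory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAuthorizer:
    """Toy stand-in for an authorization service (NOT the Ranger or Sentry API)."""

    sync_hdfs: bool  # True imitates the Sentry-style HDFS sync described in the reply
    table_paths: dict = field(default_factory=dict)  # table name -> HDFS directory
    select_grants: set = field(default_factory=set)  # (user, table) pairs

    def grant_select(self, user: str, table: str) -> None:
        self.select_grants.add((user, table))

    def can_select(self, user: str, table: str) -> bool:
        return (user, table) in self.select_grants

    def can_read_path(self, user: str, path: str) -> bool:
        """True only if a Hive-level grant is mirrored onto the table's directory."""
        if not self.sync_hdfs:
            # Ranger-style in this toy: a Hive grant implies nothing on HDFS.
            return False
        for granted_user, table in self.select_grants:
            base = self.table_paths.get(table)
            if granted_user == user and base is not None:
                if path == base or path.startswith(base.rstrip("/") + "/"):
                    return True
        return False


# Hypothetical warehouse layout and file name, for illustration only.
paths = {"customers": "/warehouse/managed/customers"}
ranger_like = ToyAuthorizer(sync_hdfs=False, table_paths=paths)
sentry_like = ToyAuthorizer(sync_hdfs=True, table_paths=paths)
for authorizer in (ranger_like, sentry_like):
    authorizer.grant_select("marko", "customers")

delta_file = "/warehouse/managed/customers/delta_0000001_0000001/bucket_00000"
print(ranger_like.can_select("marko", "customers"),
      ranger_like.can_read_path("marko", delta_file))
print(sentry_like.can_select("marko", "customers"),
      sentry_like.can_read_path("marko", delta_file))
```

In the Ranger-like case, reading the files directly would only work if someone separately granted access on the HDFS path; the toy leaves that out on purpose and simply answers "no" for anything not derived from a Hive grant.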