Hello,
I am attempting to set up a long-running, shared Hive metastore in AWS. The intention is for this to serve as the core repository of metadata for shared datasets across multiple AWS accounts. Users will be able to spin up their own short-lived EMR clusters, Spark jobs, etc., and then locate the data that they need using this metastore. The data will be stored on S3, the metadata database will be provided by RDS MySQL or Aurora, and I have the metastore service running on EC2 instances.

I'm trying to determine the best way to both authenticate and authorize users of the metastore in this scenario. Given that I'm no expert on user identity management and security, I'm finding it rather difficult to make headway.

On the subject of authentication, I'd ideally like to use each user's global IAM identity. However, I'm at a loss as to where and how I could integrate this with the metastore service. The metastore apparently supports Kerberos and LDAP, but I'm not sure how these fit into an AWS setting. I'd rather not run a separate directory server that maintains a set of identities distinct from the IAM identities in each account, although this does seem to be a possibility.

On the subject of authorization, I suspect that storage based authorization will not work with S3. Hive appears to use the Hadoop FileSystem abstraction to interrogate filesystem permissions, and the S3 FileSystem implementations do not appear to provide any visibility into S3 bucket permissions. SQL based authorization also appears to be inappropriate for this use case, as it requires HiveServer2 to enforce the finer-grained permissions (column-level access control, for example). However, I don't want to force all users to access data via HiveServer2, as that mandates a client that supports HS2.

At this point I wonder whether I must implement my own metastore authorization hook that interrogates the S3 bucket policy using AWS APIs.
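To make that concrete, below is the rough shape of the hook I have in mind. It is an untested sketch: S3PolicyAuthorizationProvider and policyAllows() are names I've invented, the overridden methods are the HiveMetastoreAuthorizationProvider interface as I understand it, and the S3 calls are from the v1 AWS SDK for Java. The body of policyAllows() is exactly the part I don't know how to implement.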
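package com.example.hive.security; // hypothetical package and class names throughout

import java.net.URI;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.IHMSHandler;
import org.apache.hadoop.hive.metastore.api.Database;
import org.apache.hadoop.hive.ql.metadata.AuthorizationException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.Partition;
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.security.authorization.HiveAuthorizationProviderBase;
import org.apache.hadoop.hive.ql.security.authorization.HiveMetastoreAuthorizationProvider;
import org.apache.hadoop.hive.ql.security.authorization.Privilege;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3PolicyAuthorizationProvider extends HiveAuthorizationProviderBase
    implements HiveMetastoreAuthorizationProvider {

  private AmazonS3 s3;

  @Override
  public void init(Configuration conf) throws HiveException {
    // Credentials would come from the metastore host's EC2 instance profile.
    s3 = AmazonS3ClientBuilder.defaultClient();
  }

  @Override
  public void setMetaStoreHandler(IHMSHandler handler) {
    // Not needed for this sketch.
  }

  @Override
  public void authorizeAuthorizationApiInvocation() throws HiveException, AuthorizationException {
    // GRANT/REVOKE would be meaningless here; permissions live in S3 bucket policies.
    throw new AuthorizationException("Permissions are managed via S3 bucket policies");
  }

  @Override
  public void authorize(Privilege[] read, Privilege[] write) {
    // User-level operations have no S3 location to check.
  }

  @Override
  public void authorize(Database db, Privilege[] read, Privilege[] write)
      throws HiveException, AuthorizationException {
    checkBucketPolicy(URI.create(db.getLocationUri()), read, write);
  }

  @Override
  public void authorize(Table table, Privilege[] read, Privilege[] write)
      throws HiveException, AuthorizationException {
    checkBucketPolicy(table.getDataLocation().toUri(), read, write);
  }

  @Override
  public void authorize(Partition part, Privilege[] read, Privilege[] write)
      throws HiveException, AuthorizationException {
    checkBucketPolicy(part.getDataLocation().toUri(), read, write);
  }

  @Override
  public void authorize(Table table, Partition part, List<String> columns,
      Privilege[] read, Privilege[] write) throws HiveException, AuthorizationException {
    // A bucket policy has no column-level granularity; fall back to the partition check.
    checkBucketPolicy(part.getDataLocation().toUri(), read, write);
  }

  private void checkBucketPolicy(URI location, Privilege[] read, Privilege[] write)
      throws AuthorizationException {
    String bucket = location.getHost();                              // s3://bucket/key -> bucket
    String policyJson = s3.getBucketPolicy(bucket).getPolicyText();  // null if no policy attached
    String user = authenticator.getUserName();
    if (!policyAllows(policyJson, user, location, read, write)) {
      throw new AuthorizationException(user + " denied access to " + location);
    }
  }

  // Entirely hypothetical: mapping the metastore user to an IAM principal and
  // evaluating the policy document is precisely the part I don't know how to do.
  // Perhaps IAM's SimulatePrincipalPolicy API could perform the evaluation?
  private boolean policyAllows(String policyJson, String user, URI location,
      Privilege[] read, Privilege[] write) {
    return false; // TODO
  }
}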
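If I've understood the configuration correctly, this would be wired in on the metastore side by setting hive.metastore.pre.event.listeners to org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener and hive.security.metastore.authorization.manager to the class above. The open questions would remain mapping the connecting user to an IAM principal and evaluating the bucket policy document.

Any suggestions or thoughts would be appreciated.

Thanks,
Elliot.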