Hi Stephen,

I like your idea to have a Sentry-like plugin without changing the
permission model. I will update once we finish it.

Xinli

On Mon, Oct 26, 2020 at 4:56 AM Stephen O'Donnell
<sodonn...@cloudera.com.invalid> wrote:

> My concern about this is that it breaks the posix compliance that HDFS
> otherwise sticks to as closely as possible. If we do this, then it opens
> the door for other "non-posix" things, which is a slippery slope.
>
> I would like to understand where Sentry breaks down performance wise, as
> management problems aside, it does what you describe, as it provides
> permissions on files based on the table level folder permission grants.
>
> It would also be possible to build a Sentry like plugin which provides
> permissions in a similar way to what you describe, without moving this
> change into the core or adding extra permission bits.
>
> With your suggestion, when you check the permission for a file, it will
> need to walk back the directory path looking for the special flag and if it
> finds it, give the permissions from that folder. That folder may even
> inherit permission from another parent folder. This is exactly what you can
> do with a custom permission provider. Sentry intercepts the path, and
> checks if the permissions have been provided from SentryDB, but it could
> just as easily pull the permissions for the files from their parent folder
> or check up the  directory hierarchy. Perhaps an extended attribute could
> be applied to a folder you would like permissions to be inherited from and
> the plugin could use that?
>
> On Fri, Oct 23, 2020 at 5:11 PM Xinli shang <sha...@uber.com.invalid>
> wrote:
>
> > So far we have feedbacks regarding the use of default ACLs and Sentry
> with
> > the NameNode plugin. Are there concerns about the risk of regression if
> we
> > add this feature?
> >
> > On Mon, Oct 19, 2020 at 7:19 AM Xinli shang <sha...@uber.com> wrote:
> >
> > > Thanks Stephen for sharing out!
> > >
> > > I don't have those details as our benchmarking test was done quite a
> > while
> > > ago and I wasn't a participant. But using Sentry does add one more
> > service
> > > to HDFS critical path which will add overhead more or less especially
> for
> > > reliability. In a long run, we consider completely removing Sentry
> > because
> > > we have company-wide generic policy stores, and using Sentry forces us
> to
> > > do the conversion from authoritative to Sentry police store, which
> > > introduces some problems.
> > >
> > > For your point of "*The ACLs at the HDFS level, if you have a lot of
> > > them, will certainly add to NN heap pressure*", that is another reason
> > > why I propose this HDFS-15638 to use table-level directory ACLs instead
> > of
> > > file-level.
> > >
> > > It seems GCP/GCS already has the prefix-based permission granting.
> Check
> > > it out here
> > > https://cloud.google.com/iam/docs/configuring-resource-based-access.
> > >
> > > Xinli
> > >
> > >
> > >
> > >
> > > On Mon, Oct 19, 2020 at 2:14 AM Stephen O'Donnell
> > > <sodonn...@cloudera.com.invalid> wrote:
> > >
> > >> Were you able to trace where the Sentry plugin was causing problems?
> Was
> > >> it
> > >> during the initial sync of ACLs, or updates of ACLs, or during
> > permission
> > >> lookups when accessing files?
> > >>
> > >> Some time back, I dealt with a ~230M file cluster which was using
> > Sentry.
> > >> Accidentally, all the Sentry provided ACLs got copied down to the HDFS
> > >> level. At that point, we noted a large increase in NN heap usage. I
> > cannot
> > >> recall exactly, but it was in the order of 10's of GB. Heap dumps
> showed
> > >> it
> > >> was down to the ACLs. Once we removed them at the HDFS level, heap
> usage
> > >> returned to normal.
> > >>
> > >> The ACLs at the HDFS level, if you have a lot of them, will certainly
> > add
> > >> to NN heap pressure.
> > >>
> > >> On Sun, Oct 18, 2020 at 2:43 PM Xinli shang <sha...@uber.com.invalid>
> > >> wrote:
> > >>
> > >> > We are using Apache Sentry. On the large scale of HDFS, which is our
> > >> case,
> > >> > we see a performance downgrade when enabling the Sentry plugin in
> > >> > NameNode.  So we have to disable the plugin in NN and map Sentry
> > >> policies
> > >> > to HDFS ACL. It works great so far. This is the only major issue we
> > see.
> > >> >
> > >> > On Sun, Oct 18, 2020 at 1:19 AM Stephen O'Donnell
> > >> > <sodonn...@cloudera.com.invalid> wrote:
> > >> >
> > >> > > I agree with Owen on this - I don't think this is a feature we
> > should
> > >> add
> > >> > > to HDFS.
> > >> > >
> > >> > > If managing the permissions for Hive tables is becoming a big
> > overhead
> > >> > for
> > >> > > you, you should look into something like Sentry. It allows you to
> > >> manage
> > >> > > the permissions of all the files and folders under Hive tables in
> a
> > >> > > centralized place. It is also more memory efficient inside the
> > >> namenode,
> > >> > as
> > >> > > it does not store an ACL object against each file. Sentry also
> > allows
> > >> for
> > >> > > more than 32 ACLs, which is the normal HDFS limit. At Cloudera, we
> > >> see a
> > >> > > lot of clusters using Sentry to manage Hive table permissions.
> > >> > >
> > >> > > Sentry simply uses the existing HDFS Attribute Provider interface,
> > so
> > >> it
> > >> > > theory, it would be fairly simple to create a plugin of your own
> to
> > do
> > >> > just
> > >> > > what you need, but as Sentry exists and is fairly well proven in
> > Hive
> > >> > > environments already, it would be simpler to just use it.
> > >> > >
> > >> > >
> > >> > > On Sat, Oct 17, 2020 at 3:23 PM Xinli shang
> <sha...@uber.com.invalid
> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Vinayakumar,
> > >> > > >
> > >> > > > The staging tables are dynamic. From the Hadoop security team
> > >> > > perspective,
> > >> > > > it is unrealistic to force every data writer to do that because
> > they
> > >> > are
> > >> > > so
> > >> > > > many and they write in different ways.
> > >> > > >
> > >> > > > Rename is just one scenario and there are other scenarios. For
> > >> example,
> > >> > > > when permission is changed, we need to apply that change to
> every
> > >> file
> > >> > > > today. If we can have that flag, we only change the table. or
> > >> > > > partition directories.
> > >> > > >
> > >> > > > Xinli
> > >> > > >
> > >> > > >
> > >> > > > On Sat, Oct 17, 2020 at 12:14 AM Vinayakumar B <
> > >> > vinayakum...@apache.org>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > IIUC, hive renames are from hive’s staging directory during
> > write
> > >> to
> > >> > > > final
> > >> > > > > destination within table.
> > >> > > > >
> > >> > > > > Why not set the default ACLs of staging directory to whatever
> > >> > expected,
> > >> > > > and
> > >> > > > > then continue write remaining files.
> > >> > > > >
> > >> > > > > In this way even after rename you will have expected ACLs on
> the
> > >> > final
> > >> > > > > files.
> > >> > > > >
> > >> > > > > Setting default ACLs on staging directory can be done using
> > single
> > >> > RPC.
> > >> > > > >
> > >> > > > > -Vinay
> > >> > > > >
> > >> > > > > On Sat, 17 Oct 2020 at 8:08 AM, Xinli shang
> > >> <sha...@uber.com.invalid
> > >> > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Thanks Owen for your reply! As mentioned in the Jira,
> default
> > >> ACLs
> > >> > > > don't
> > >> > > > > > apply to rename. Any idea how rename can work without
> setting
> > >> ACLs
> > >> > > per
> > >> > > > > > file?
> > >> > > > > >
> > >> > > > > > On Fri, Oct 16, 2020 at 7:25 PM Owen O'Malley <
> > >> > > owen.omal...@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > I'm very -1 on adding these semantics.
> > >> > > > > > >
> > >> > > > > > > When you create the table's directory, set the default
> ACL.
> > >> That
> > >> > > will
> > >> > > > > > have
> > >> > > > > > > exactly the effect that you are looking for without
> creating
> > >> > > > additional
> > >> > > > > > > semantics.
> > >> > > > > > >
> > >> > > > > > > .. Owen
> > >> > > > > > >
> > >> > > > > > > On Fri, Oct 16, 2020 at 7:02 PM Xinli shang
> > >> > > <sha...@uber.com.invalid
> > >> > > > >
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi all,
> > >> > > > > > > >
> > >> > > > > > > > I opened
> https://issues.apache.org/jira/browse/HDFS-15638
> > >> and
> > >> > > want
> > >> > > > > to
> > >> > > > > > > > collect feedback from the community. I know whenever
> > >> changing
> > >> > the
> > >> > > > > > > > permission model that follows POSIX model is never a
> > trivial
> > >> > > > change.
> > >> > > > > So
> > >> > > > > > > > please comment on if you have concerns. For reading
> > >> > convenience,
> > >> > > > here
> > >> > > > > > is
> > >> > > > > > > a
> > >> > > > > > > > copy of the ticket.
> > >> > > > > > > >
> > >> > > > > > > > *Problem*: Currently, when a user tries to accesses a
> file
> > >> > he/she
> > >> > > > > needs
> > >> > > > > > > the
> > >> > > > > > > > permissions of it's parent and ancestors and the
> > permission
> > >> of
> > >> > > that
> > >> > > > > > file.
> > >> > > > > > > > This is correct generally, but for Hive tables
> > >> > directories/files,
> > >> > > > all
> > >> > > > > > the
> > >> > > > > > > > files under a partition or even a table usually have the
> > >> same
> > >> > > > > > permissions
> > >> > > > > > > > for the same set of ACL groups. Although the permissions
> > and
> > >> > ACL
> > >> > > > > groups
> > >> > > > > > > are
> > >> > > > > > > > the same, the writer still need to call setfacl() for
> > every
> > >> > file
> > >> > > to
> > >> > > > > add
> > >> > > > > > > > LDAP groups. This results in a huge amount of RPC calls
> to
> > >> NN.
> > >> > > HDFS
> > >> > > > > has
> > >> > > > > > > > default ACL to solve that but that only applies to
> create
> > >> and
> > >> > > copy,
> > >> > > > > but
> > >> > > > > > > not
> > >> > > > > > > > apply to rename. However, in Hive ETL, rename is very
> > >> common.
> > >> > > > > > > >
> > >> > > > > > > > *Proposal*: Add a 1-bit flag to directory inodes to
> > indicate
> > >> > > > whether
> > >> > > > > or
> > >> > > > > > > not
> > >> > > > > > > > it is a Hive table directory. If that flag is set, then
> > all
> > >> the
> > >> > > > > > > > sub-directory and files under it will just use it's
> > >> permission
> > >> > > and
> > >> > > > > ACL
> > >> > > > > > > > groups settings. By doing this way, Hive ETL doesn't
> need
> > to
> > >> > set
> > >> > > > > > > > permissions at the file level. If that flag is not
> set(by
> > >> > > default),
> > >> > > > > > work
> > >> > > > > > > as
> > >> > > > > > > > before. To set/unset that flag, it would require admin
> > >> > privilege.
> > >> > > > > > > >
> > >> > > > > > > > --
> > >> > > > > > > > Xinli Shang
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Xinli Shang
> > >> > > > > >
> > >> > > > > --
> > >> > > > > -Vinay
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Xinli Shang
> > >> >
> > >>
> > >
> > >
> > > --
> > > Xinli Shang
> > >
> >
> >
> > --
> > Xinli Shang
> >
>


-- 
Xinli Shang

Reply via email to