Thank you for the breakdown of the problem.

Currently, option (1) or (2) would be the way to go.

The problem that (3) does not support HBase can be solved simply by adding
the HBase jars to Flink's lib directory. In the future, this should be
addressed by the YARN re-architecture:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
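
As a rough sketch of the jar workaround, assuming typical install locations (the paths below are placeholders and will differ per distribution):

```shell
# Hypothetical install locations -- adjust to your environment.
FLINK_HOME=/opt/flink
HBASE_HOME=/usr/lib/hbase

# Put the HBase client classes (e.g. org.apache.hadoop.hbase.HBaseConfiguration)
# on Flink's classpath so the session can obtain the HBase security token.
cp "$HBASE_HOME"/lib/hbase-*.jar "$FLINK_HOME/lib/"

# Start the session afterwards so the new jars are picked up and shipped to YARN.
"$FLINK_HOME/bin/yarn-session.sh"
```

This is a deployment sketch, not a supported mechanism; the proper fix is making HBase a first-class citizen of the YARN integration as discussed above.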

For the renewal of Kerberos tokens for streaming jobs: there is work in
progress and a pull request to attach keytabs to a Flink job:
https://github.com/apache/flink/pull/2275

The problem that the YARN session is accessible by everyone is a bit more
tricky. In the future, this should be solved by two parts:
  - With the YARN re-architecture, sessions are bound to individual users.
It should be possible to launch the session from a single
YarnExecutionEnvironment and then submit multiple jobs against it.
  - Over-the-wire encryption and authentication should make sure that no
other user can send jobs to that session.
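
For reference, option (2) can be sketched as a plain shell loop; the jar name, the job's parameters, and the table list below are placeholders for your own job, not actual Flink options:

```shell
#!/usr/bin/env bash
# Hypothetical job jar -- substitute your single-table export job.
JOB_JAR=export-hbase-table.jar

# tables.txt: one HBase table name per line.
while read -r table; do
  # Each invocation brings up its own Flink cluster on YARN, which is
  # where the ~30 second per-table startup overhead comes from.
  flink run "$JOB_JAR" --table "$table" --output "hdfs:///exports/$table"
done < tables.txt
```

The arguments after the jar are passed to the job's main method, so the per-table parameterization lives entirely in your own program.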

Greetings,
Stephan

On Mon, Aug 1, 2016 at 9:47 AM, Niels Basjes <ni...@basjes.nl> wrote:

> Hi,
>
> I have the situation that I have a Kerberos secured Yarn/HBase
> installation and I want to export data from a lot (~200) HBase tables to
> files on HDFS.
> I wrote a flink job that does this exactly the way I want it for a single
> table.
>
> Now in general I have a few possible approaches to do this for the 200
> tables I am facing:
>
> 1) Create a single job that reads the data from all of those tables and
> writes them to the correct files.
>     I expect that to be a monster that will hog the entire cluster because
> of the large number of HBase regions.
>
> 2) Run a job that does this for a single table and simply run that in a
> loop.
>     Essentially I would have a shell script or 'main' that loops over all
> table names and runs a Flink job for each of those.
>     The downside of this is that it will start a new flink topology on
> Yarn for each table.
>     This has a startup overhead of something like 30 seconds for each
> table that I would like to avoid.
>
> 3) I start a single yarn-session and submit my job in there 200
> times.
>     That would solve most of the startup overhead yet this doesn't work.
>
> If I start yarn-session then I see these two relevant lines in the output.
>
> 2016-07-29 14:58:30,575 INFO  org.apache.flink.yarn.Utils
>                   - Attempting to obtain Kerberos security token for HBase
> 2016-07-29 14:58:30,576 INFO  org.apache.flink.yarn.Utils
>                   - HBase is not available (not packaged with this
> application): ClassNotFoundException :
> "org.apache.hadoop.hbase.HBaseConfiguration".
>
> As a consequence any flink job I submit cannot access HBase at all.
>
> As an experiment I changed my yarn-session.sh script to include HBase on
> the classpath. (If you want I can submit a Jira issue and a pull request)
> Now the yarn-session does have HBase available and the jobs runs as
> expected.
>
> There are, however, two problems that remain:
> 1) This yarn-session is accessible by everyone on the cluster and, as a
> consequence, they can run jobs in there that can access all data I have
> access to.
> 2) The Kerberos token will expire after a while and (just like with all
> long-running jobs) I would really like this to be a 'long-lived' thing.
>
> As far as I know this is just the tip of the security iceberg, and I would
> like to know what the correct approach is to solve this.
>
> Thanks.
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>
