Thanks for the pointers towards the work you are doing here.
I'll put up a patch for the jars and such in the next few days.
https://issues.apache.org/jira/browse/FLINK-4287

Niels Basjes

On Mon, Aug 1, 2016 at 11:46 AM, Stephan Ewen <se...@apache.org> wrote:

> Thank you for the breakdown of the problem.
>
> Option (1) or (2) would be the way to go, currently.
>
> The problem that (3) does not support HBase is simply solvable by adding
> the HBase jars to the lib directory. In the future, this should be solved
> by the YARN re-architecturing:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
>
> For the renewal of Kerberos tokens for streaming jobs: there is work in
> progress and a pull request to attach keytabs to a Flink job:
> https://github.com/apache/flink/pull/2275
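>
> Once that is in, attaching a keytab should become a matter of configuration
> in flink-conf.yaml, roughly along these lines (the key names here are
> illustrative and may differ from what the pull request finally settles on):
>
>   security.kerberos.login.keytab: /path/to/user.keytab
>   security.kerberos.login.principal: user@YOUR.REALM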
>
> The problem that the YARN session is accessible to everyone is a bit
> trickier. In the future, this should be solved by these two parts:
>   - With the YARN re-architecturing, sessions are bound to individual
> users. It should be possible to launch the session from a single
> YarnExecutionEnvironment and then submit multiple jobs against it.
>   - The over-the-wire encryption and authentication should make sure that
> no other user can send jobs to that session.
>
> Greetings,
> Stephan
>
> On Mon, Aug 1, 2016 at 9:47 AM, Niels Basjes <ni...@basjes.nl> wrote:
>
>> Hi,
>>
>> I have a Kerberos-secured YARN/HBase installation and I want to export
>> data from a large number (~200) of HBase tables to files on HDFS.
>> I wrote a Flink job that does this exactly the way I want for a single
>> table.
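>>
>> For reference, the per-table job is shaped roughly like this (a simplified
>> sketch using the flink-hbase TableInputFormat; the class name, output format
>> and row mapping are illustrative, the real job extracts the columns it needs):
>>
>> import org.apache.flink.addons.hbase.TableInputFormat;
>> import org.apache.flink.api.java.DataSet;
>> import org.apache.flink.api.java.ExecutionEnvironment;
>> import org.apache.flink.api.java.tuple.Tuple2;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class ExportTable {
>>   public static void main(String[] args) throws Exception {
>>     final String tableName = args[0];
>>     final String outputPath = args[1];
>>
>>     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>>
>>     DataSet<Tuple2<String, String>> rows = env.createInput(
>>         new TableInputFormat<Tuple2<String, String>>() {
>>           @Override
>>           protected Scan getScanner() {
>>             return new Scan();  // full table scan
>>           }
>>           @Override
>>           protected String getTableName() {
>>             return tableName;
>>           }
>>           @Override
>>           protected Tuple2<String, String> mapResultToTuple(Result r) {
>>             // Placeholder mapping: row key plus a dump of the result.
>>             return new Tuple2<>(Bytes.toString(r.getRow()), r.toString());
>>           }
>>         });
>>
>>     rows.writeAsCsv(outputPath);
>>     env.execute("HBase export: " + tableName);
>>   }
>> }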
>>
>> Now in general I have a few possible approaches to do this for the 200
>> tables I am facing:
>>
>> 1) Create a single job that reads the data from all of those tables and
>> writes them to the correct files.
>>     I expect that to be a monster that will hog the entire cluster
>> because of the large number of HBase regions.
>>
>> 2) Run a job that does this for a single table and simply run that in a
>> loop.
>>     Essentially I would have a shell script or 'main' that loops over all
>> table names and runs a Flink job for each of those (roughly sketched
>> below, after this list).
>>     The downside of this is that it will start a new Flink topology on
>> YARN for each table.
>>     This has a startup overhead of something like 30 seconds per table,
>> which I would like to avoid.
>>
>> 3) I start a single yarn-session and submit my job into it 200 times.
>>     That would solve most of the startup overhead, yet this doesn't work.
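>>
>> To make (2) concrete, the loop I have in mind looks roughly like this (the
>> jar name, output paths and table list are illustrative):
>>
>> import java.util.Arrays;
>> import java.util.List;
>>
>> public class ExportAllTables {
>>   public static void main(String[] args) throws Exception {
>>     // Illustrative list; in practice this comes from the HBase admin API or a file.
>>     List<String> tables = Arrays.asList("table_001", "table_002" /* ... ~200 */);
>>     for (String table : tables) {
>>       // One 'flink run' per table means one YARN application per table,
>>       // which is exactly where the ~30 second startup overhead comes from.
>>       Process p = new ProcessBuilder(
>>               "flink", "run", "-m", "yarn-cluster",
>>               "hbase-export.jar", table, "hdfs:///exports/" + table)
>>           .inheritIO()
>>           .start();
>>       if (p.waitFor() != 0) {
>>         throw new RuntimeException("Export failed for table: " + table);
>>       }
>>     }
>>   }
>> }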
>>
>> If I start yarn-session then I see these two relevant lines in the output.
>>
>> 2016-07-29 14:58:30,575 INFO  org.apache.flink.yarn.Utils                  - Attempting to obtain Kerberos security token for HBase
>> 2016-07-29 14:58:30,576 INFO  org.apache.flink.yarn.Utils                  - HBase is not available (not packaged with this application): ClassNotFoundException : "org.apache.hadoop.hbase.HBaseConfiguration".
>>
>> As a consequence, any Flink job I submit cannot access HBase at all.
>>
>> As an experiment I changed my yarn-session.sh script to include HBase on
>> the classpath. (If you want, I can submit a Jira issue and a pull request.)
>> Now the yarn-session does have HBase available and the jobs run as
>> expected.
>>
>> There are, however, two problems that remain:
>> 1) This yarn-session is accessible to everyone on the cluster, and as a
>> consequence they can run jobs in it that can access all the data I have
>> access to.
>> 2) The Kerberos token will expire after a while, and (just like with all
>> long-running jobs) I would really like this to be a 'long lived'
>> thing.
>>
>> As far as I know this is just the tip of the security iceberg and I
>> would like to know what the correct approach is to solve this.
>>
>> Thanks.
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes
>>
>
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
