On 19 Aug 2017, at 02:42, Imtiaz Ahmed <[email protected]> wrote:
Hi All,
I am building a Spark library which developers will use when writing their
Spark jobs to get access to data on Azure Data Lake, but the authentication
will depend on the dataset they ask for. I need to call a REST API from within
the Spark job to get credentials and authenticate to read data from ADLS. Is
that even possible? I am new to Spark.
E.g., from inside a Spark job a user will say:
MyCredentials myCredentials = MyLibrary.getCredentialsForPath(userId,
"/some/path/on/azure/datalake");
Then, before spark.read.json("adl://examples/src/main/resources/people.json"),
I need to authenticate the user so they can read that path using the
credentials fetched above.
Any help is appreciated.
Thanks,
Imtiaz
The ADL filesystem supports addDelegationTokens(), allowing the caller to
collect the delegation tokens of the currently authenticated user and pass
them along with the request, which is exactly what Spark should be doing in
spark-submit.
If you want to do it yourself, look in SparkHadoopUtil (I think; IDE is closed
right now) and see how the tokens are picked up and then passed around:
marshalled over the job request, unmarshalled on the other side and picked up,
with bits of the UserGroupInformation class doing the low-level work.
Java code snippet to write the tokens to the path tokenFile (conf, renewer and
tokenFile assumed to be in scope):

FileSystem fs = FileSystem.get(conf);
Credentials cred = new Credentials();
// collect the delegation tokens of the current user into cred
Token<?>[] tokens = fs.addDelegationTokens(renewer, cred);
// persist them so another process can read them back
cred.writeTokenStorageFile(tokenFile, conf);
You can then read that file in elsewhere, and then (somehow) get the FS to use
those tokens.
Otherwise, ADL supports OAuth, so you may be able to use any OAuth libraries
for this. hadoop-azure-datalake pulls in okhttp for that:
<dependency>
  <groupId>com.squareup.okhttp</groupId>
  <artifactId>okhttp</artifactId>
  <version>2.4.0</version>
</dependency>
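
For the simple case you may not need to do the OAuth dance yourself: the ADL
connector can fetch tokens itself if you hand it client credentials through
the Hadoop configuration. A sketch using the dfs.adls.oauth2.* keys of the
Hadoop 2.8 connector (renamed to fs.adl.oauth2.* in later versions); here
spark is a SparkSession, and tenantId/clientId/clientSecret stand in for
whatever your REST call returns:

import org.apache.hadoop.conf.Configuration;

// assumes: SparkSession spark; String tenantId, clientId, clientSecret
Configuration conf = spark.sparkContext().hadoopConfiguration();
conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential");
conf.set("dfs.adls.oauth2.refresh.url",
    "https://login.microsoftonline.com/" + tenantId + "/oauth2/token");
conf.set("dfs.adls.oauth2.client.id", clientId);
conf.set("dfs.adls.oauth2.credential", clientSecret);
// adl:// reads after this, e.g. spark.read().json(...), use these credentials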
-Steve