Hi everyone,

First of all, let me explain what I am trying to do; apologies in advance for the lengthy mail.
What I am trying to do:

1) Programmatically connect to a remote secured (Kerberized) Hadoop cluster (CDH 5.7) from my local machine.
2) Once connected, read data from a remote Hive table into a Spark DataFrame.
3) Once the data is loaded into my local DataFrame, apply some transformations and run some tests.

I know we can do these things from spark-shell on an edge node, but I am trying to find a way to do them from my IDE.

My local environment (Windows 7): IntelliJ IDEA, Maven as the build tool, and Java.

Things that I have got working:

- Since the cluster is secured with Kerberos, I had to use a keytab file to authenticate, like below:

    System.setProperty("java.security.krb5.conf", "C:\\Users\\Ajay\\Documents\\Kerberos\\krb5.conf");
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "Kerberos");
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab("a...@internal.domain.com",
        "C:\\Users\\Ajay\\Documents\\Kerberos\\rc4\\rc4.keytab");

- Now that I can authenticate to the cluster with the keytab, I first tried a plain JDBC call (not the Spark API) to check whether I could read the data successfully. I could, like below:

    String driverName = "org.apache.hive.jdbc.HiveDriver";
    Class.forName(driverName);
    String url = "jdbc:hive2://mydevcluster.domain.com:10000/test;principal=hive/_h...@internal.domain.com;saslQop=auth-conf";
    Connection con = DriverManager.getConnection(url);
    String query = "select * from test.test_data limit 10";
    Statement stmt = con.createStatement();
    System.out.println("Executing Query...");
    ResultSet rs = stmt.executeQuery(query);
    while (rs.next()) {
        String emp_name = rs.getString("emp_name");
        System.out.println("Employee Name: " + emp_name);
    }

  Here is the Hive JDBC driver in my pom.xml:

    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>1.1.0</version>
    </dependency>

- Now that I have made sure the JDBC connection to the secured cluster works fine, the next step is to use the Spark API to read the Hive table into a DataFrame. I use Spark 1.6.
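(For reference, the HiveContext hc used in the snippets below is created roughly as follows. This is only a minimal sketch of my setup; the app name and master are placeholders:)

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    // Minimal Spark 1.6 setup; app name and local master are placeholders.
    SparkConf sparkConf = new SparkConf()
        .setAppName("Dev_Cluster_Test")
        .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    HiveContext hc = new HiveContext(sc);  // 'hc' is the context used below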
Here is what I tried:

    // Trying to use a JDBC connection to Hive through Spark 1.6 and hive-jdbc 1.1.0
    String JDBC_DB_URL = "jdbc:hive2://mydevcluster.domain.com:10000/test;principal=hive/_h...@internal.domain.com;saslQop=auth-conf";
    Map<String, String> options = new HashMap<String, String>();
    options.put("driver", "org.apache.hive.jdbc.HiveDriver");
    options.put("url", JDBC_DB_URL);
    options.put("dbtable", "test.test_data");
    DataFrame jdbcDF = hc.read().format("jdbc").options(options).load();
    jdbcDF.printSchema();

This fails with the error below:

    Exception in thread "main" java.sql.SQLException: Method not supported
        at org.apache.hive.jdbc.HiveResultSetMetaData.isSigned(HiveResultSetMetaData.java:141)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
        at Dev_Cluster_Test.main(Dev_Cluster_Test.java:88)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Then I looked at the Spark code base:

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L136

which calls into the hive-jdbc code base here:

https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java#L143

In other words, Spark's JDBCRDD.resolveTable calls ResultSetMetaData.isSigned() on each column, and Hive's HiveResultSetMetaData does not implement it; it simply throws "Method not supported". Thus the error. I also looked at the Spark 2.0.0 code:

https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L139

which makes the same call and results in the same "Method not supported" error.

Can anyone please shed some light on this and tell me if I am missing anything here? I appreciate your time. Thank you.

Regards,
Ajay
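P.S. One alternative I am experimenting with, in case it helps the discussion: reading the table through the Hive metastore instead of the JDBC data source, which avoids the isSigned() call entirely. This is only a sketch and assumes the cluster's hive-site.xml (and core-site.xml/hdfs-site.xml) is on my local classpath so the HiveContext can reach the metastore:

    // Sketch: bypass the JDBC data source and go through the metastore instead.
    // Assumes hive-site.xml from the cluster is on the classpath and the
    // Kerberos keytab login shown above has already been performed.
    DataFrame df = hc.table("test.test_data");  // or: hc.sql("select * from test.test_data")
    df.printSchema();
    df.limit(10).show();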