Re: spark-jdbc impala with kerberos using yarn-client

2017-09-05 Thread morfious902002
I was able to query data from an Impala table. Here is my Git repo for anyone who would like to check it out: https://github.com/morfious902002/impala-spark-jdbc-kerberos
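
A minimal sketch of what a Spark-JDBC read from a Kerberized Impala can look like (not necessarily what the linked repo does; the host, realm, database, and table names are placeholders, and the URL options AuthMech=1, KrbRealm, KrbHostFQDN, and KrbServiceName follow the Cloudera Impala JDBC driver's conventions):

    import java.util.Properties;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class ImpalaJdbcKerberosRead {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("impala-jdbc-kerberos"));
            SQLContext sqlContext = new SQLContext(sc);

            // AuthMech=1 selects Kerberos in the Cloudera Impala JDBC driver;
            // the process must already hold a valid ticket (kinit or keytab).
            String url = "jdbc:impala://impala-host.example.com:21050;"
                    + "AuthMech=1;KrbRealm=EXAMPLE.COM;"
                    + "KrbHostFQDN=impala-host.example.com;KrbServiceName=impala";

            Properties props = new Properties();
            props.setProperty("driver", "com.cloudera.impala.jdbc41.Driver");

            DataFrame df = sqlContext.read().jdbc(url, "my_db.my_table", props);
            df.show();
            sc.stop();
        }
    }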

Re: spark-jdbc impala with kerberos using yarn-client

2017-07-03 Thread morfious902002
Did you ever find a solution to this? If so, can you share it? I am running into a similar issue in YARN cluster mode connecting to an Impala table.

Re: Creating Dataframe by querying Impala

2017-06-01 Thread morfious902002
The issue seems to be with the primordial class loader. I cannot place the drivers at the same location on all the nodes, but I have loaded the jars to HDFS. I have tried SPARK_YARN_DIST_FILES as well as SPARK_CLASSPATH on the edge node with no luck. Is there another way to load these jars through the primordial class loader?
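
One widely used workaround for exactly this symptom (not confirmed by this thread) is to stop relying on DriverManager's classloader check altogether: load the driver from whatever classloader did receive the jar (e.g. via --jars) and register a thin delegating shim, which DriverManager accepts because the shim itself was loaded by the application classloader. A sketch:

    import java.sql.Connection;
    import java.sql.Driver;
    import java.sql.DriverManager;
    import java.sql.DriverPropertyInfo;
    import java.sql.SQLException;
    import java.sql.SQLFeatureNotSupportedException;
    import java.util.Properties;
    import java.util.logging.Logger;

    // DriverManager refuses drivers it cannot see from the primordial/system
    // class loader. This shim is loaded by application code, so DriverManager
    // trusts it, and it simply delegates every call to the real Impala driver.
    public class DriverShim implements Driver {
        private final Driver delegate;

        public DriverShim(Driver delegate) {
            this.delegate = delegate;
        }

        public static void register(String driverClass) throws Exception {
            Driver real = (Driver) Class.forName(driverClass).newInstance();
            DriverManager.registerDriver(new DriverShim(real));
        }

        @Override public Connection connect(String url, Properties info) throws SQLException {
            return delegate.connect(url, info);
        }
        @Override public boolean acceptsURL(String url) throws SQLException {
            return delegate.acceptsURL(url);
        }
        @Override public DriverPropertyInfo[] getPropertyInfo(String url, Properties info)
                throws SQLException {
            return delegate.getPropertyInfo(url, info);
        }
        @Override public int getMajorVersion() { return delegate.getMajorVersion(); }
        @Override public int getMinorVersion() { return delegate.getMinorVersion(); }
        @Override public boolean jdbcCompliant() { return delegate.jdbcCompliant(); }
        @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException {
            return delegate.getParentLogger();
        }
    }

Usage would be a single DriverShim.register("com.cloudera.impala.jdbc41.Driver") at startup, before any getConnection call.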

Creating Dataframe by querying Impala

2017-05-31 Thread morfious902002
Hi, I am trying to create a DataFrame by querying an Impala table. It works fine in my local environment, but when I try to run it on the cluster I get either java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver or "No suitable driver found". Can someone help me or direct me to how to fix this?
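
Both errors usually mean the driver jar is visible in local mode but not on the cluster. A hedged sketch of the read itself, with placeholder host, database, and table names; naming the driver class explicitly addresses "No suitable driver", but only if the jar is also on the driver's classpath and shipped to the executors:

    import java.util.Properties;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class ImpalaDataFrame {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("impala-df"));
            SQLContext sqlContext = new SQLContext(sc);

            // The jar containing com.cloudera.impala.jdbc41.Driver must be on
            // the driver classpath (spark.driver.extraClassPath) AND reach the
            // executors (--jars or spark.executor.extraClassPath); otherwise
            // the ClassNotFoundException comes back in cluster mode.
            Properties props = new Properties();
            props.setProperty("driver", "com.cloudera.impala.jdbc41.Driver");

            DataFrame df = sqlContext.read()
                    .jdbc("jdbc:impala://impala-host.example.com:21050/my_db",
                          "my_table", props);
            df.show();
            sc.stop();
        }
    }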

Saving parquet file in Spark giving error when Encryption at Rest is implemented

2017-01-30 Thread morfious902002
We are using Spark 1.6.1 on a CDH 5.5 cluster. The job worked fine with Kerberos, but when we implemented Encryption at Rest we ran into the following issue: Df.write().mode(SaveMode.Append).partitionBy("Partition").parquet(path); I have already tried setting these values with no success: …
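
The message cuts off before listing the values tried. For HDFS transparent encryption the client has to know where the KMS lives so it can fetch delegation tokens; a sketch using the stock Hadoop property names (the KMS host is a placeholder, and this is an assumption about the cause, not a confirmed fix for this thread):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.SaveMode;

    public class EncryptedZoneWrite {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("parquet-encrypted-zone"));

            // HDFS encryption zones are backed by a KMS; both property names
            // below are the standard Hadoop ones, normally set in
            // core-site.xml/hdfs-site.xml rather than in code.
            String kms = "kms://http@kms-host.example.com:16000/kms";
            sc.hadoopConfiguration().set("dfs.encryption.key.provider.uri", kms);
            sc.hadoopConfiguration().set("hadoop.security.key.provider.path", kms);

            SQLContext sqlContext = new SQLContext(sc);
            DataFrame df = sqlContext.read().parquet("/source/path"); // placeholder input
            df.write().mode(SaveMode.Append).partitionBy("Partition")
              .parquet("/encryption/zone/path");                      // placeholder output
            sc.stop();
        }
    }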

Slow Parquet write to HDFS using Spark

2016-11-03 Thread morfious902002
I am using Spark 1.6.1 and writing to HDFS. In some cases it seems like all the work is being done by one thread. Why is that? Also, I need parquet.enable.summary-metadata to register the parquet files with Impala. Df.write().partitionBy("COLUMN").parquet(outputFileLocation); It also seems …
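
A hedged sketch of the two levers mentioned here, with placeholder paths: keeping the Parquet summary metadata via the Hadoop configuration, and repartitioning on the partition column so the write is spread across tasks instead of funneling through a few threads:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class PartitionedParquetWrite {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("partitioned-parquet"));

            // Keep the _metadata/_common_metadata summary files that the
            // Impala registration in this setup relies on.
            sc.hadoopConfiguration().set("parquet.enable.summary-metadata", "true");

            SQLContext sqlContext = new SQLContext(sc);
            DataFrame df = sqlContext.read().parquet("/input/path"); // placeholder

            // Repartitioning on the partition column lines tasks up with output
            // directories, so each task writes its own partition concurrently.
            df.repartition(df.col("COLUMN"))
              .write()
              .partitionBy("COLUMN")
              .parquet("/output/path");                              // placeholder
            sc.stop();
        }
    }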

Best Savemode option to write Parquet file

2016-10-06 Thread morfious902002
Hi all, I have searched a bit before posting this query. Using Spark 1.6.1: Dataframe.write().format("parquet").mode(SaveMode.Append).save("location") Note: the data in that folder can be deleted, and most of the time that folder doesn't even exist. Which SaveMode is best, if one is necessary at all?
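
For reference, the SaveMode semantics that matter here (df and the path are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.SaveMode;

    public class SaveModeExample {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("savemode"));
            SQLContext sqlContext = new SQLContext(sc);
            DataFrame df = sqlContext.read().parquet("/input/path"); // placeholder

            // SaveMode.ErrorIfExists (default): fail if the target directory exists.
            // SaveMode.Ignore: silently skip the write if the directory exists.
            // SaveMode.Append: create the directory if absent, add files if present.
            // SaveMode.Overwrite: delete whatever is there, then write fresh data.
            // For a directory that is routinely deleted and often absent,
            // Overwrite gives deterministic contents; Append also creates the
            // directory but risks duplicate rows on re-runs.
            df.write().format("parquet").mode(SaveMode.Overwrite).save("/output/location");
            sc.stop();
        }
    }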

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread morfious902002
I have a Spark job that creates 6 million rows in RDDs. I convert each RDD into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write to HDFS. I am using Spark 1.5.1 with YARN. Here is the snippet: RDDList.parallelStream().forEach(mapJavaRDD -> { if (mapJava…
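
"spark.sql.execution.id is already set" in Spark 1.5 is the known symptom of running DataFrame actions concurrently from threads that inherit the same thread-local execution id (SPARK-10548, fixed in 1.6). One way to sidestep it, sketched with hypothetical rddList/schema arguments standing in for the thread's RDDList, is to union the RDDs and issue a single write instead of a parallelStream of writes:

    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.StructType;

    public class SingleWrite {
        // rddList and schema are hypothetical stand-ins for the thread's code.
        public static void writeAll(SQLContext sqlContext,
                                    List<JavaRDD<Row>> rddList, StructType schema) {
            JavaRDD<Row> all = rddList.get(0);
            for (int i = 1; i < rddList.size(); i++) {
                all = all.union(rddList.get(i)); // lineage-only, cheap
            }
            // One write from one thread: no concurrent execution ids involved,
            // and the cluster still parallelizes across the union's partitions.
            sqlContext.createDataFrame(all, schema)
                      .write()
                      .parquet("/output/path");  // placeholder path
        }
    }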

Create a Spark cluster with cloudera CDH 5.2 support

2015-03-20 Thread morfious902002
Hi, I am trying to create a Spark cluster using the spark-ec2 script which will support 2.5.0-cdh5.3.2 for HDFS as well as Hive. I created the cluster by adding --hadoop-major-version=2.5.0, which solved some of the errors I was getting. But now, when I run a SELECT query on Hive, I get the following error: …

EC2 cluster created by spark using old HDFS 1.0

2015-03-20 Thread morfious902002
Hi, I created a cluster using the spark-ec2 script, but it installs HDFS version 1.0. I would like to use this cluster to connect to Hive installed on a Cloudera CDH 5.3 cluster, but I am getting the following error: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4