Spark mapPartitions output object size coming out larger than expected

2017-02-06 Thread nitinkak001
I am storing the output of mapPartitions in a ListBuffer and exposing its iterator as the output. The output is a list of Long tuples (Tuple2). When I check the size of the object using Spark's SizeEstimator.estimate method, it comes out to 80 bytes per record/tuple object (calculating this by "size o…
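
80 bytes per (Long, Long) record is plausible: on a 64-bit JVM the estimate covers the Tuple2 wrapper, two boxed java.lang.Long objects, and their object headers, not just the 16 bytes of raw payload. A minimal sketch of the measurement, assuming the records are (Long, Long) pairs held in a ListBuffer:

    import org.apache.spark.util.SizeEstimator
    import scala.collection.mutable.ListBuffer

    // Build a sample buffer of (Long, Long) tuples and estimate the
    // average deep size per record, headers and boxing included.
    val buf = new ListBuffer[(Long, Long)]()
    (1L to 100000L).foreach(i => buf += ((i, i * 2)))
    val perRecord = SizeEstimator.estimate(buf).toDouble / buf.size
    println(f"~$perRecord%.0f bytes per (Long, Long) record")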

How to add all jars in a folder to executor classpath?

2016-10-18 Thread nitinkak001
I need to add all the jars in hive/lib to my Spark job executor classpath. I tried spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib and spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/* but it does not add…
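
Note that spark.executor.extraLibraryPath sets the native library path (java.library.path), so jars placed there are ignored. One common workaround, assuming a bash shell, is to expand the directory into the comma-separated list that --jars expects:

    spark-submit \
      --jars $(echo /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*.jar | tr ' ' ',') \
      ...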

Spark SQL(Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data there is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions. I am using this code snippet to exec…
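
I cannot speak to the cause of the fixed 31, but if the goal is simply more parallelism downstream, the result can be repartitioned explicitly. A sketch, assuming the 1.3+ DataFrame API (the query and the count of 200 are placeholders to tune):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    val result = hc.sql("SELECT ...")              // the original query goes here
    val moreParts = result.rdd.repartition(200)    // placeholder partition count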

Out of Memory error caused by output object in mapPartitions

2016-02-15 Thread nitinkak001
My mapPartitions code, given below, outputs one record for each input record, so the output has the same number of records as the input. I am loading the output data into a ListBuffer object. This object is turning out to be too huge for memory, leading to an Out Of Memory exception. To be more clear…
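
The usual fix is to avoid materializing the partition at all: mapPartitions may return a lazily evaluated iterator, so each output record is produced only as the downstream consumer pulls it. A sketch with a hypothetical per-record transform:

    rdd.mapPartitions { iter =>
      // iter.map is lazy: no intermediate collection is built; each output
      // record is computed only when the consumer asks for it
      iter.map(record => transform(record))   // transform is a hypothetical function
    }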

Spark SQL incompatible with Apache Sentry(Cloudera bundle)

2015-06-24 Thread nitinkak001
CDH version: 5.3 Spark version: 1.2 I was trying to execute a Hive query from Spark code (using the HiveContext class). It was working fine until we installed Apache Sentry. Now it's giving me a read permission exception. /org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn…

Re: Does HiveContext connect to HiveServer2?

2015-06-22 Thread nitinkak001
Hey, I have exactly this question. Did you get an answer to it?

Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread nitinkak001
I am trying to run a Hive query from Spark code using a HiveContext object. It was running fine earlier, but since Apache Sentry was installed the process is failing with this exception: /org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn, access=READ_EXECUT…

Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-05 Thread nitinkak001
I am running Hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with hive-config.xml? I tried, but it does not work. Just want a confirmation.

Generating version agnostic jar path value for --jars clause

2015-05-02 Thread nitinkak001
I have a list of Cloudera jars which I need to provide in the --jars clause, mainly for the HiveContext functionality I am using. However, many of these jars have version numbers as part of their names, which means the names might change when I do a Cloudera upgrade. Just a note here,…
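
On CDH specifically, the unversioned parcel symlink /opt/cloudera/parcels/CDH always points at the active parcel, so paths built on it survive upgrades. Combined with a shell glob (assuming bash):

    HIVE_LIB=/opt/cloudera/parcels/CDH/lib/hive/lib
    spark-submit --jars $(echo $HIVE_LIB/*.jar | tr ' ' ',') ...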

Re: Weird exception in Spark job

2015-03-24 Thread nitinkak001
Any ideas on this?

Does HiveContext connect to HiveServer2?

2015-03-24 Thread nitinkak001
I am wondering if HiveContext connects to HiveServer2 or works through the Hive CLI. The reason I am asking is that Cloudera has deprecated the Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials?

Weird exception in Spark job

2015-03-23 Thread nitinkak001
I am trying to run a Hive query from Spark code through HiveContext. Does anybody know what these exceptions mean? I have no clue. LogType: stderr LogLength: 3345 Log Contents: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3…

Is yarn-standalone mode deprecated?

2015-03-23 Thread nitinkak001
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is that while I can find it in the 0.9.0 documentation (https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I am not able to find it in 1.2.0. I am using this mode to run Spark jobs from Oozie as a java action. Remov…
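
For reference, yarn-standalone was deprecated in Spark 1.0 in favor of yarn-cluster mode, which behaves the same way (the driver runs inside the YARN ApplicationMaster). The equivalent submission, with a hypothetical class and jar:

    spark-submit --master yarn-cluster --class com.example.MyJob my-job.jar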

HiveContext test, "Spark Context did not initialize after waiting 10000ms"

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the code: / val conf = new SparkConf().setAppName("HiveSparkIntegrationTest") conf.set("spark.executor.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib"); conf.set("spark.driver.ext…
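
In yarn-cluster mode this timeout usually means the ApplicationMaster gave up waiting for the user class to instantiate its SparkContext. Also note that driver-side settings such as spark.driver.extraClassPath cannot take effect when set programmatically in yarn-cluster mode, because the driver JVM has already started; they are normally passed at submit time instead. A sketch (paths shortened via the CDH symlink, class and jar names hypothetical):

    spark-submit --master yarn-cluster \
      --conf spark.executor.extraClassPath='/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
      --conf spark.driver.extraClassPath='/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
      --class HiveSparkIntegrationTest app.jar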

Re: Running Spark jobs via oozie

2015-03-03 Thread nitinkak001
I am also starting to work on this one. Did you get any solution to this issue?

Executing hive query from Spark code

2015-03-02 Thread nitinkak001
I want to run a Hive query inside Spark and use the resulting RDDs inside Spark. I read in the documentation: "/Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive. Note that this Hive assemb…
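
Once Spark is built with -Phive, the in-code part is short. A minimal sketch (the table name src is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveQueryExample"))
    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql("SELECT key, value FROM src").collect().foreach(println)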

Counters in Spark

2015-02-13 Thread nitinkak001
I am trying to implement counters in Spark, and I guess Accumulators are the way to do it. My motive is to update a counter in a map function and access/reset it in the driver code. However, the /println/ statement at the end still yields 0 (it should be 9). Am I doing something wrong? def main(arg…
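
Two details matter here: transformations are lazy, so the accumulator is only updated once an action runs over the mapped RDD, and the accumulated value is only readable on the driver. A sketch using the Spark 1.x accumulator API:

    val counter = sc.accumulator(0)
    val data = sc.parallelize(1 to 9)
    val mapped = data.map { x => counter += 1; x }
    mapped.count()            // the action triggers the map, and the updates
    println(counter.value)    // now prints 9 on the driver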

Where can I find logs set inside RDD processing functions?

2015-02-06 Thread nitinkak001
I am trying to debug my mapPartitions function. Here is the code. I am trying to log in two ways, using log.info() and println(). I am running in yarn-cluster mode. While I can see the logs from the driver code, I am not able to see logs from the map and mapPartitions functions in the Application Tracking…
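
Code inside map/mapPartitions runs on the executors, so log.info() and println() output goes to the executor containers' stdout/stderr, not to the driver log. In yarn-cluster mode the per-container logs can be fetched after the application finishes:

    yarn logs -applicationId <application id>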

How to get Hive table schema using Spark SQL or otherwise

2015-02-04 Thread nitinkak001
I want to get Hive table schema details into Spark. Specifically, I want to get column name and type information. Is it possible to do this, e.g. using JavaSchemaRDD or something else?
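
One approach that reads no data is to issue the query with LIMIT 0 and inspect the schema of the result; the sketch below uses the Scala API and a hypothetical table name:

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    val schema = hc.sql("SELECT * FROM mydb.mytable LIMIT 0").schema
    schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))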

Re: Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
Just to add, I am using Spark 1.1.0.

Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
I am trying to implement secondary sort in Spark as we do in MapReduce. Here is my data (tab separated, without the c1, c2, c3 header):

c1   c2   c3
1    2    4
1    3    6
2    4    7
2    6    8
3    5    5
3    1    8
3    2    0

To do secondary sort, I crea…
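
From Spark 1.2 onwards, repartitionAndSortWithinPartitions covers this pattern directly: partition on the grouping column but let the shuffle sort on the full composite key. A sketch under that assumption (input path and partition count hypothetical):

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

    // Partition on c1 only, so all rows for a c1 value land together,
    // while the shuffle sorts on the full (c1, c2) composite key.
    class FirstFieldPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key match {
        case (c1: Int, _) => math.abs(c1.hashCode) % numPartitions
      }
    }

    val sorted = sc.textFile("/path/to/input")                 // hypothetical path
      .map(_.split("\t"))
      .map(f => ((f(0).toInt, f(1).toInt), f(2).toInt))        // ((c1, c2), c3)
      .repartitionAndSortWithinPartitions(new FirstFieldPartitioner(10))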

Re: Window comparison matching using the sliding window functionality: feasibility

2015-02-02 Thread nitinkak001
Mine was not really a moving average problem. It was more like partitioning on some keys, sorting (on different keys), and then running a sliding window through each partition. I reverted to MapReduce for that (I needed secondary sort, which is not very mature in Spark right now). But, as far…

Running "beyond memory limits" in ConnectedComponents

2015-01-14 Thread nitinkak001
I am trying to run the connected components algorithm in Spark. The graph has roughly 28M edges and 3.2M vertices. Here is the code I am using: /val inputFile = "/user/hive/warehouse/spark_poc.db/window_compare_output_text/00_0" val conf = new SparkConf().setAppName("ConnectedComponentsTest")…
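
"Running beyond memory limits" is YARN killing the container for exceeding its physical memory cap, which includes JVM overhead outside the heap. The usual Spark 1.x mitigation is raising the executor memory overhead (property name as of that era; verify against your release):

    spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...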

Connected Components running for a long time and failing eventually

2014-11-24 Thread nitinkak001
I am trying to run connected components on a graph generated by reading an edge file. It runs for a long time (3-4 hrs) and then eventually fails, and I can't find any error in the log file. The file I am testing it on has 27M rows (edges). Is there something obviously wrong with the code? I tested the s…
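
For comparison, a minimal GraphX connected-components run looks like the sketch below (paths hypothetical). For iterative jobs of this size, caching the graph and setting a checkpoint directory help keep the lineage from growing unbounded across iterations:

    import org.apache.spark.graphx.GraphLoader

    sc.setCheckpointDir("/tmp/spark-checkpoints")                    // hypothetical dir
    val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt").cache()
    val components = graph.connectedComponents().vertices
    components.take(10).foreach(println)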

Partition sorting by Spark framework

2014-11-05 Thread nitinkak001
I need to sort my RDD partitions, but a whole partition might not fit into memory, so I cannot run the Collections.sort() method. Does Spark support partition sorting by virtue of its framework? I am working on version 1.1.0. I looked up a similar unanswered question: /http://apache-spark-user…
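
From Spark 1.2, repartitionAndSortWithinPartitions sorts records during the shuffle itself with an external sorter that can spill to disk, so a partition need not fit in memory. A one-line sketch (pairRdd and the partitioner choice are hypothetical):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

    // sorts by key within each partition during the shuffle, spilling as needed
    val sorted = pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(100))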

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
So, this means that I can create a table and insert data into it with dynamic partitioning, and those partitions would be inherited by RDDs. Is this in Spark 1.1.0? If not, is there a way to partition the data in a file based on some attributes of the rows (without hardcoding the number…

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions, guys?

Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-27 Thread nitinkak001
Would the RDD resulting from the below query be partitioned on GEO_REGION, GEO_COUNTRY? I ran some tests (using mapPartitions on the resulting RDD), and it seems that 50 partitions are always generated, while there should be around 1000. /"SELECT * FROM spark_poc.DISTRIBUTE BY GEO_REGION, GEO_COUN…
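
One thing worth checking: DISTRIBUTE BY introduces a shuffle, and the number of partitions that shuffle produces is governed by spark.sql.shuffle.partitions, not by the table's Hive partition count. A hedged sketch of raising it before re-running the query:

    hiveCtx.setConf("spark.sql.shuffle.partitions", "1000")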

Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread nitinkak001
I am working on running the following Hive query from Spark: /"SELECT * FROM spark_poc. DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID"/ It ran into /java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;/ (complete stack tr…
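
For context: com.google.common.hash.HashFunction.hashInt first appeared in Guava 12, while the Hadoop 2.x line ships Guava 11, so this NoSuchMethodError is the classic symptom of Hadoop's older Guava shadowing the Guava that Spark was built against. The workarounds of the era were classpath reordering or the experimental user-classpath-first switch; the flag below existed in Spark 1.x, but treat the exact name as an assumption to verify against your release:

    spark-submit --conf spark.files.userClassPathFirst=true ...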

JavaHiveContext class not found error. Help!!

2014-10-23 Thread nitinkak001
I am trying to run the below Hive query on YARN. I am using Cloudera 5.1. What can I do to make this work? /"SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID";/ Below is the stack trace: Exception in thread "Thread-4" java.lang.reflect.InvocationTarget…

com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread nitinkak001
I am running a simple RDD filter command. What does this mean? Here is the full stack trace (and the code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftw…
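
The exception means a record being serialized outgrew Kryo's output buffer. In Spark 1.x the buffer was sized in MB via the setting below (renamed to spark.kryoserializer.buffer / spark.kryoserializer.buffer.max in later releases; verify the name against your version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.kryoserializer.buffer.mb", "64")   // raise from the small 1.x default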

Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-10 Thread nitinkak001

Re: Window comparison matching using the sliding window functionality: feasibility

2014-09-30 Thread nitinkak001
Any ideas, guys? I am trying to find some information online, without much luck so far.

Window comparison matching using the sliding window functionality: feasibility

2014-09-29 Thread nitinkak001
I need to know the feasibility of the below task; I am thinking of it as a MapReduce/Spark effort. I need to run a distributed sliding window comparison for digital data matching on top of Hadoop. The data (a Hive table) will be partitioned and distributed across data nodes. Then the window comparis…
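
If the windows must span partition boundaries once the data is partitioned and sorted, MLlib ships a sliding() helper (a developer API in Spark 1.x; treat the exact import as an assumption for your version) that handles the boundary records:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // 3-element sliding windows over a sorted RDD, crossing partition boundaries
    val windows = sortedRdd.sliding(3)   // sortedRdd is hypothetical
    windows.take(5).foreach(w => println(w.mkString(",")))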