Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Hi All, We are evaluating a few real-time streaming query engines, and Spark is my personal choice. The addition of ad hoc queries is what is getting me further excited about it; however, the talks I have heard so far only mention it without providing details. I need to build a prototype to en

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
ctured Streaming; or if it is, it's not obvious how to write such code. (quoting Anthony May's reply)

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
;API design: data sources and sinks" is relevant here. In short, it would seem the code is not there yet to create a Kafka-fed Dataframe/Dataset that can be queried with Structured Streaming; or if it is, it's not obvious how to wri

Seeking advice on realtime querying over JDBC

2016-06-02 Thread Sunita Arvind
Hi Experts, We are trying to get a Kafka stream ingested in Spark and expose the registered table over JDBC for querying. Here are some questions: 1. Spark Streaming supports a single context per application, right? If I have multiple customers and would like to create a Kafka topic for each of them
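
A minimal sketch of one common approach (not from this thread; names are hypothetical): register each micro-batch as a temp table and expose the same session over JDBC via the Hive Thrift server.

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // Share one HiveContext between the streaming job and the JDBC endpoint
    val hiveContext = new HiveContext(sc)
    HiveThriftServer2.startWithContext(hiveContext)

    stream.foreachRDD { rdd =>
      // assumes an RDD of case-class records; hypothetical table name
      val df = hiveContext.createDataFrame(rdd)
      df.registerTempTable("live_events") // replaced on every batch
    }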

NullPointerException when starting StreamingContext

2016-06-22 Thread Sunita Arvind
Hello Experts, I am getting this error repeatedly: 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped java.lang.NullPointerException at com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202) at
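
The stack trace points at serialization of a com.typesafe.config.Config. A hedged sketch of one possible fix, assuming the Config was being captured in a serialized closure: keep it @transient and rebuild it lazily after deserialization.

    import com.typesafe.config.{Config, ConfigFactory}

    class JobSettings extends Serializable {
      // Not serialized; re-created on each JVM that touches this object
      @transient lazy val config: Config = ConfigFactory.load()
      def zkHosts: String = config.getString("zkHosts") // hypothetical key
    }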

Re: NullPointerException when starting StreamingContext

2016-06-23 Thread Sunita Arvind
ards Sunita On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind wrote: Hello Experts, I am getting this error repeatedly: 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped java.

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind
distribution data sets. Mentioning it here for the benefit of anyone else stumbling upon the same issue. regards Sunita On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind wrote: Hello Experts, I am getting this error repeatedly: 16/06/23 03:06:59 ERROR streaming.StreamingContext:

Maintain complete state for updateStateByKey

2016-07-06 Thread Sunita Arvind
Hello Experts, I have a requirement to maintain a list of ids for every customer for all of time. I should be able to provide a distinct count of ids on demand. All the examples I have seen so far indicate I need to maintain counts directly. My concern is, I will not be able to identify cumulative d
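
A sketch of maintaining the full id set per key instead of a count (hypothetical DStream names; requires ssc.checkpoint(...) since updateStateByKey is stateful):

    // events: DStream[(customerId, id)], both String
    def updateIds(batch: Seq[Set[String]], state: Option[Set[String]]): Option[Set[String]] =
      Some(state.getOrElse(Set.empty[String]) ++ batch.flatten)

    val idsByCustomer = events
      .mapValues(id => Set(id))
      .updateStateByKey(updateIds)

    // distinct count on demand; exact because the whole set is retained
    val distinctCounts = idsByCustomer.mapValues(_.size)

Note that the state grows without bound; for very large id spaces, an approximate structure such as HyperLogLog is the usual trade-off.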

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread Sunita Arvind
Thank you for your inputs. Will test it out and share my findings. On Thursday, July 14, 2016, CosminC wrote: Didn't have the time to investigate much further, but the one thing that popped out is that partitioning was no longer working on 1.6.1. This would definitely explain the 2x perfo

Increasing spark.yarn.executor.memoryOverhead degrades performance

2016-07-18 Thread Sunita Arvind
Hello Experts, For one of our streaming applications, we intermittently saw: WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. Based on what I found on the internet and the error
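
For reference, a sketch of the two settings involved (values are illustrative): the overhead is off-heap headroom inside the YARN container, on top of the executor heap, so the container limit is roughly their sum.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "10g")               // executor heap
      .set("spark.yarn.executor.memoryOverhead", "2048") // off-heap headroom, in MB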

Spark writing to elasticsearch asynchronously

2016-09-21 Thread Sunita Arvind
Hello Experts, Is there a way to get spark to write to elasticsearch asynchronously? Below are the details http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously regards Sunita
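
A hedged sketch of one way to make the write non-blocking, assuming the elasticsearch-hadoop connector: saveToEs itself is synchronous, so wrap the action in a Future to move the blocking call off the calling thread (index name is hypothetical).

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.elasticsearch.spark._ // adds saveToEs to RDDs

    // Fires the Spark action on a separate driver thread; handle the
    // Future's failure case so errors are not silently dropped.
    val save: Future[Unit] = Future {
      rdd.saveToEs("myindex/mytype")
    }
    save.failed.foreach(e => println(s"ES write failed: $e"))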

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Hello Experts, I am trying to use the saving-to-ZK design. Just saw Sudhir's comments that it is an old approach. Any reasons for that? Any issues observed with saving to ZK? The way we are planning to use it is: 1. Following http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-lo

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
, Sunita Arvind wrote: Hello Experts, I am trying to use the saving-to-ZK design. Just saw Sudhir's comments that it is an old approach. Any reasons for that? Any issues observed with saving to ZK? The way we are planning to use it is: 1. Following http://aseigneur

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
honestly not sure specifically what else you are asking at this point. On Tue, Oct 25, 2016 at 1:39 PM, Sunita Arvind wrote: Just re-read the Kafka architecture. Something that slipped my mind is, it is leader based. So the topic/partitionId pair will be sa

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Sunita On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind wrote: Thanks for confirming, Cody. To get to use the library, I had to do: val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"), "/consumers/topics/" + topics + "/0")

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
eeper") df.saveAsParquetFile(conf.getString("ParquetOutputPath")+offsetSaved) LogHandler.log.info("Created the parquet file") } Thanks Sunita On Tue, Oct 25, 2016 at 2:11 PM, Sunita Arvind wrote: > Attached is the edited code. Am I heading in right direc

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Ajay, AFAIK, generally these contexts cannot be accessed within loops. The SQL query itself runs on distributed datasets, so it is already a parallel execution. Putting it in a foreach would nest parallelism within parallelism, so serialization would become hard. Not sure I could explain it right. If you can crea

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Thanks for the response, Sean. I have seen the NPE on similar issues very consistently and assumed that could be the reason :) Thanks for clarifying. regards Sunita On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen wrote: This usage is fine, because you are only using the HiveContext locally on the
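
A sketch of the distinction Sean is drawing (hypothetical names): a driver-side loop over the context is fine; capturing the context inside an RDD closure is not.

    // Fine: the loop runs on the driver, so the context never leaves it
    tableNames.foreach { t =>
      hiveContext.sql(s"SELECT COUNT(*) FROM $t").show()
    }

    // Not fine: the closure ships to executors, where the context is unusable
    // someRdd.foreach { row => hiveContext.sql("...") }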

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Sunita Arvind
re I am not doing overkill or overlooking a potential issue. regards Sunita On Tue, Oct 25, 2016 at 2:38 PM, Sunita Arvind wrote: The error in the file I just shared is here: val partitionOffsetPath: String = topicDirs.consumerOffsetDir + "/" + partition._2(0); -

Writing Parquet from Avro objects - cannot write null value for numeric fields

2017-01-05 Thread Sunita Arvind
Hello Experts, I am trying to allow null values in numeric fields. Here are the details of the issue I have: http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields I also tried making all columns nullable by using the below function (from one of the
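
A sketch of a "make every column nullable" helper, in the spirit of the function mentioned above (the actual function in the thread may differ): rebuild the schema with nullable = true and re-wrap the same rows.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.{StructField, StructType}

    def withAllNullable(df: DataFrame): DataFrame = {
      val schema = StructType(df.schema.map {
        case StructField(name, dt, _, meta) => StructField(name, dt, nullable = true, meta)
      })
      // same data, relaxed schema
      df.sqlContext.createDataFrame(df.rdd, schema)
    }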

Chaining Spark Streaming Jobs

2017-08-21 Thread Sunita Arvind
Hello Spark Experts, I have a design question w.r.t. Spark Streaming. I have a streaming job that consumes protocol-buffer-encoded real-time logs from an on-premise Kafka cluster. My Spark application runs on EMR (AWS) and persists data onto S3. Before I persist, I need to strip the header and convert p

Re: Chaining Spark Streaming Jobs

2017-09-07 Thread Sunita Arvind
g even if there are hiccups or failures. On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind wrote: Hello Spark Experts, I have a design question w.r.t. Spark Streaming. I have a streaming job that consumes protocol buffer encoded real time logs

Re: Chaining Spark Streaming Jobs

2017-09-08 Thread Sunita Arvind
a few mins, you may have to end up creating a new file every few mins. You may want to consider Kafka as your intermediary store for building a chain/DAG of streaming jobs. On Fri, Sep 8, 2017 at 9:45 AM, Sunita Arvind wrote: Thanks for your response Michae

Re: Chaining Spark Streaming Jobs

2017-09-12 Thread Sunita Arvind
oint to S3. On my laptop, they all point to the local filesystem. I am using Spark 2.2.0. Appreciate your help. regards Sunita On Wed, Aug 23, 2017 at 2:30 PM, Michael Armbrust wrote: If you use structured streaming and the file sink, you can have a subsequent stream read using the fil

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread Sunita Arvind
On 13 Sept 2017 at 01:51, "Sunita Arvind" wrote: Hi Michael, I am wondering what I am doing wrong. I get an error like: Exception in thread "main" java.lang.IllegalArgumentException: Schema must be specified when creating a streaming source D

Change the owner of hdfs file being saved

2017-11-02 Thread Sunita Arvind
Hello Experts, I am required to use a specific user id to save files on a remote HDFS cluster. Remote in the sense that Spark jobs run on EMR and write to a CDH cluster, hence I cannot change hdfs-site.xml etc. to point to the destination cluster. As a result I am using WebHDFS to save the files in
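
A hedged sketch of driving the remote cluster over WebHDFS from code, without touching the local hdfs-site.xml (host, port, user, and path are placeholders):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Connect to the remote namenode's WebHDFS endpoint as a specific user
    val fs = FileSystem.get(
      new URI("webhdfs://remote-namenode:50070"), new Configuration(), "etluser")
    fs.setOwner(new Path("/data/out/part-00000"), "etluser", "etlgroup")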

Re: Chaining Spark Streaming Jobs

2017-11-02 Thread Sunita Arvind
(UnsupportedOperationChecker.scala:297) regards Sunita On Mon, Sep 18, 2017 at 10:15 AM, Michael Armbrust wrote: You specify the schema when loading a dataframe by calling spark.read.schema(...)... On Tue, Sep 12, 2017 at 4:50 PM, Sunita Arvind wrote: Hi Micha
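
A sketch of the fix Michael describes, applied to a streaming read (field names and path are hypothetical): file-based streaming sources require the schema up front.

    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("customerId", StringType)
      .add("eventTime", TimestampType)

    val df = spark.readStream
      .schema(schema)                       // mandatory for streaming file sources
      .parquet("s3://bucket/intermediate/") // hypothetical path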

Re: a way to allow spark job to continue despite task failures?

2018-01-24 Thread Sunita Arvind
Had a similar situation and landed on this question. Finally I was able to make it do what I needed by cheating the Spark driver :) i.e. by setting a very high value for "--conf spark.task.maxFailures=800". I deliberately made it 800, where the default is typically 4. So by the time 800 attempts for failed tasks

Challenges with Datasource V2 API

2019-06-25 Thread Sunita Arvind
Hello Spark Experts, I am having challenges using the DataSource V2 API. I created a mock. The input partitions seem to be created correctly; the output below confirms that: 19/06/23 16:00:21 INFO root: createInputPartitions 19/06/23 16:00:21 INFO root: Create a partition for abc The InputPartit

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Sunita Arvind
The below is not exactly a solution to your question, but this is what we are doing. The first time, we do end up doing row.getString(), and we immediately parse it through a map function which aligns it to either a case class or a StructType. Then we register it as a table and use just column nam
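
A sketch of that pattern (Spark 1.3+ DataFrame style; the column layout is hypothetical): do the positional access once, then work by column name.

    case class Event(id: String, category: String)

    import sqlContext.implicits._
    val events = rowRdd.map(r => Event(r.getString(0), r.getString(1))).toDF()
    events.registerTempTable("events")
    sqlContext.sql("SELECT category, COUNT(*) FROM events GROUP BY category")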

Re: Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-21 Thread Sunita Arvind
I was able to resolve this by adding rdd.collect() after every stage. This forced RDD evaluation and helped avoid the choke point. regards Sunita Koppar On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind wrote: Hi, My spark jobs suddenly started getting hung and here is the debu

Is pair rdd join more efficient than regular rdd

2015-02-01 Thread Sunita Arvind
Hi All, We are joining large tables using Spark SQL and running into shuffle issues. We have explored multiple options: using coalesce to reduce the number of partitions, tuning various parameters like disk buffer, reducing data in chunks, etc., all of which seem to help, btw. What I would like to know is,
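
On the underlying question: a plain join of two pair RDDs shuffles both sides unless they already share a partitioner. A sketch of pre-partitioning so later joins reuse the layout (the partition count is illustrative):

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(200)
    val left  = leftRdd.partitionBy(partitioner).persist()  // shuffled once, reused
    val right = rightRdd.partitionBy(partitioner)

    // both sides share the partitioner, so this join avoids a full shuffle
    val joined = left.join(right)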

Unable to broadcast dimension tables with Spark SQL

2015-02-16 Thread Sunita Arvind
Hi Experts, I have a large table with 54 million records (fact table) being joined with 6 small tables (dimension tables). The on-disk size of the small tables is within 5k, and the record counts are in the range of 4 to 200. All the worker nodes have 32 GB of RAM allocated for Spark. I have tried the belo
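
A sketch of forcing a broadcast join (the broadcast() hint is Spark 1.5+; on earlier versions the lever is spark.sql.autoBroadcastJoinThreshold, in bytes, below which tables are broadcast automatically):

    import org.apache.spark.sql.functions.broadcast

    // ship the small dimension table to every executor
    // instead of shuffling the 54M-row fact table
    val joined = factDf.join(broadcast(dimDf), factDf("dim_key") === dimDf("id"))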

Spark SQL - Registerfunction throwing MissingRequirementError in JavaMirror with primordial classloader

2015-04-26 Thread Sunita Arvind
Hi All, I am trying to use a function within Spark SQL which accepts 2 to 4 arguments. I was able to get through the compilation errors; however, I see the attached runtime exception when calling it from Spark SQL (refer to the attachment StackTraceFor_runTestInSQL for the complete stack trace). The function itse

Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-06-24 Thread Sunita Arvind
Hello Experts, I am attempting to integrate the Spark Editor with Hue on CDH 5.0.1. I have the Spark installation built manually from the sources for Spark 1.0.0. I am able to integrate this with Cloudera Manager. Background: We have a 3-node VM cluster with CDH 5.0.1. We required spa

Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-07-02 Thread Sunita Arvind
/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ Romain On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind wrote: Hello Experts, I am attempting to integrate Spark Editor with

GraphX usecases

2014-08-25 Thread Sunita Arvind
Hi, I am exploring the GraphX library and trying to determine which use cases make the most sense for it. From what I initially thought, it looked like GraphX could be applied to data stored in RDBMSs, as Spark could translate the relational data into a graph representation. However, there seems to b

Re: GraphX usecases

2014-08-25 Thread Sunita Arvind
Thanks for the clarification, Ankur. Appreciate it. Regards Sunita On Monday, August 25, 2014, Ankur Dave wrote: At 2014-08-25 11:23:37 -0700, Sunita Arvind wrote: Does this "We introduce GraphX, which combines the advantages of both data-parallel and gr

Spark setup on local windows machine

2014-11-25 Thread Sunita Arvind
Hi All, I just installed Spark on my laptop and am trying to get spark-shell to work. Here is the error I see: C:\spark\bin>spark-shell Exception in thread "main" java.util.NoSuchElementException: key not found: CLASSPATH at scala.collection.MapLike$class.default(MapLike.scala:228)

Re: Spark setup on local windows machine

2014-12-02 Thread Sunita Arvind
On Tue, Nov 25, 2014 at 11:09 PM, Akhil Das wrote: You could try following these guidelines: http://docs.sigmoidanalytics.com/index.php/How_to_build_SPARK_on_Windows Thanks Best Regards On Wed, Nov 26, 2014 at 12:24

Transform SchemaRDDs into new SchemaRDDs

2014-12-08 Thread Sunita Arvind
Hi, I need to generate some flags based on certain columns and add them back to the SchemaRDD for further operations. Do I have to use a case class (via reflection or programmatically)? I am using Parquet files, so the schema is being derived automatically. This is a great feature; thanks to the Spark developers,
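
A sketch of deriving the flags in SQL instead of a case class, since the Parquet schema is already there (column names and the flag rule are hypothetical):

    parquetData.registerTempTable("base")
    val withFlags = sqlContext.sql(
      "SELECT *, CASE WHEN amount > 1000 THEN 1 ELSE 0 END AS high_value_flag FROM base")
    withFlags.registerTempTable("base_with_flags") // ready for further operations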

Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-17 Thread Sunita Arvind
Hi, My Spark jobs suddenly started getting hung, and here is the debugging leading to it: following the program, it seems to be stuck whenever I do any collect(), count(), or rdd.saveAsParquetFile. AFAIK, any operation that requires data to flow back to the master causes this. I increased the memory to 5 MB. Al