Challenges with Datasource V2 API

2019-06-25 Thread Sunita Arvind
to surface the problem. Can someone review the code and tell me if I am doing something wrong? regards Sunita

Re: a way to allow spark job to continue despite task failures?

2018-01-24 Thread Sunita Arvind
for failed tasks were done, other tasks completed. You can set it to a higher or lower value depending on how many more tasks you have and how long they take to complete. regards Sunita On Fri, Nov 13, 2015 at 4:50 PM, Ted Yu wrote: > I searched the code base and looked at: > https://spark

Re: Chaining Spark Streaming Jobs

2017-11-02 Thread Sunita Arvind
(UnsupportedOperationChecker.scala:297) regards Sunita On Mon, Sep 18, 2017 at 10:15 AM, Michael Armbrust wrote: > You specify the schema when loading a dataframe by calling > spark.read.schema(...)... > > On Tue, Sep 12, 2017 at 4:50 PM, Sunita Arvind > wrote: > >> Hi Micha
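
A minimal sketch of Michael's suggestion, assuming a streaming file source; the schema fields and paths are illustrative, not from the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    object SchemaForStreamingSource {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("schema-example").getOrCreate()
        // Streaming file sources require an explicit schema; it cannot be inferred.
        val schema = StructType(Seq(
          StructField("id", LongType),
          StructField("event", StringType)))
        val df = spark.readStream.schema(schema).parquet("/tmp/stage1")
        df.writeStream.format("console").start().awaitTermination()
      }
    }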

Change the owner of hdfs file being saved

2017-11-02 Thread Sunita Arvind
use case. Is there a way to change the owner of files written by Spark? regards Sunita
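
Not from the thread, but one possible workaround is to change ownership after the write using the Hadoop FileSystem API; the path and user/group names below are illustrative, and setOwner generally requires HDFS superuser privileges:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // Change the owner of every file Spark wrote under the output directory.
    fs.listStatus(new Path("/data/output")).foreach { status =>
      fs.setOwner(status.getPath, "targetUser", "targetGroup")
    }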

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread Sunita Arvind
> On Sept 13, 2017 at 01:51, "Sunita Arvind" wrote: > > Hi Michael, > > I am wondering what I am doing wrong. I get an error like: > > Exception in thread "main" java.lang.IllegalArgumentException: Schema > must be specified when creating a streaming source D

Re: Chaining Spark Streaming Jobs

2017-09-12 Thread Sunita Arvind
spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:222) While running on the EMR cluster all paths p

Re: Chaining Spark Streaming Jobs

2017-09-08 Thread Sunita Arvind
Thanks for your response, Praneeth. We did consider Kafka; however, cost was the only holdback, since we might need a larger cluster, and the existing cluster is on premise while my app is in the cloud, so the same cluster cannot be used. But I agree it does sound like a good alternative. Regards Sunita

Re: Chaining Spark Streaming Jobs

2017-09-07 Thread Sunita Arvind
Thanks for your response Michael Will try it out. Regards Sunita On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust wrote: > If you use structured streaming and the file sink, you can have a > subsequent stream read using the file source. This will maintain exactly > once processin
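
A minimal sketch of what Michael describes, assuming Parquet for the intermediate data; the directory paths, schema, and source format are illustrative only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object ChainedStreams {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("chained-streams").getOrCreate()
        val schema = StructType(Seq(
          StructField("key", StringType),
          StructField("value", StringType)))

        // Job 1: ingest and persist with the file sink.
        val stage1 = spark.readStream.schema(schema).json("/data/raw")
        stage1.writeStream
          .format("parquet")
          .option("checkpointLocation", "/chk/stage1")
          .start("/data/stage1")

        // Job 2: treat stage 1's output directory as a streaming file source.
        val stage2 = spark.readStream.schema(schema).parquet("/data/stage1")
        stage2.writeStream
          .format("console")
          .option("checkpointLocation", "/chk/stage2")
          .start()

        spark.streams.awaitAnyTermination()
      }
    }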

Chaining Spark Streaming Jobs

2017-08-21 Thread Sunita Arvind
to be error prone. When either of the jobs gets delayed due to bursts or any error/exception, this could lead to huge data losses and non-deterministic behavior. What are the other alternatives to this? Appreciate any guidance in this regard. regards Sunita Koppar

Writing Parquet from Avro objects - cannot write null value for numeric fields

2017-01-05 Thread Sunita Arvind
parquet with null in the numeric fields. Is there a workaround for it? I need to be able to allow null values for numeric fields. Thanks in advance. regards Sunita
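
Not confirmed by the thread, but the usual way to allow nulls in Avro numeric fields is to declare them as a union with null. A sketch with Avro's SchemaBuilder; the record and field names are illustrative:

    import org.apache.avro.SchemaBuilder

    // "count" becomes a union of null and long, so null values are legal;
    // "id" stays a required (non-null) string.
    val schema = SchemaBuilder.record("Metric").fields()
      .requiredString("id")
      .optionalLong("count")
      .endRecord()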

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Sunita Arvind
re I am not doing overkill or overlooking a potential issue. regards Sunita On Tue, Oct 25, 2016 at 2:38 PM, Sunita Arvind wrote: > The error in the file I just shared is here: > > val partitionOffsetPath:String = topicDirs.consumerOffsetDir + "/" + > partition._2(0); -

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
Thanks for the response Sean. I have seen the NPE on similar issues very consistently and assumed that could be the reason :) Thanks for clarifying. regards Sunita On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen wrote: > This usage is fine, because you are only using the HiveContext locally

Re: HiveContext is Serialized?

2016-10-25 Thread Sunita Arvind
u can create the dataframe in main, register it as a table, and run the queries in the main method itself. You don't need to coalesce or run the method within foreach. Regards Sunita On Tuesday, October 25, 2016, Ajay Chander wrote: > > Jeff, Thanks for your response. I see below e
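
A minimal sketch of the advice above: build the DataFrame and run the SQL on the driver instead of touching the HiveContext inside foreach (Spark 1.x API; the file path and table name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object DriverSideQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("driver-side-query"))
        val hiveContext = new HiveContext(sc)

        val df = hiveContext.read.json("/data/events.json")
        df.registerTempTable("events")
        hiveContext.sql("SELECT count(*) FROM events").show()
      }
    }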

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
eeper") df.saveAsParquetFile(conf.getString("ParquetOutputPath")+offsetSaved) LogHandler.log.info("Created the parquet file") } Thanks Sunita On Tue, Oct 25, 2016 at 2:11 PM, Sunita Arvind wrote: > Attached is the edited code. Am I heading in right direc

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Sunita On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind wrote: > Thanks for confirming Cody. > To get to use the library, I had to do: > > val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"), > "/consumers/topics/"+ topics + "/0") >

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
nt the library to pick up all the partitions for a topic without me specifying the path; is it possible out of the box, or do I need to tweak? regards Sunita On Tue, Oct 25, 2016 at 12:08 PM, Cody Koeninger wrote: > You are correct that you shouldn't have to worry about broker id. > > I'm

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
Just re-read the Kafka architecture. Something that slipped my mind is that it is leader based, so the topic/partitionId pair will be the same on all the brokers, and we do not need to consider brokerId while storing offsets. Still exploring the rest of the items. regards Sunita On Tue, Oct 25, 2016 at 11:09 AM

Re: Zero Data Loss in Spark with Kafka

2016-10-25 Thread Sunita Arvind
are not considering brokerIds while storing offsets, and probably OffsetRanges does not have it either. It can only provide topic, partition, from and until offsets. I am probably missing something very basic. Probably the library works well by itself. Can someone / Cody explain? Cody, Thanks a lot f

Spark writing to elasticsearch asynchronously

2016-09-21 Thread Sunita Arvind
Hello Experts, Is there a way to get spark to write to elasticsearch asynchronously? Below are the details http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously regards Sunita

Increasing spark.yarn.executor.memoryOverhead degrades performance

2016-07-18 Thread Sunita Arvind
interesting observation is that bringing down the executor memory to 5 GB with executor memoryOverhead at 768 showed significant performance gains. What are the other associated settings? regards Sunita
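
For reference, the combination described above would look roughly like this; the values are the ones mentioned in the post, spark.yarn.executor.memoryOverhead is in MB, and the right numbers depend on the workload:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "5g")
      .set("spark.yarn.executor.memoryOverhead", "768")  // MB of off-heap headroom per executor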

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread Sunita Arvind
Thank you for your inputs. Will test it out and share my findings On Thursday, July 14, 2016, CosminC wrote: > Didn't have the time to investigate much further, but the one thing that > popped out is that partitioning was no longer working on 1.6.1. This would > definitely explain the 2x perfo

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-13 Thread Sunita
I am facing the same issue. Upgrading to Spark 1.6 is causing a huge performance loss. Could you solve this issue? I am also attempting the memory settings mentioned at http://spark.apache.org/docs/latest/configuration.html#memory-management but it's not making a lot of difference. Appreciate your inputs
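
One knob from the linked memory-management page that is commonly tried when 1.6 behaves differently from 1.5 (an assumption on my part, not something the thread confirms):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Revert to the pre-1.6 static memory manager while investigating the regression.
      .set("spark.memory.useLegacyMode", "true")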

Maintain complete state for updateStateByKey

2016-07-06 Thread Sunita Arvind
also trying to figure out if I can use the (iterator: Iterator[(K, Seq[V], Option[S])]) variant but haven't figured it out yet. Appreciate any suggestions in this regard. regards Sunita P.S: I am aware of mapWithState but am not on the latest version as of now.
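
A minimal sketch of updateStateByKey accumulating the complete history per key; the types and the input stream are illustrative, and checkpointing must be enabled on the StreamingContext:

    import org.apache.spark.streaming.dstream.DStream

    // events: DStream[(String, String)], e.g. (deviceId, reading)
    def keepAll(events: DStream[(String, String)]): DStream[(String, Seq[String])] = {
      def update(newValues: Seq[String], state: Option[Seq[String]]): Option[Seq[String]] =
        Some(state.getOrElse(Seq.empty) ++ newValues)   // never drop previously seen values
      events.updateStateByKey(update)                   // requires ssc.checkpoint(...)
    }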

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind
distribution data sets. Mentioning it here for the benefit of anyone else stumbling upon the same issue. regards Sunita On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind wrote: > Hello Experts, > > I am getting this error repeatedly: > > 16/06/23 03:06:59 ERROR streaming.StreamingContext:

Re: NullPointerException when starting StreamingContext

2016-06-23 Thread Sunita Arvind
r.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 38 more 16/06/23 11:09:53 INFO SparkContext: Invoking stop() from shutdown hook I've tried kafka version 0.8.2.0, 0.8.2.2, 0.9.0.0. With 0.9.0.0 the processing hangs much sooner. Can someone help with this error? reg

NullPointerException when starting StreamingContext

2016-06-22 Thread Sunita Arvind
c.awaitTermination() } } } I also tried putting all the initialization directly in main (not using method calls for initializeSpark and createDataStreamFromKafka) and also not putting in foreach and creating a single spark and streaming context. However, the error persists. Appreciate any help. regards Sunita

Seeking advice on realtime querying over JDBC

2016-06-02 Thread Sunita Arvind
do I need to have HiveContext in order to see the tables registered via Spark application through the JDBC? regards Sunita

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Thanks for the clarification, Michael, and good luck with Spark 2.0. It really looks promising. I am especially interested in the adhoc queries aspect. Probably that is what is being referred to as Continuous SQL in the slides. What is the timeframe for availability of this functionality? regards Sunita
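
For what it's worth, one way this kind of ad hoc querying later surfaced is the memory sink, which keeps a streaming result as an in-memory table that SQL can hit. This is an assumption about the eventual API, not something promised in this thread; the rate source and memory sink shown here appeared in later 2.x releases:

    import org.apache.spark.sql.SparkSession

    object AdhocOverStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("adhoc-over-stream").getOrCreate()
        import spark.implicits._

        val counts = spark.readStream.format("rate").load()
          .groupBy(($"value" % 10).as("bucket")).count()

        counts.writeStream
          .format("memory")          // keep results as an in-memory table
          .queryName("adhoc_view")
          .outputMode("complete")
          .start()

        // Ad hoc SQL over the continuously updated table
        // (in practice, wait for at least one trigger to complete first).
        spark.sql("SELECT * FROM adhoc_view ORDER BY bucket").show()
      }
    }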

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
2.1 or later only regards Sunita On Fri, May 6, 2016 at 1:06 PM, Michael Malak wrote: > At first glance, it looks like the only streaming data sources available > out of the box from the github master branch are > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/o

Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
ensure it works for our use cases. Can someone point me to relevant material for this. regards Sunita

Spark SQL - Registerfunction throwing MissingRequirementError in JavaMirror with primordial classloader

2015-04-26 Thread Sunita Arvind
ed: date1 is 2005-07-18 00:00:00 format is org.joda.time.format.DateTimeFormatter@28d101f3 date2 is 20150719 format is org.joda.time.format.DateTimeFormatter@5e411af2 Within 10 years FromDT =2005-07-18 00:00:00ToDT =20150719within10years =true actual number of years i

Unable to broadcast dimension tables with Spark SQL

2015-02-16 Thread Sunita Arvind
e("rdd1.key".attr === "rdd2.key".attr)) - DSL Style execution plan --> HashOuterJoin [education#18], [i1_education_cust_demo#29], LeftOuter, None Exchange (HashPartitioning [educ

Is pair rdd join more efficient than regular rdd

2015-02-01 Thread Sunita Arvind
ot of effort for us to try this approach and weigh the performance, as we need to register the output as tables to proceed using them. Hence would appreciate inputs from the community before proceeding. Regards Sunita Koppar
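
A minimal sketch of the pair-RDD alternative being weighed, keying both sides on the join column; the input paths and field layout are made up:

    import org.apache.spark.{SparkConf, SparkContext}

    object PairRddJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-join"))

        // (customerId, rest-of-line) for both datasets
        val orders    = sc.textFile("/data/orders").map(l => (l.split(",")(0), l))
        val customers = sc.textFile("/data/customers").map(l => (l.split(",")(0), l))

        // Pair-RDD join: both sides are shuffled by key.
        val joined = orders.join(customers)
        joined.take(5).foreach(println)
      }
    }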

Re: Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-21 Thread Sunita Arvind
I was able to resolve this by adding rdd.collect() after every stage. This enforced RDD evaluation and helped avoid the choke point. regards Sunita Koppar On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind wrote: > Hi, > > My spark jobs suddenly started getting hung and here is the debu

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Sunita Arvind
names. The Spark SQL wiki has good examples of this. Looks easier to manage to me than your solution below. Agree with you that when there are a lot of columns, calling row.getString() even once is not convenient. Regards Sunita On Tuesday, January 20, 2015, Night Wolf wrote: > In Spark
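
A small sketch of the named-access style being recommended over ordinals, using Row.getAs by field name (available in later Spark releases; the column names are illustrative):

    import org.apache.spark.sql.Row

    def describe(row: Row): String = {
      val name = row.getAs[String]("name")   // lookup by column name, not position
      val age  = row.getAs[Int]("age")
      s"$name is $age"
    }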

Spark job stuck at RangePartitioner at Exchange.scala:79

2015-01-17 Thread Sunita Arvind
ot;) sparkConf.set("spark.driver.memory","512m") sparkConf.set("spark.executor.memory","1g") sparkConf.set("spark.driver.maxResultSize","1g") Please note. In eclipse as well as sbt> the program kept throwing StackOverflow. Increasing Xss to 5 MB eliminated the problem, Could this be something unrelated to memory? The SchemaRDDs have close to 400 columns and hence I am using StructType(StructField) and performing applySchema. My code cannot be shared right now. If required, I will edit it and post. regards Sunita

Transform SchemaRDDs into new SchemaRDDs

2014-12-08 Thread Sunita Arvind
apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:94) at croevss.StageJoin$.vsswf(StageJoin.scala:162) at croevss.StageJoin$.main(StageJoin.scala:41) at croevss.StageJoin.main(StageJoin.scala) regards Sunita Koppar

Re: Spark setup on local windows machine

2014-12-02 Thread Sunita Arvind
trapper.scala) regards Sunita On Tue, Nov 25, 2014 at 11:47 PM, Sameer Farooqui wrote: > Hi Sunita, > > This gitbook may also be useful for you to get Spark running in local mode > on your Windows machine: > http://blueplastic.gitbooks.io/how-to-light-your-spark-on-a-stick/content/ &g

Spark setup on local windows machine

2014-11-25 Thread Sunita Arvind
ine. Appreciate your help. regards Sunita

Re: GraphX usecases

2014-08-25 Thread Sunita Arvind
Thanks for the clarification, Ankur. Appreciate it. Regards Sunita On Monday, August 25, 2014, Ankur Dave wrote: > At 2014-08-25 11:23:37 -0700, Sunita Arvind > wrote: > > Does this "We introduce GraphX, which combines the advantages of both > > data-parallel and gr

GraphX usecases

2014-08-25 Thread Sunita Arvind
fault-tolerance." mean that GraphX makes the typical RDBMS operations possible even when the data is persisted in a GDBMS and not viceversa? regards Sunita

Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-07-02 Thread Sunita Arvind
/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ > > Romain > > On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind > wrote: > >> Hello Experts, >> >> I am attempting to integrate Spark Editor with

Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-06-24 Thread Sunita Arvind
Hello Experts, I am attempting to integrate the Spark Editor with Hue on CDH5.0.1. I have the Spark installation built manually from the sources for Spark 1.0.0. I am able to integrate this with Cloudera Manager. Background: --- We have a 3 node VM cluster with CDH5.0.1. We required spa