Hi All,
We are evaluating a few real-time streaming query engines and Spark is my
personal choice. The addition of ad hoc queries is what gets me even more
excited about it; however, the talks I have heard so far only mention it
without providing details. I need to build a prototype to
en
ctured Streaming; or if it
> is, it's not obvious how to write such code.
>
>
> ------
> *From:* Anthony May
> *To:* Deepak Sharma ; Sunita Arvind <
> sunitarv...@gmail.com>
> *Cc:* "user@spark.apache.org"
> *Sent:* Friday, Ma
;API
>>> design: data sources and sinks" is relevant here.
>>>
>>> In short, it would seem the code is not there yet to create a Kafka-fed
>>> Dataframe/Dataset that can be queried with Structured Streaming; or if it
>>> is, it's not obvious how to wri
Hi Experts,
We are trying to get a Kafka stream ingested into Spark and expose the
registered table over JDBC for querying. Here are some questions:
1. Spark Streaming supports a single context per application, right? If I have
multiple customers and would like to create a Kafka topic for each of them
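Not part of the original thread, but a rough sketch of the general shape this could take, assuming the Kafka 0.10 direct stream and the Hive thrift server; the broker address, topic name, and the view name "events" are all illustrative:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaOverJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-over-jdbc").enableHiveSupport().getOrCreate()
    import spark.implicits._
    val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

    val kafkaParams = Map[String, Object](                     // illustrative values
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "jdbc-demo")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Expose the SQL engine over JDBC (Hive thrift protocol, port 10000 by default).
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Re-register the latest micro-batch each interval; JDBC clients query "events".
    stream.map(_.value).foreachRDD { rdd =>
      rdd.toDF("value").createOrReplaceTempView("events")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}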
Hello Experts,
I am getting this error repeatedly:
16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the
context, marking it as stopped
java.lang.NullPointerException
at
com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
at
regards
Sunita
On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind
wrote:
> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the
> context, marking it as stopped
> java.
distribution data sets. Mentioning it here for
the benefit of anyone else stumbling upon the same issue.
regards
Sunita
On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind
wrote:
> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext:
Hello Experts,
I have a requirement to maintain a list of ids for every customer for
all time. I should be able to provide a count of distinct ids on demand. All
the examples I have seen so far indicate I need to maintain counts
directly. My concern is that I will not be able to identify cumulative d
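Not from the thread, but a minimal sketch of one way to keep the raw ids (rather than counts) in streaming state, so a distinct count is just the size of the per-customer set. It assumes a DStream of (customerId, id) pairs and a checkpointed StreamingContext; for very high cardinality a HyperLogLog sketch per customer would bound the memory instead.

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// (customerId, id) in, (customerId, cumulative distinct count) out.
def distinctIdsPerCustomer(pairs: DStream[(String, String)]): DStream[(String, Int)] = {
  val spec = StateSpec.function(
    (customer: String, id: Option[String], state: State[Set[String]]) => {
      val existing = state.getOption().getOrElse(Set.empty[String])
      val ids = id.fold(existing)(existing + _)   // keep the ids themselves, not a count
      state.update(ids)
      (customer, ids.size)
    })
  pairs.mapWithState(spec)                        // requires ssc.checkpoint(...) to be set
}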
Thank you for your inputs. Will test it out and share my findings
On Thursday, July 14, 2016, CosminC wrote:
> Didn't have the time to investigate much further, but the one thing that
> popped out is that partitioning was no longer working on 1.6.1. This would
> definitely explain the 2x perfo
Hello Experts,
For one of our streaming applications, we intermittently saw:
WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory
limits. 12.0 GB of 12 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
Based on what I found on the internet and the error
Hello Experts,
Is there a way to get spark to write to elasticsearch asynchronously?
Below are the details
http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously
regards
Sunita
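For what it's worth, a hedged sketch of one way to decouple the ES write from the driver loop, assuming the elasticsearch-spark connector is on the classpath and the stream carries Map-shaped documents; the "logs/event" resource name is illustrative:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}
import org.apache.spark.streaming.dstream.DStream
import org.elasticsearch.spark._                       // adds saveToEs to RDDs

def saveToEsAsync(stream: DStream[Map[String, Any]], esResource: String): Unit =
  stream.foreachRDD { rdd =>
    // saveToEs itself is a blocking Spark job; wrapping it in a Future lets the
    // next batch be scheduled while this one is still being written.
    Future(rdd.saveToEs(esResource)).onComplete {
      case Success(_) => println(s"wrote batch to $esResource")
      case Failure(e) => println(s"ES write to $esResource failed: $e")
    }
  }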
Hello Experts,
I am trying to use the save-offsets-to-ZK design. I just saw Sudhir's comments
that it is an old approach. Any reasons for that? Any issues observed with
saving to ZK? The way we are planning to use it is:
1. Following http://aseigneurin.github.io/2016/05/07/spark-kafka-
achieving-zero-data-lo
, Sunita Arvind
wrote:
> Hello Experts,
>
> I am trying to use the saving to ZK design. Just saw Sudhir's comments
> that it is old approach. Any reasons for that? Any issues observed with
> saving to ZK. The way we are planning to use it is:
> 1. Following http://aseigneur
honestly not sure specifically what else you are asking at this point.
>
> On Tue, Oct 25, 2016 at 1:39 PM, Sunita Arvind
> wrote:
> > Just re-read the kafka architecture. Something that slipped my mind is,
> it
> > is leader based. So topic/partitionId pair will be sa
Sunita
On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind
wrote:
> Thanks for confirming Cody.
> To get to use the library, I had to do:
>
> val offsetsStore = new ZooKeeperOffsetsStore(conf.getString("zkHosts"),
> "/consumers/topics/"+ topics + "/0")
>
eeper")
df.saveAsParquetFile(conf.getString("ParquetOutputPath")+offsetSaved)
LogHandler.log.info("Created the parquet file")
}
Thanks
Sunita
On Tue, Oct 25, 2016 at 2:11 PM, Sunita Arvind
wrote:
> Attached is the edited code. Am I heading in right direc
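As a contrast to the ZooKeeper store discussed above, a minimal sketch (assuming the kafka-0-10 integration) of letting Kafka itself store the offsets after each batch, which is presumably the newer approach Sudhir's comment alluded to; the processing step is left as a placeholder:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

def processAndCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... write the batch (e.g. the parquet output above) first ...
    // then commit, so offsets only advance once the output is safely persisted
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }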
Ajay,
AFAIK these contexts generally cannot be used inside RDD operations. The SQL
query itself already runs on distributed datasets, so it is a parallel
execution; putting the context inside a foreach would nest one distributed
operation inside another, and the context cannot be serialized to the executors.
Not sure I explained it right.
If you can crea
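A small illustration of the point (not from the thread; table and column names are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def queryPerId(spark: SparkSession, ids: RDD[String]): Unit = {
  // Anti-pattern: this closure runs on executors, and the session cannot be
  // serialized and shipped there.
  // ids.foreach(id => spark.sql(s"SELECT * FROM events WHERE id = '$id'"))

  // Driver-side loop: fine when the number of ids is small.
  ids.collect().foreach { id =>
    spark.sql(s"SELECT count(*) FROM events WHERE id = '$id'").show()
  }
}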
Thanks for the response Sean. I have seen the NPE on similar issues very
consistently and assumed that could be the reason :) Thanks for clarifying.
regards
Sunita
On Tue, Oct 25, 2016 at 10:11 PM, Sean Owen wrote:
> This usage is fine, because you are only using the HiveContext locally on
> the
re I am not doing overkill or overlooking a
potential issue.
regards
Sunita
On Tue, Oct 25, 2016 at 2:38 PM, Sunita Arvind
wrote:
> The error in the file I just shared is here:
>
> val partitionOffsetPath:String = topicDirs.consumerOffsetDir + "/" +
> partition._2(0); -
Hello Experts,
I am trying to allow null values in numeric fields. Here are the details of
the issue I have:
http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields
I also tried making all columns nullable by using the below function (from
one of the
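For reference, a sketch of the kind of helper alluded to above (assumed, not necessarily the one from the question): rebuild the schema with every top-level field marked nullable and re-apply it.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Marks only top-level fields nullable; nested structs would need a recursive version.
def setAllNullable(spark: SparkSession, df: DataFrame): DataFrame = {
  val nullableSchema = StructType(df.schema.map(_.copy(nullable = true)))
  spark.createDataFrame(df.rdd, nullableSchema)
}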
Hello Spark Experts,
I have a design question w.r.t. Spark Streaming. I have a streaming job that
consumes protocol-buffer-encoded real-time logs from an on-premise Kafka
cluster. My Spark application runs on EMR (AWS) and persists data onto S3.
Before I persist, I need to strip the header and convert p
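Sketching the rough shape such a job could take with Structured Streaming (which is where the thread below ends up); the header length, the generated protobuf class LogEvent, and all hosts and paths are placeholders, not the actual application:

import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().appName("proto-logs-to-s3").getOrCreate()
import spark.implicits._

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "onprem-broker:9092")   // placeholder
  .option("subscribe", "logs")                               // placeholder
  .load()

val HeaderLen = 16                                           // assumed fixed header size
val events = raw.select("value").as[Array[Byte]](Encoders.BINARY)
  .map { bytes =>
    val body = bytes.drop(HeaderLen)                         // strip the header
    LogEvent.parseFrom(body).toString                        // LogEvent = generated protobuf class (placeholder)
  }

val query = events.writeStream
  .format("parquet")
  .option("path", "s3://bucket/logs/")                       // placeholder
  .option("checkpointLocation", "s3://bucket/checkpoints/")  // placeholder
  .start()
query.awaitTermination()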
g even if there are hiccups or failures.
>
> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind
> wrote:
>
>> Hello Spark Experts,
>>
>> I have a design question w.r.t Spark Streaming. I have a streaming job
>> that consumes protocol buffer encoded real time logs
a
> few mins, you may have to end up creating a new file every few mins
>
> You may want to consider Kafka as your intermediary store for building a
> chain/DAG of streaming jobs
>
> On Fri, Sep 8, 2017 at 9:45 AM, Sunita Arvind
> wrote:
>
>> Thanks for your response Michae
oint to S3. On my laptop,
they all point to the local filesystem.
I am using Spark 2.2.0.
Appreciate your help.
regards
Sunita
On Wed, Aug 23, 2017 at 2:30 PM, Michael Armbrust
wrote:
> If you use structured streaming and the file sink, you can have a
> subsequent stream read using the fil
> Le 13 sept. 2017 01:51, "Sunita Arvind" a écrit :
>
> Hi Michael,
>
> I am wondering what I am doing wrong. I get error like:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Schema
> must be specified when creating a streaming source D
Hello Experts,
I am required to use a specific user id to save files on a remote HDFS
cluster. Remote in the sense that the Spark jobs run on EMR and write to a CDH
cluster. Hence I cannot change hdfs-site.xml etc. to point to the
destination cluster. As a result I am using WebHDFS to save the files in
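For reference, a hedged sketch of one way to force a particular account for the remote write (a guess at the setup, not the code from the post): with simple/pseudo authentication, WebHDFS acts as the user the request is made as, so wrapping the filesystem calls in a UserGroupInformation.doAs for that account avoids touching hdfs-site.xml. Host, port, user and paths are placeholders.

import java.net.URI
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

def copyToRemoteAs(user: String, localFile: String, remoteDir: String): Unit = {
  val ugi = UserGroupInformation.createRemoteUser(user)        // e.g. the required CDH account
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      val fs = FileSystem.get(new URI("webhdfs://cdh-namenode:50070"), new Configuration())
      fs.copyFromLocalFile(new Path(localFile), new Path(remoteDir))
      fs.close()
    }
  })
}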
(UnsupportedOperationChecker.scala:297)
regards
Sunita
On Mon, Sep 18, 2017 at 10:15 AM, Michael Armbrust
wrote:
> You specify the schema when loading a dataframe by calling
> spark.read.schema(...)...
>
> On Tue, Sep 12, 2017 at 4:50 PM, Sunita Arvind
> wrote:
>
>> Hi Micha
Had a similar situation and landed on this question.
Finally I was able to make it do what I needed by tricking the Spark driver
:)
i.e. by setting a very high value for "--conf spark.task.maxFailures=800".
I deliberately made it 800, whereas the default is typically 4. So by the time 800
attempts for failed tasks
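The same setting expressed programmatically, for reference (the app name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("retry-heavy-job")
  .config("spark.task.maxFailures", "800")   // the default is 4
  .getOrCreate()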
Hello Spark Experts,
I am having challenges using the DataSource V2 API. I created a mock
The input partitions seem to be created correctly. The below output
confirms that:
19/06/23 16:00:21 INFO root: createInputPartitions
19/06/23 16:00:21 INFO root: Create a partition for abc
The InputPartit
The below is not exactly a solution to your question, but this is what we
are doing. The first time around we do end up calling row.getString(), but we
immediately pass the result through a map function which aligns it to either a
case class or a StructType. Then we register it as a table and use just
column nam
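A minimal sketch of that pattern (the case class, column positions and view name are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

case class Event(id: String, count: Int)

def registerTyped(spark: SparkSession, rows: RDD[Row]): Unit = {
  import spark.implicits._
  val typed = rows.map(r => Event(r.getString(0), r.getInt(1)))  // the one-time getString/getInt
  typed.toDF().createOrReplaceTempView("events")
  // From here on, queries use column names: spark.sql("SELECT id, count FROM events")
}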
I was able to resolve this by adding rdd.collect() after every stage. This
forced RDD evaluation and helped avoid the choke point.
regards
Sunita Kopppar
On Sat, Jan 17, 2015 at 12:56 PM, Sunita Arvind
wrote:
> Hi,
>
> My spark jobs suddenly started getting hung and here is the debu
Hi All
We are joining large tables using Spark SQL and running into shuffle
issues. We have explored multiple options - using coalesce to reduce the number
of partitions, tuning various parameters like the disk buffer, reducing data in
chunks, etc. - all of which seem to help, by the way. What I would like to know is,
Hi Experts,
I have a large table with 54 million records (a fact table) being joined
with 6 small tables (dimension tables). The on-disk size of the small tables is
within 5k and their record counts are in the range of 4 - 200
All the worker nodes have 32 GB of RAM allocated for Spark. I have tried the
belo
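Not from the thread, but given dimension tables this small, an explicit broadcast join is a common fix for exactly this shape of query, so a sketch may help; table and column names are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("star-join").getOrCreate()
val fact = spark.table("fact")              // the 54M-row fact table
val dim  = spark.table("dim_customer")      // a few hundred rows

// Broadcasting the small side avoids shuffling the fact table at all.
val joined = fact.join(broadcast(dim), fact("customer_id") === dim("id"))
// spark.sql.autoBroadcastJoinThreshold (10 MB by default) can do this automatically.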
Hi All,
I am trying to use a function within Spark SQL which accepts 2 - 4
arguments. I was able to get past the compilation errors; however, I see the
attached runtime exception when calling it from Spark SQL.
(Refer to the attachment for the complete stack trace - StackTraceFor_runTestInSQL.)
The function itse
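For comparison, a minimal sketch of registering multi-argument functions for use from Spark SQL with current APIs (names and bodies are made up, not the function from the post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf-demo").getOrCreate()

// Each arity needs the matching register overload; overloads exist up to 22 arguments.
spark.udf.register("score2", (a: Double, b: Double) => a * b)
spark.udf.register("score4", (a: Double, b: Double, c: Double, d: Double) => (a + b) * (c + d))

spark.sql("SELECT score2(col1, col2), score4(col1, col2, col3, col4) FROM my_table").show()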
Hello Experts,
I am attempting to integrate the Spark Editor with Hue on CDH 5.0.1. I have a
Spark installation built manually from the sources for Spark 1.0.0. I am
able to integrate this with Cloudera Manager.
Background:
---
We have a 3-node VM cluster with CDH 5.0.1.
We required spa
/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/
>
> Romain
>
>
> On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind > wrote:
>
>> Hello Experts,
>>
>> I am attempting to integrate Spark Editor with
Hi,
I am exploring the GraphX library and trying to determine which use cases make
the most sense for it. From what I initially thought, it looked like
GraphX could be applied to data stored in RDBMSs, as Spark could translate
the relational data into a graph representation. However, there seems to
b
Thanks for the clarification Ankur
Appreciate it.
Regards
Sunita
On Monday, August 25, 2014, Ankur Dave wrote:
> At 2014-08-25 11:23:37 -0700, Sunita Arvind > wrote:
> > Does this "We introduce GraphX, which combines the advantages of both
> > data-parallel and gr
Hi All,
I just installed Spark on my laptop and am trying to get spark-shell to
work. Here is the error I see:
C:\spark\bin>spark-shell
Exception in thread "main" java.util.NoSuchElementException: key not found:
CLASSPATH
at scala.collection.MapLike$class.default(MapLike.scala:228)
t;
> On Tue, Nov 25, 2014 at 11:09 PM, Akhil Das
> wrote:
>
>> You could try following this guidelines
>> http://docs.sigmoidanalytics.com/index.php/How_to_build_SPARK_on_Windows
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Nov 26, 2014 at 12:24
Hi,
I need to generate some flags based on certain columns and add them back to
the schemaRDD for further operations. Do I have to use a case class
(reflection) or build the schema programmatically? I am using parquet files, so the
schema is derived automatically. This is a great feature; thanks to the Spark
developers,
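A sketch of the DataFrame-era equivalent of this (deriving a flag column without defining a case class); the column names and the condition are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().appName("add-flags").getOrCreate()
val df = spark.read.parquet("/data/events")        // schema still derived automatically

val flagged = df.withColumn(
  "is_high_value",
  when(col("amount") > 1000 && col("status") === "OK", lit(true)).otherwise(lit(false)))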
Hi,
My Spark jobs suddenly started hanging, and here is the debug output leading
to it:
Following the program, it seems to be stuck whenever I do any collect(),
count, or rdd.saveAsParquetFile call. AFAIK, any operation that requires data to
flow back to the master causes this. I increased the memory to 5 MB. Al