l.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto
>> overrides final method *getUnknownFields*
>> .()Lcom/google/protobuf/UnknownFieldSet;
>> at java.lang.ClassLoader.defineClass1(Native Method)
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>
>>
>>
>>
--
Chen Song
In Spark Streaming, when using updateStateByKey, it requires the generated
DStream to be checkpointed.
It seems that it always uses JavaSerializer, no matter what I set for
spark.serializer. Can I use KryoSerializer for checkpointing? If not, I
assume the key and value types have to be Serializable
A more general question is, how can I prevent Spark from serializing the
parent class where the RDD is defined, while still supporting passing in
functions defined in other classes?
--
Chen Song
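For context, a sketch (not the actual code from this thread) of the usual workaround: copy the needed member into a local val so the closure captures only that value, not the enclosing non-serializable instance. The Foo/Testing shapes below are hypothetical:

import org.apache.spark.rdd.RDD

class Foo extends Serializable {
  def transform(s: String): String = s.toUpperCase
}

class Testing(foo: Foo) {   // Testing itself is NOT Serializable
  def process(rdd: RDD[String]): RDD[String] = {
    // Copy the member into a local val so the closure captures only `localFoo`,
    // not `this` (which would drag the non-serializable Testing instance along).
    val localFoo = foo
    rdd.map(line => localFoo.transform(line))
  }
}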
ing even though I only pass foo (another serializable
> > object) in the closure?
> >
> > A more general question is, how can I prevent Spark from serializing the
> > parent class where the RDD is defined, while still supporting passing in
> > functions defined in other classes?
> >
> > --
> > Chen Song
> >
>
--
Chen Song
t 4:09 PM, Chen Song wrote:
> Thanks Erik. I saw the document too. That is why I am confused because as
> per the article, it should be good as long as *foo* is serializable.
> However, what I have seen is that it would work if *testing* is
> serializable, even if foo is not serializable,
further in the foreachPartition closure. So, following that article, Scala
>> will attempt to serialize the containing object/class
>> testing to get the foo instance.
>>
>> On Thu, Jul 9, 2015 at 4:11 PM, Chen Song wrote:
>>
>>> Repost the code exa
ce of data loss which can be
>>>>>>> alleviated using Write Ahead Logs (see the streaming programming guide for
>>>>>>> more
>>>>>>> details, or see my talk [Slides PDF
>>>>>>> <http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx>
>>>>>>> , Video
>>>>>>> <https://www.youtube.com/watch?v=d5UJonrruHk&list=PL-x35fyliRwgfhffEpywn4q23ykotgQJ6&index=4>
>>>>>>> ] from last Spark Summit 2015). But that approach can give
>>>>>>> duplicate records. The direct approach gives exactly-once guarantees, so
>>>>>>> you should try it out.
>>>>>>>
>>>>>>> TD
>>>>>>>
>>>>>>> On Fri, Jun 26, 2015 at 5:46 PM, Cody Koeninger
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Read the spark streaming guide and the kafka integration guide for a
>>>>>>>> better understanding of how the receiver based stream works.
>>>>>>>>
>>>>>>>> Capacity planning is specific to your environment and what the job
>>>>>>>> is actually doing, you'll need to determine it empirically.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Friday, June 26, 2015, Shushant Arora
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> In 1.2, how do I handle offset management after the stream
>>>>>>>>> application starts, in each job? Should I commit the offset manually
>>>>>>>>> after job completion?
>>>>>>>>>
>>>>>>>>> And what is the recommended number of consumer threads? Say I have 300
>>>>>>>>> partitions in the kafka cluster. Load is ~1 million events per second.
>>>>>>>>> Each event is ~500 bytes. Are 5 receivers with 60 partitions per
>>>>>>>>> receiver sufficient for spark streaming to consume?
>>>>>>>>>
>>>>>>>>> On Fri, Jun 26, 2015 at 8:40 PM, Cody Koeninger <
>>>>>>>>> c...@koeninger.org> wrote:
>>>>>>>>>
>>>>>>>>>> The receiver-based kafka createStream in spark 1.2 uses zookeeper
>>>>>>>>>> to store offsets. If you want finer-grained control over offsets,
>>>>>>>>>> you can
>>>>>>>>>> update the values in zookeeper yourself before starting the job.
>>>>>>>>>>
>>>>>>>>>> createDirectStream in spark 1.3 is still marked as experimental,
>>>>>>>>>> and subject to change. That being said, it works better for me in
>>>>>>>>>> production than the receiver based api.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 26, 2015 at 6:43 AM, Shushant Arora <
>>>>>>>>>> shushantaror...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am using spark streaming 1.2.
>>>>>>>>>>>
>>>>>>>>>>> If processing executors crash, will the receiver reset the
>>>>>>>>>>> offset back to the last processed offset?
>>>>>>>>>>>
>>>>>>>>>>> If the receiver itself crashes, is there a way to reset the
>>>>>>>>>>> offset without restarting the streaming application, other than
>>>>>>>>>>> smallest or largest?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Is spark streaming 1.3, which uses the low level consumer api,
>>>>>>>>>>> stable? And which is recommended for handling data loss, 1.2 or 1.3?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>
--
Chen Song
the offset ranges for a given rdd in the stream by
>> typecasting to HasOffsetRanges. You can then store the offsets wherever
>> you need to.
>>
>> On Tue, Jul 14, 2015 at 5:00 PM, Chen Song
>> wrote:
>>
>>> A follow up question.
>>>
>>>
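A minimal sketch of the HasOffsetRanges pattern described above, assuming the Spark 1.3+ Kafka direct stream API; brokers and topic are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("offsets-sketch")
val ssc = new StreamingContext(conf, Seconds(30))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder brokers
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))                            // placeholder topic

stream.foreachRDD { rdd =>
  // Typecast to HasOffsetRanges to see which offsets this RDD covers.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    // Store these wherever you need to (ZooKeeper, a database, etc.).
    println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
  // ... process rdd ...
}

ssc.start()
ssc.awaitTermination()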
batch interval.
* Is that even possible?
* The reason I think of this is because a user can get a list of RDDs with
DStream.window.slice, but I cannot find a way to get the most recent RDD in
the DStream.
--
Chen Song
n before asynchronous query has completed. Use this.
>
> streamingContext.remember(<duration to remember the latest RDD>)
>
>
> On Tue, Jul 14, 2015 at 6:57 PM, Chen Song wrote:
>
>> I have been doing a POC of adding a rest service in a Spark Streaming
>> job. Say I create a stateful DStream X by us
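A sketch of how the pieces above might fit together: stash the latest RDD from foreachRDD and use remember() so it is not released before the asynchronous REST query runs. The duration, source, and types are placeholders:

import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(new SparkConf().setAppName("latest-rdd-sketch"), Seconds(10))
ssc.remember(Minutes(5)) // keep generated RDDs around long enough for out-of-band queries

// Placeholder stateful stream; in the real job this would come from updateStateByKey.
val stateDStream: DStream[(String, Long)] =
  ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

// Stash the most recent RDD so a REST handler on another thread can query it.
val latest = new AtomicReference[RDD[(String, Long)]]()
stateDStream.foreachRDD(rdd => latest.set(rdd))

// Called from the REST handler thread.
def lookup(key: String): Option[Long] =
  Option(latest.get()).flatMap(rdd => rdd.filter(_._1 == key).values.collect().headOption)

ssc.start()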
Thanks TD.
As for 1), if timing is not guaranteed, how are exactly-once semantics
supported? It feels like exactly-once receiving is not necessarily
exactly-once processing.
Chen
On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das wrote:
>
>
> On Tue, Jul 14, 2015 at 6:42 PM, Chen So
(StreamingContext.scala:586)
--
Chen Song
fixed in 1.4.1 about to come out in the
> next couple of days. That should help you to narrow down the culprit
> preventing serialization.
>
> On Wed, Jul 15, 2015 at 1:12 PM, Ted Yu wrote:
>
>> Can you show us your function(s) ?
>>
>> Thanks
>>
>>
[SPARK-8091] Fix a number of
>> SerializationDebugger bugs and limitations
>> ...
>> Closes #6625 from tdas/SPARK-7180 and squashes the following commits:
>>
>> On Wed, Jul 15, 2015 at 2:37 PM, Chen Song
>> wrote:
>>
>>> Thanks
>>>
>>> Can y
Hi
I ran into problems using the class loader in Spark. In my code (run within
executor), I explicitly load classes using the ContextClassLoader as below.
Thread.currentThread().getContextClassLoader()
The jar containing the classes to be loaded is added via the --jars option
in spark-shell/spark-s
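A sketch of the pattern being described, run inside an executor-side closure; the class name is a placeholder and is assumed to live in a jar shipped with --jars:

import org.apache.spark.rdd.RDD

def transform(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { iter =>
    // On executors, jars added via --jars are typically reachable through the
    // context class loader rather than the loader of the Spark framework classes.
    val cl = Thread.currentThread().getContextClassLoader
    val clazz = cl.loadClass("com.example.MyRecordParser") // placeholder class name
    val parser = clazz.newInstance()
    iter.map(record => s"${parser.getClass.getName}: $record")
  }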
Sorry to spam people who are not interested. I'd greatly appreciate it if
anyone who is familiar with this could share some insights.
On Wed, Jul 6, 2016 at 2:28 PM Chen Song wrote:
> Hi
>
> I ran into problems using the class loader in Spark. In my code (run within
> executor), I exp
marco
>
> On Thu, Jul 7, 2016 at 5:05 PM, Chen Song wrote:
>
>> Sorry to spam people who are not interested. I'd greatly appreciate it if
>> anyone who is familiar with this could share some insights.
>>
>> On Wed, Jul 6, 2016 at 2:28 PM Chen Song wrote:
>>
>
y
> break other things (like dependencies that the Spark master required at
> initialization being overridden by the Spark app, and so on), so you will need to verify.
>
> [1] https://spark.apache.org/docs/latest/configuration.html
>
> On Thu, Jul 7, 2016 at 4:05 PM, Chen Song wrote:
>
>
For Kafka direct stream, is there a way to set the time between successive
retries? From my testing, it looks like it is 200ms. Any way I can increase
the time?
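Not confirmed in this thread, but the 200ms default matches the Kafka consumer setting refresh.leader.backoff.ms, which the direct stream appears to reuse between retries; a sketch of raising it via kafkaParams (brokers are placeholders):

// Assumption: the retry backoff comes from the Kafka consumer config
// refresh.leader.backoff.ms (default 200 ms); spark.streaming.kafka.maxRetries
// controls how many retries the driver attempts.
val kafkaParams = Map(
  "metadata.broker.list"      -> "broker1:9092,broker2:9092", // placeholder brokers
  "refresh.leader.backoff.ms" -> "1000"                       // wait 1s between retries
)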
Apologies in advance if someone has already asked and addressed this
question.
In Spark Streaming, how can I programmatically get the batch statistics
like schedule delay, total delay and processing time (They are shown in the
job UI streaming tab)? I need such information to raise alerts in some
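A sketch of one programmatic route, using the StreamingListener API; the alerting threshold and hook are placeholders:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val schedulingDelay = info.schedulingDelay.getOrElse(-1L) // ms
    val processingDelay = info.processingDelay.getOrElse(-1L) // ms
    val totalDelay      = info.totalDelay.getOrElse(-1L)      // ms
    if (totalDelay > 60000L) {
      // placeholder: hook in your alerting here
      println(s"ALERT: batch ${info.batchTime} total delay $totalDelay ms " +
        s"(scheduling $schedulingDelay ms, processing $processingDelay ms)")
    }
  }
}

// Registration (ssc is the StreamingContext of the job):
// ssc.addStreamingListener(new BatchStatsListener)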
's your opinion about my options? How do you people solve this
> problem? Anything Spark specific?
> I'll be grateful for any advice in this subject.
> Thanks!
> Krzysiek
>
>
--
Chen Song
We have a use case with the following design in Spark Streaming.
Within each batch,
* data is read and partitioned by some key
* forEachPartition is used to process the entire partition
* within each partition, there are several REST clients created to connect
to different REST services
* for the
6:07 PM, Ashish Soni wrote:
>
>> Need more details, but you might want to filter the data first (create
>> multiple RDDs) and then process.
>>
>>
>> > On Oct 5, 2015, at 8:35 PM, Chen Song wrote:
>> >
>> > We have a use case with the followin
so,
how do I handle failover of namenodes for a streaming job in Spark?
Thanks in advance for your feedback.
--
Chen Song
is right? The relevant class is below,
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala
Chen
On Tue, Nov 3, 2015 at 11:57 AM, Chen Song wrote:
> We saw the following error happening in Spark Streaming job. Our job is
that is due to the hadoop jar for cdh5.4.0 being built with JDK 7.
Has anyone done this before?
Thanks,
--
Chen Song
Thanks Sean.
So how is PySpark supported? I thought PySpark needs jdk 1.6.
Chen
On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen wrote:
> Spark 1.4 requires Java 7.
>
> On Fri, Aug 21, 2015, 3:12 PM Chen Song wrote:
>
>> I tried to build Spark 1.4.1 on cdh 5.4.0. Because
gated logs on hdfs
after the job is killed or failed.
Is there a YARN config to keep the logs from being deleted for a long-lived
streaming job?
--
Chen Song
h spark streaming
> (which kind of blocks your resources). On the other hand, if you use job
> server then you can actually use the resources (CPUs) for other jobs also
> when your dbjob is not using them.
>
> Thanks
> Best Regards
>
> On Sun, Jul 5, 2015 at 5:28 AM, ayan guha wrote:
>
> Hi All
>
> I have a requirement to connect to a DB every few minutes and bring data to
> HBase. Can anyone suggest if spark streaming would be appropriate for this
> scenario, or should I look into jobserver?
>
> Thanks in advance
>
> --
> Best Regards,
> Ayan Guha
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
--
Chen Song
nly changes every few days, why not restart the job every
> few days, and just broadcast the data?
>
> Or you can keep a local per-jvm cache with an expiry (e.g. guava cache) to
> avoid many mysql reads
>
> On Wed, Aug 26, 2015 at 9:46 AM, Chen Song wrote:
>
>> Piggyback on
Does anyone have a similar problem or thoughts on this?
On Wed, Aug 26, 2015 at 10:37 AM, Chen Song wrote:
> When running a long-lived job on YARN like Spark Streaming, I found that
> container logs are gone after days on executor nodes, although the job
> itself is still running.
>
>
> I
data. Each partition in
derivedStream only needs to be joined with the corresponding partition in
the original parent inputStream it is generated from.
My question is
1. Is there a Partitioner defined in KafkaRDD at all?
2. How would I preserve the partitioning scheme and avoid data shuffle?
--
I tried both of the following with STS but neither works for me.
Starting STS with --hiveconf hive.limit.optimize.fetch.max=50
and
Setting common.max_count in Zeppelin
Without setting such limits, a query that outputs lots of rows could cause
the driver to OOM and make STS unusable. Any workaro
"Streaming" tab, it shows the messages below:
Batch Processing Statistics
No statistics have been generated yet.
Am I doing anything wrong on the parallel receiving part?
--
Chen Song
--
Chen Song
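For comparison, a sketch of the usual parallel-receiver setup: create several receiver-based streams and union them. The ZooKeeper quorum, consumer group, topic, and receiver count are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("parallel-receivers"), Seconds(10))

val zkQuorum = "zk1:2181,zk2:2181"  // placeholder ZooKeeper quorum
val group    = "my-consumer-group"  // placeholder consumer group
val topics   = Map("my-topic" -> 2) // consumer threads per receiver for this topic

// Create N receiver-based streams and union them so consumption is spread
// across N executors instead of a single receiver.
val numReceivers = 5
val streams = (1 to numReceivers).map(_ => KafkaUtils.createStream(ssc, zkQuorum, group, topics))
val unified = ssc.union(streams)

unified.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()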
I have a map reduce job that reads from three logs and joins them on some
key column. The underlying data is protobuf messages in sequence
files. Between mappers and reducers, the underlying raw byte arrays for
protobuf messages are shuffled. Roughly, for 1G of input from HDFS, there is
2G of data outpu
Can anyone shed some light on this?
On Tue, Mar 17, 2015 at 5:23 PM, Chen Song wrote:
> I have a map reduce job that reads from three logs and joins them on some
> key column. The underlying data is protobuf messages in sequence
> files. Between mappers and reducers, the underlying
Using spark 1.3.0 on cdh5.1.0, I was running into a fetch failed exception.
I searched this email list but did not find anything like this reported.
What could be the reason for the error?
org.apache.spark.shuffle.FetchFailedException: [EMPTY_INPUT] Cannot
decompress empty stream
at
org.apach
g.
Has anyone seen this before?
--
Chen Song
imply
windowing scenario as which date the data belongs to is not the same thing
when the data arrives). How can I handle this to tell spark to discard data
for old dates?
Thank you,
Best
Chen
--
Chen Song
Hey
I am new to spark streaming and apologize if these questions have been
asked.
* In StreamingContext, reduceByKey() seems to only work on the RDDs of the
current batch interval, not including RDDs of previous batches. Is my
understanding correct?
* If the above statement is correct, what func
Thanks Anwar.
On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal wrote:
>
> On Tue, Jun 17, 2014 at 5:39 PM, Chen Song wrote:
>
>> Hey
>>
>> I am new to spark streaming and apologize if these questions have been
>> asked.
>>
>> * In StreamingContext, re
job to load the set of values from disk and cache to each
worker?
--
Chen Song
I am new to spark streaming and wondering if spark streaming tracks
counters (e.g., how many rows in each consumer, how many rows routed to an
individual reduce task, etc.) in any form so I can get an idea of how data
is skewed? I checked the spark job page but don't seem to find any.
--
Chen Song
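One approximation (a sketch, assuming the Spark 1.x accumulator API) is named accumulators, which show up on the stage pages of the UI; the StreamingContext and input DStream are assumed to exist:

// Assumes an existing StreamingContext `ssc` and an input DStream `lines`.
val rowsSeen = ssc.sparkContext.accumulator(0L, "rows seen")

lines.foreachRDD { rdd =>
  // Accumulated on the executors as records are touched.
  rdd.foreach { _ => rowsSeen += 1L }
  // The named accumulator's running total is visible in the stage pages of the UI
  // and can also be read here on the driver:
  println(s"rows seen so far: ${rowsSeen.value}")
}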
ually for spark streaming and doesn't seem to be a question for this
group. But I am raising it anyway here.)
I have looked at the code example below, but it doesn't seem to be supported.
KafkaUtils.createStream ...
Thanks, All
--
Chen Song
t a lot of pieces to
> make sense of the flow..
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Mon, Jun 30, 2014 at 9:58 PM, Chen Song wrote:
>
>> I am new to
.
Any idea on how I should do this?
--
Chen Song
Thanks Luis and Tobias.
On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer wrote:
> Hi,
>
> On Wed, Jul 2, 2014 at 1:57 AM, Chen Song wrote:
>>
>> * Is there a way to control how far a Kafka DStream can read on a
>> topic-partition (via offset, for example)? By setting th
t works.
> Andrew
>
>
> 2014-07-17 18:15 GMT-07:00 Tathagata Das :
>
> Can you check in the environment tab of Spark web ui to see whether this
>> configuration parameter is in effect?
>>
>> TD
>>
>>
>> On Thu, Jul 17, 2014 at 2:05 PM, Che
-XX:+UseConcMarkSweepGC -Dspark.config.one=value
> -Dspark.config.two=value"
>
> (Please note that this is only for Spark 0.9. The part where we set Spark
> options within SPARK_JAVA_OPTS is deprecated as of 1.0)
>
>
> 2014-07-17 21:08 GMT-07:00 Chen Song :
>
> Thanks Andr
Jul 18, 2014 at 8:11 AM, Bill Jay
>> wrote:
>>
>>> I also have an issue consuming from Kafka. When I consume from Kafka,
>>> there is always a single executor working on this job. Even if I use
>>> repartition, it seems that there is still a single executor. Does an
many connections are made to Kafka. Note that the number of
>> Kafka partitions for that topic must be at least N for this to work.
>>
>> Thanks
>> Tobias
>>
>
>
--
Chen Song
/day.
--
Chen Song
, no matter what value I set dfs.blocksize on
driver/executor. I am wondering if there is a way to increase parallelism,
say, let each map read 128M of data and increase the number of map tasks?
--
Chen Song
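A sketch of the workaround suggested later in the thread: cap the input split size so more map tasks are created. The path is a placeholder, sc is the existing SparkContext, and the property is honored by the new-API input formats:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Cap each input split at 128 MB so a large file is read by more, smaller map tasks.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (128L * 1024 * 1024).toString)

// The property applies to the new-API input formats, hence newAPIHadoopFile here.
val lines = sc
  .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/big-input") // placeholder path
  .map(_._2.toString)

println(s"partitions: ${lines.partitions.length}")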
, Jul 25, 2014 at 1:32 PM, Bill Jay
>> wrote:
>>
>>> Hi,
>>>
>>> I am running a Spark Streaming job that uses saveAsTextFiles to save
>>> results into hdfs files. However, it has an exception after 20 batches
>>>
>>>
>>> result-140631234/_temporary/0/task_201407251119__m_03 does not
>>> exist.
>>>
>>>
>>> When the job is running, I do not change any file in the folder. Does
>>> anyone know why the file cannot be found?
>>>
>>> Thanks!
>>>
>>> Bill
>>>
>>
>>
>
--
Chen Song
The exception was thrown in the application master (spark streaming driver)
and the job shut down after this exception.
On Mon, Aug 11, 2014 at 10:29 AM, Chen Song wrote:
> I got the same exception after the streaming job runs for a while. The
> ERROR message was complaining about a tem
Bill
Did you get this resolved somehow? Does anyone have any insight into this problem?
Chen
On Mon, Aug 11, 2014 at 10:30 AM, Chen Song wrote:
> The exception was thrown in the application master (spark streaming driver)
> and the job shut down after this exception.
>
>
> On Mon,
n seeing similar stacktraces on Spark core (not streaming)
> and have a theory it's related to spark.speculation being turned on. Do
> you have that enabled by chance?
>
>
> On Mon, Aug 11, 2014 at 8:10 AM, Chen Song wrote:
>
>> Bill
>>
>> Did you get this r
could probably set the system property
> "mapreduce.input.fileinputformat.split.maxsize".
>
> Regards,
> Paul Hamilton
>
> From: Chen Song
> Date: Friday, August 8, 2014 at 9:13 PM
> To: "user@spark.apache.org"
> Subject: increase parallelism of reading from hdfs
>
>
> In S
re's no reliable way to run large jobs on Spark
>> right now!
>>
>> I'm going to file a bug for the _temporary and LeaseExpiredExceptions as
>> I think these are widespread enough that we need a place to track a
>> resolution.
>>
>>
>&
/16 06:48:22 INFO BlockManagerMaster: Trying to register BlockManager
14/09/16 06:48:22 INFO BlockManagerMaster: Registered BlockManager
14/09/16 06:48:22 INFO BlockManager: Reporting 63 blocks to the master.
--
Chen Song
)
--
Chen Song
:56 libsnappy.so ->
libsnappy.so.1.1.3
Chen
On Sat, Sep 20, 2014 at 9:31 AM, Ted Yu wrote:
> Can you tell us how you installed native snappy ?
>
> See D.3.1.5 in:
> http://hbase.apache.org/book.html#snappy.compression.installation
>
> Cheers
>
> On Sat, Sep 20, 2014 at 2:1
I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
following exception.
java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec
at
org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
I am running the job on 500 executors, each with 8G and 1 core.
See lots of fetch failures on the reduce stage, when running a simple
reduceByKey
map tasks -> 4000
reduce tasks -> 200
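Not certain this is the culprit, but the 60 sec in the message matches spark.core.connection.ack.wait.timeout (default 60 in Spark 1.1); a sketch of raising it so long GC pauses do not turn into fetch failures:

import org.apache.spark.SparkConf

// Assumption: the "ack was not received within 60 sec" timeout is governed by
// spark.core.connection.ack.wait.timeout (seconds, default 60 in Spark 1.1).
val conf = new SparkConf()
  .setAppName("fetch-failure-tuning")
  .set("spark.core.connection.ack.wait.timeout", "300") // seconds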
On Mon, Sep 22, 2014 at 12:22 PM, Chen Song wrote:
> I am using Spark 1.1.0 and have seen a lot
t; val trainingDataTable = sql("""SELECT prod.prod_num,
> demographics.gender, demographics.birth_year, demographics.income_group
> FROM prod p JOIN demographics d ON d.user_id = p.user_id""")
>
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> MultiInstanceRelations
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> CaseInsensitiveAttributeReferences
> java.lang.RuntimeException: Table Not Found: prod.
>
> I have these tables in hive. I used show tables command to confirm this.
> Can someone please let me know how I can make them accessible here?
>
>
>
>
>
>
--
Chen Song
ion.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at com.appnexus.data.spark.sql.Test$.main(Test.scala:23)
at com.appnexus.data.spark.sql.Test.main(Test.scala)
... 5 more
--
Chen Song
attributes) when running a simple select on valid column names and malformed
column names. This led me to suspect that something syntactic breaks somewhere.
select [valid_column] from table limit 5;
select [malformed_typo_column] from table limit 5;
On Mon, Oct 13, 2014 at 6:04 PM, Chen Song wrote:
> In H
Looks like it may be related to
https://issues.apache.org/jira/browse/SPARK-3807.
I will build from branch 1.1 to see if the issue is resolved.
Chen
On Tue, Oct 14, 2014 at 10:33 AM, Chen Song wrote:
> Sorry for bringing this up again, as I have no clue what could have
> caused this.
>>>
>>>
>>>
>>
>
--
Chen Song
Silly question: what is the best way to shuffle protobuf messages in a Spark
(Streaming) job? Can I use Kryo on top of the protobuf Message type?
--
Chen Song
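One sketch, assuming the chill-protobuf add-on (com.twitter:chill-protobuf) is on the classpath; the generated message class name is a placeholder:

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.protobuf.ProtobufSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class ProtobufRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register the protobuf-generated class (placeholder name) with chill's
    // ProtobufSerializer so Kryo shuffles it via the message's own
    // toByteArray/parseFrom instead of Java serialization.
    kryo.register(Class.forName("com.example.protos.Events$MyEvent"), new ProtobufSerializer())
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ProtobufRegistrator].getName)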
--
Chen Song
Does anyone have suggestions?
On Tue, Dec 23, 2014 at 3:08 PM, Chen Song wrote:
> Silly question: what is the best way to shuffle protobuf messages in a Spark
> (Streaming) job? Can I use Kryo on top of the protobuf Message type?
>
> --
> Chen Song
>
>
--
Chen Song