Is there a proper way to make or get an Encoder for Option in Spark 2.0?
There isn't one by default, and while ExpressionEncoder from Catalyst will
work, it is private and unsupported.
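One workaround sketch (not an official API, and the names below are made up for illustration) is to wrap the Option in a case class so the implicit product encoder applies:

// Sketch: wrapping Option in a case class so the built-in product encoder
// can be derived. Assumes `spark` is an existing SparkSession.
import spark.implicits._

case class OptHolder(value: Option[Int])   // hypothetical wrapper

val ds = Seq(OptHolder(Some(1)), OptHolder(None)).toDS()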
--
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations. We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the equivalent be in the Dataset interface? I do not see a
>> similar operation there.
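For reference, one typed-API analogue of reduceByKey is groupByKey followed by reduceGroups; this is only a sketch assuming a Dataset of pairs, not necessarily what was recommended in the thread:

// Sketch: Dataset analogue of RDD.reduceByKey for a Dataset[(String, Int)].
// Assumes `spark` is an existing SparkSession.
import spark.implicits._

val pairs = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
val reduced = pairs
  .groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2))   // combine values per key
  .map(_._2)                                     // drop the duplicated key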
This is likely fixed in 2.0. If you can get a reproduction
> working there it would be very helpful if you could open a JIRA.
>
> On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher wrote:
>
>> A quick unit test attempt didn't get far replacing map with as[], I'm
>>
e the default.
>
> That said, I would like to enable that kind of sugar while still taking
> advantage of all the optimizations going on under the covers. Can you get
> it to work if you use `as[...]` instead of `map`?
>
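For illustration, a minimal sketch of the as[...] route; the case class and columns here are assumptions, not from the thread:

// Sketch: converting a DataFrame to a typed Dataset with as[...] rather than
// a per-row map, keeping Catalyst's optimizations. Assumes a SparkSession `spark`.
import spark.implicits._

case class Record(id: Long, name: String)   // hypothetical schema

val df = Seq((1L, "a"), (2L, "b")).toDF("id", "name")
val ds = df.as[Record]                      // no lambda, stays in the optimized plan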
> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher wrote:
On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust wrote:
> Thanks for the feedback. I think this will address at least some of the
> problems you are describing: https://github.com/apache/spark/pull/13425
>
> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher wrote:
>
for example, defaults to -1 instead of null. Now it's
completely ambiguous whether the data in the join was actually there or was
populated via this atypical semantic.
Are there additional options available to work around this issue? I can
convert to RDD and back to Dataset, but that's less than ideal.
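One hedged workaround sketch: model the nullable side of the join with Option fields in the result case class, so missing matches come back as None rather than a sentinel like -1. The classes below are illustrative only.

// Sketch: Option fields make an outer join yield None instead of a default
// primitive value. Assumes `spark` is a SparkSession; the types are made up.
import spark.implicits._

case class LeftRec(id: Long, value: Int)
case class Joined(id: Long, value: Option[Int])

val left = Seq(LeftRec(1L, 10)).toDS()
val ids = Seq(1L, 2L).toDF("id")
val joined = ids.join(left.toDF(), Seq("id"), "left_outer").as[Joined]
// id = 2 comes back as Joined(2, None) rather than Joined(2, -1)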
I was able to trace the leak down to a couple of thread
pools that were not shut down properly, by noticing the named threads
accumulating in thread dumps of the JVM process.
On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher
wrote:
> Thanks for the response.
>
> The version is Spark 1.5.2.
> Best Regards,
> Shixiong Zhu
>
> 2015-12-07 14:30 GMT-08:00 Richard Marscher :
>
>> Hi,
>>
>> I've been running benchmarks against Spark in local mode in a long
>> running process. I'm seeing threads leaking each time it runs a job. It
>> doesn'
What I'm curious to know is whether anyone else has seen a similar issue.
ldin EDU
wrote:
> You can try to run "jstack" a couple of times while the app is hung to
> look for patterns for where the app is hung.
> --
> Ali
>
>
> On Dec 3, 2015, at 8:27 AM, Richard Marscher
> wrote:
>
I should add that the pauses are not from GC. Also, tracing the CPU
call tree in the JVM, it looks like nothing is doing any work; it just seems
to be idling or blocking.
On Thu, Dec 3, 2015 at 11:24 AM, Richard Marscher
wrote:
> Hi,
>
> I'm doing some testing of workloads using
to periodically
pause for such a long time.
Has anyone else seen similar behavior or aware of some quirk of local mode
that could cause this kind of blocking?
of memory in
the cluster. I think that's a fundamental problem up front. If the data is
skewed, that will be even worse for doing aggregation. To start, the data
either needs to be broken down or the cluster upgraded, unfortunately.
On Wed, Sep 9, 2015 at 5:41 PM, Richard Marscher
wrote:
> after the
> fact, when temporary files have been cleaned up).
>
> Has anyone run into something like this before? I would be happy to see
> OOM errors, because that would be consistent with one understanding of what
> might be going on, but I haven't yet.
>
> Eric
>
>
*Mike Wright*
> Principal Architect, Software Engineering
>
> SNL Financial LC
> 434-951-7816 *p*
> 434-244-4466 *f*
> 540-470-0119 *m*
>
> mwri...@snl.com
>
> On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher wrote:
>
>> That seems like it could work,
> increase the number of partitions for the RDD to make use of the
> cluster. Is calling repartition() after this the only option, or can I pass
> something in the above method to have more partitions in the RDD?
>
> Please let me know.
>
> Thanks.
>
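For reference, a sketch of the usual options; the path and counts below are placeholders. You can request a partition count when the RDD is created, or repartition afterwards.

// Sketch: asking for more partitions up front vs. repartitioning later.
// Assumes an existing SparkContext `sc`; the input path is hypothetical.
val fromCollection = sc.parallelize(1 to 1000000, numSlices = 64)
val fromText = sc.textFile("/path/to/input", minPartitions = 64)
val widened = fromText.repartition(128)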
> it to every node (assuming B is relatively small enough to fit)?
>
>
>
> On Tue, Aug 4, 2015 at 8:23 AM, Richard Marscher
> wrote:
> > Yes it does, in fact it's probably going to be one of the more expensive
> > shuffles you could trigger.
> >
> > On Mon, Aug 3, 201
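A hedged sketch of the broadcast alternative, assuming B is small enough to fit in memory; the data here is illustrative only:

// Sketch: broadcast the small dataset B and join against it map-side,
// avoiding the shuffle entirely. Assumes an existing SparkContext `sc`.
val smallB = Map("a" -> 1, "b" -> 2)                       // stand-in for B
val bigRdd = sc.parallelize(Seq(("a", 10L), ("c", 20L)))   // stand-in for the large side
val bBroadcast = sc.broadcast(smallB)

val joined = bigRdd.mapPartitions { iter =>
  val lookup = bBroadcast.value
  iter.flatMap { case (k, v) => lookup.get(k).map(w => (k, (v, w))) }
}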
>>> myRDD.repartition(10)
>>>   .mapPartitions(keyValues => getResults(keyValues))
>>>
>>> The mapPartitions does some initialization to the SolrJ client per
>>> partition and then hits it for each record in the partition.
> threads in a Spark executor? Is it a
> good idea to create user threads in a Spark map task?
>
> Thanks
>
>
> Is there any guide or link that I can refer to?
>
> Thank you very much.
>
> -Naveen
>
>
>
>
ct not serializable (class:
>>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$, value:
>>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$@10b7e824)
>>> > - field (class:
>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
>>>
wrote:
> My singletons do in fact stick around. They're one per worker, looks
> like. So with 4 workers running on the box, we're creating one singleton
> per worker process/jvm, which seems OK.
>
> Still curious about foreachPartition vs. foreachRDD though...
>
>
Would it be possible to have a wrapper class that just represents a
reference to a singleton holding the 3rd party object? It could proxy
calls over to the singleton object, which would instantiate a private instance
of the 3rd party object lazily. I think something like this might work if the
worker
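A rough sketch of that idea; ThirdPartyClient is a hypothetical stand-in for the non-serializable object:

// Sketch: a serializable proxy that defers to a per-JVM singleton, so the
// non-serializable third-party object is built lazily on each worker.
class ThirdPartyClient { def send(s: String): Unit = () }   // placeholder

object ClientHolder {
  lazy val client = new ThirdPartyClient()                  // one per worker JVM
}

class ClientProxy extends Serializable {
  def send(record: String): Unit = ClientHolder.client.send(record)
}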
This should work:
val output: RDD[(DetailInputRecord, VISummary)] =
  sc.parallelize(Seq.empty[(DetailInputRecord, VISummary)])
On Mon, Jul 6, 2015 at 5:11 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
> I need to return an empty RDD of type
>
> val output: RDD[(DetailInputRecord, VISummary)]
>
>
>
> This does not work.
On Thu, Jul 2, 2015 at 4:37 AM, Sjoerd Mulder
wrote:
> Hi Richard,
>
> I have actually applied the following fix to our 1.4.0 version and this
> seem to resolve the zombies :)
>
> https://github.com/apache/spark/pull/7077/files
>
> Sjoerd
>
> 2015-06-26 20:08 GMT+0
Hi,
not sure what the context is but I think you can do something similar with
mapPartitions:
rdd.mapPartitions { iterator =>
  iterator.grouped(5).map { tupleGroup => emitOneRddForGroup(tupleGroup) }
}
The edge case is when the final grouping doesn't have exactly 5 items, if
that matters.
On
Regards,
Richard
On Fri, Jun 26, 2015 at 12:08 PM, Sjoerd Mulder
wrote:
> Hi Richard,
>
> I would like to see how we can get a workaround to get out of the Zombie
> situation since were planning for production :)
>
> If you could share the workaround or point directions that woul
We've seen this issue as well in production. We also aren't sure what
causes it, but have just recently shaded some of the Spark code in
TaskSchedulerImpl that we use to effectively bubble up an exception from
Spark instead of hanging as a zombie in this situation. If you are interested,
I can go into more detail.
There should be no difference, assuming you don't use the intermediate
RDD values you are creating for anything else (rdd1, rdd2). In the
first example it still creates these intermediate RDD objects; you are
just using them implicitly and not storing the value.
It's also worth pointing
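For illustration, a sketch of the two forms being compared; `sc` is assumed to be an existing SparkContext and the path is hypothetical:

// Sketch: the chained form and the named-intermediate form build the same
// lazy lineage; neither materializes rdd1/rdd2 unless they're reused.
val chained = sc.textFile("/path/to/input").map(_.toUpperCase).filter(_.nonEmpty)

val rdd1 = sc.textFile("/path/to/input")
val rdd2 = rdd1.map(_.toUpperCase)
val rdd3 = rdd2.filter(_.nonEmpty)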
Hi,
can you detail the symptom further? Was it that only 12 requests were
serviced and the other 440 timed out? I don't think that Spark is well
suited for this kind of workload, or at least the way it is being
represented. How long does a single request take Spark to complete?
Even with fair scheduling
Is spoutLog just a non-Spark file writer? If you run that in the map call
on a cluster, it's going to be writing to the filesystem of the executor it's
being run on. I'm not sure if that's what you intended.
On Mon, Jun 22, 2015 at 1:35 PM, anshu shukla
wrote:
> Running perfectly in local system bu
It would be the "40%", although it's probably better to think of it as
shuffle vs. data cache, with the remainder going to tasks. As the comments for
the shuffle memory fraction configuration clarify, it will take memory at the
expense of the storage/data cache fraction:
spark.shuffle.memoryFraction
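For illustration, a sketch of those legacy (pre-1.6, static memory model) settings; the exact values are arbitrary:

// Sketch: shifting memory between shuffle and storage in the Spark 1.x
// static memory model; what's left over goes to task execution.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4")   // shuffle buffers
  .set("spark.storage.memoryFraction", "0.4")   // cached/storage blocks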
Hi,
if you now want to write one file per partition, that's actually built into
Spark as:
*saveAsTextFile*(*path*): Write the elements of the dataset as a text file
(or set of text files) in a given directory in the local filesystem, HDFS
or any other Hadoop-supported file system. Spark will call toString on each
element to convert it to a line of text in the file.
Hi,
I don't have a complete answer to your questions but:
"Removing the suffix does not solve the problem" -> unfortunately this is
true, the master web UI only tries to build out a Spark UI from the event
logs once, at the time the context is closed. If the event logs are
in-progress at this tim
Hi,
we've been seeing occasional issues in production with the FileOutputCommitter
reaching a deadlock situation.
We are writing our data to S3 and currently have speculation enabled. What
we see is that Spark gets a file-not-found error trying to access a
temporary part file that it wrote (part-#2
I think if you create a bidirectional mapping from AnalyticsEvent to
another type that would wrap it and use the nonce as its equality, you
could then do something like reduceByKey to group by nonce and map back to
AnalyticsEvent after.
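A hedged sketch of that shape; AnalyticsEvent and its nonce field are assumed stand-ins from the thread, and `sc` is an existing SparkContext:

// Sketch: key events by nonce, reduce to one event per nonce, then unwrap.
import org.apache.spark.rdd.RDD

case class AnalyticsEvent(nonce: String, payload: String)

val events: RDD[AnalyticsEvent] = sc.parallelize(Seq(
  AnalyticsEvent("n1", "a"), AnalyticsEvent("n1", "b"), AnalyticsEvent("n2", "c")))

val deduped: RDD[AnalyticsEvent] = events
  .map(e => (e.nonce, e))              // nonce becomes the equality key
  .reduceByKey((first, _) => first)    // keep one event per nonce
  .values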
On Thu, Jun 4, 2015 at 1:10 PM, lbierman wrote:
> I'm still
It is possible to start multiple concurrent drivers; Spark dynamically
allocates ports per "spark application" on the driver, master, and workers from
a port range. When you collect results back to the driver, they do not go
through the master. The master is mostly there as a coordinator between the
dr
I think the short answer to the question is, no, there is no alternate API
that will not use the System.exit calls. You can craft a workaround like is
being suggested in this thread. For comparison, we are doing programmatic
submission of applications in a long-running client application. To get
ar
I had the same issue a couple days ago. It's a bug in 1.3.0 that is fixed
in 1.3.1 and up.
https://issues.apache.org/jira/browse/SPARK-6036
The ordering that the event logs are moved from in-progress to complete is
coded to be after the Master tries to build the history page for the logs.
The onl
Are you sure it's memory related? What is the disk utilization and IO
performance on the workers? The error you posted looks to be related to
shuffle trying to obtain block data from another worker node and failing to
do so in a reasonable amount of time. It may still be memory related, but I'm
not sure.
Ah, apologies, I found an existing issue and fix has already gone out for
this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036.
On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher
wrote:
> It looks like it is possibly a race condition between removing the
> IN_PROGRESS
closed before the dagScheduler?
Thanks,
Richard
On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher
wrote:
> Hi,
>
> In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
> folder on a Standalone cluster. This is generally working, all the logs are
> being written. H
Hi,
In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from the Master Web UI, the vast majority of
completed applications are labeled as not having a history:
http://xxx.xxx.xxx
> confirmation that FAIR scheduling is
> supported for Spark Streaming Apps
>
>
>
> *From:* Richard Marscher [mailto:rmarsc...@localytics.com]
> *Sent:* Friday, May 15, 2015 7:20 PM
> *To:* Evo Eftimov
> *Cc:* Tathagata Das; user
> *Subject:* Re: Spark Fair Scheduler for Spark St
The doc is a bit confusing IMO, but at least for my application I had to
use a fair pool configuration to get my stages to be scheduled with FAIR.
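For illustration, a sketch of the pool setup this refers to; the file path and pool name are arbitrary, and the XML file itself is what defines the FAIR pool:

// Sketch: enable FAIR scheduling, point at an allocation file, and assign
// work to a pool. Paths and names here are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-pool-example")
  .setMaster("local[2]")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val sc = new SparkContext(conf)
sc.setLocalProperty("spark.scheduler.pool", "myPool")   // jobs from this thread use the pool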
On Fri, May 15, 2015 at 2:13 PM, Evo Eftimov wrote:
> No pools for the moment – for each of the apps using the straightforward
> way with the spark c
I believe the issue in b and c is that you call iter.size which actually is
going to flush the iterator so the subsequent attempt to put it into a
vector will yield 0 items. You could use an ArrayBuilder for example and
not need to rely on knowing the size of the iterator.
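A minimal sketch of the suggestion, assuming `sc` is an existing SparkContext:

// Sketch: build up a partition's contents in a single pass instead of
// calling iter.size first, which would exhaust the iterator.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val perPartition = rdd.mapPartitions { iter =>
  val builder = Array.newBuilder[Int]
  iter.foreach(builder += _)          // single traversal
  Iterator(builder.result())
}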
On Mon, May 11, 2015 at
I should also add I've recently seen this issue as well when using collect.
I believe in my case it was related to heap space on the driver program not
being able to handle the returned collection.
On Thu, May 7, 2015 at 11:05 AM, Richard Marscher
wrote:
> By default you would expect
By default you would expect to find the logs files for master and workers
in the relative `logs` directory from the root of the Spark installation on
each of the respective nodes in the cluster.
On Thu, May 7, 2015 at 10:27 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
Hi,
I think you may want to use this setting?:
spark.task.maxFailures (default: 4): Number of individual task failures before
giving up on the job. Should be greater than or equal to 1. Number of allowed
retries = this value - 1.
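For illustration, a sketch of setting it via SparkConf; the value shown is arbitrary:

// Sketch: fail the job after the first task failure instead of retrying.
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.task.maxFailures", "1")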
On Thu, May 7, 2015 at 2:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
> How i can stop Spark to
Hi,
do you have information on how many partitions/tasks the stage/job is
running? By default there is 1 core per task, and your number of concurrent
tasks may be limiting core utilization.
There are a few settings you could play with, assuming your issue is
related to the above:
spark.default.parallelism
In regards to the large GC pauses, assuming you allocated all 100GB of
memory per worker, you may consider running with less memory on your Worker
nodes, or splitting up the available memory on the Worker nodes amongst
several worker instances. The JVM's garbage collection starts to become
very slow
I'm not sure, but I wonder if, because you are using the Spark REPL, it
may not be representing what a normal runtime execution would look like, and
is possibly eagerly running a partial DAG once you define an operation that
would cause a shuffle.
What happens if you set up the same set of commands
Hi,
I can offer a few ideas to investigate in regards to your issue here. I've
run into resource issues doing shuffle operations with a much smaller
dataset than 2B. The data is going to be saved to disk by the BlockManager
as part of the shuffle and then redistributed across the cluster as
releva
Alexander wrote:
> Hi Richard,
>
>
>
> There are several values of time per id. Is there a way to perform group
> by id and sort by time in Spark?
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Richard Marscher [mailto:rmarsc...@localytics.com]
> *Se
Hi,
that error seems to indicate the basic query is not properly expressed. If
you group by just ID, then that means it would need to aggregate all the
time values into one value per ID, so you can't sort by it. Thus it tries
to suggest an aggregate function for time so you can have one value per ID.
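For illustration, a sketch of the two different shapes, assuming a DataFrame `df` with "id" and "time" columns:

// Sketch: either aggregate time to one value per id, or keep all rows and
// order them by id and time. The DataFrame `df` is assumed.
import org.apache.spark.sql.functions._

val latestPerId = df.groupBy("id").agg(max("time").as("latest_time"))
val orderedRows = df.sort(col("id"), col("time"))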
Could you possibly describe what you are trying to learn how to do in more
detail? Some basics of submitting programmatically:
- Create a SparkContext instance and use that to build your RDDs
- You can only have 1 SparkContext per JVM you are running, so if you need
to satisfy concurrent job requests
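A bare-bones sketch of that route; the master URL and app name are placeholders:

// Sketch: constructing a SparkContext directly in a client application
// instead of going through spark-submit. The master URL is hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("programmatic-submission-example")
  .setMaster("spark://master-host:7077")

val sc = new SparkContext(conf)
val counts = sc.parallelize(1 to 1000).map(_ % 10).countByValue()
sc.stop()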
If it fails with sbt-assembly but not without it, then there's always the
likelihood of a classpath issue. What dependencies are you rolling up into
your assembly jar?
On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa
wrote:
> Any ideas ?
>
> On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa
Hi,
I've gotten an application working with sbt-assembly and Spark, so I thought
I'd present an option. In my experience, trying to bundle any of the Spark
libraries in your uber jar is going to be a major pain. There will be a lot
of deduplication to work through, and even if you resolve it, it can be
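For reference, a sketch of the usual route in build.sbt: mark the Spark artifacts as "provided" so sbt-assembly leaves them out of the uber jar. The versions shown are illustrative.

// Sketch build.sbt fragment: keep Spark out of the assembly jar so the
// cluster's own Spark classes are used at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.3.1" % "provided"
)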