Option Encoder

2016-06-23 Thread Richard Marscher
Is there a proper way to make or get an Encoder for Option in Spark 2.0? There isn't one by default and while ExpressionEncoder from catalyst will work, it is private and unsupported. -- *Richard Marscher* Senior Software Engineer Localytics Localytics.com <http://localytics.com/>
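
A common workaround (an assumption, not from the thread) is to wrap the Option in a case class, since Spark 2.0 derives encoders for Products; `Wrapped` and the field names here are illustrative:

    import org.apache.spark.sql.SparkSession

    case class Wrapped(value: Option[Int])

    val spark = SparkSession.builder().master("local[2]").appName("opt").getOrCreate()
    import spark.implicits._

    // The product encoder handles the nested Option field.
    val ds = spark.createDataset(Seq(Wrapped(Some(1)), Wrapped(None)))
    ds.show()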

Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher
> >> Hello. >> >> I am looking at the option of moving RDD based operations to Dataset >> based operations. We are calling 'reduceByKey' on some pair RDDs we have. >> What would the equivalent be in the Dataset interface - I do not see a >> sim
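
A minimal sketch of the closest Dataset equivalent (Spark 2.0 API), assuming an existing SparkSession named `spark` and illustrative (String, Int) pairs:

    import spark.implicits._

    val pairs = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    val reduced = pairs
      .groupByKey { case (k, _) => k }                 // KeyValueGroupedDataset
      .reduceGroups { (l, r) => (l._1, l._2 + r._2) }  // plays the role of reduceByKey
      .map { case (_, kv) => kv }                      // drop the duplicated key column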

Re: Dataset Outer Join vs RDD Outer Join

2016-06-07 Thread Richard Marscher
is likely fixed in 2.0. If you can get a reproduction > working there it would be very helpful if you could open a JIRA. > > On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher > wrote: > >> A quick unit test attempt didn't get far replacing map with as[], I'm >>

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Richard Marscher
e the default. > > That said, I would like to enable that kind of sugar while still taking > advantage of all the optimizations going on under the covers. Can you get > it to work if you use `as[...]` instead of `map`? > > On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher < &

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
6 On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust wrote: > Thanks for the feedback. I think this will address at least some of the > problems you are describing: https://github.com/apache/spark/pull/13425 > > On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher > wrote: >

Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
ample, defaults to -1 instead of null. Now it's completely ambiguous what data in the join was actually there versus populated via this atypical semantic. Are there additional options available to work around this issue? I can convert to RDD and back to Dataset but that's less than ide
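
One hedged workaround (not from the thread) is joinWith, which keeps the unmatched side as a null object rather than substituting per-field defaults; a sketch assuming an existing SparkSession `spark` and two illustrative case classes:

    import spark.implicits._

    case class Person(id: Int, name: String)
    case class Addr(id: Int, city: String)

    val people = Seq(Person(1, "a"), Person(2, "b")).toDS()
    val addrs  = Seq(Addr(1, "x")).toDS()

    // joinWith yields (Person, Addr); the Addr side is null when unmatched,
    // so wrap it in Option explicitly instead of trusting a -1 style default.
    val joined = people
      .joinWith(addrs, people("id") === addrs("id"), "left_outer")
      .map { case (p, a) => (p.name, Option(a).map(_.city)) }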

Re: Local Mode: Executor thread leak?

2015-12-08 Thread Richard Marscher
was able to trace down the leak to a couple thread pools that were not shut down properly by noticing the named threads accumulating in thread dumps of the JVM process. On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher wrote: > Thanks for the response. > > The version is Spark 1.5.2.
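
A small helper (an assumption, not from the thread) for spotting accumulating threads between jobs by grouping live thread names:

    import scala.collection.JavaConverters._

    def threadCounts(): Map[String, Int] =
      Thread.getAllStackTraces.keySet.asScala
        .groupBy(_.getName.replaceAll("[0-9]+", "N"))  // collapse pool numbering
        .mapValues(_.size)
        .toMap

    // Run before and after a job; growing counts point at the leaking pool.
    threadCounts().toSeq.sortBy(-_._2).take(10).foreach(println)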

Re: Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
Best Regards, > Shixiong Zhu > > 2015-12-07 14:30 GMT-08:00 Richard Marscher : > >> Hi, >> >> I've been running benchmarks against Spark in local mode in a long >> running process. I'm seeing threads leaking each time it runs a job. It >> doesn'

Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
w. What I'm curious to know is if anyone has seen a similar issue? -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/>

Re: Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
ldin EDU wrote: > You can try to run "jstack" a couple of times while the app is hung to > look for patterns for where the app is hung. > -- > Ali > > > On Dec 3, 2015, at 8:27 AM, Richard Marscher > wrote: > > I should add that the pauses are not from G

Re: Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
I should add that the pauses are not from GC and also in tracing the CPU call tree in the JVM it seems like nothing is doing any work, just seems to be idling or blocking. On Thu, Dec 3, 2015 at 11:24 AM, Richard Marscher wrote: > Hi, > > I'm doing some testing of workloads using

Local mode: Stages hang for minutes

2015-12-03 Thread Richard Marscher
to periodically pause for such a long time. Has anyone else seen similar behavior or aware of some quirk of local mode that could cause this kind of blocking? -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/> | Our Blog <http://localytics.c

Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher
f memory in the cluster. I think that's a fundamental problem up front. If it's skewed then that will be even worse for doing aggregation. I think to start the data either needs to be broken down or the cluster upgraded unfortunately. On Wed, Sep 9, 2015 at 5:41 PM, Richard Marscher wrote:

Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Richard Marscher
fter the > fact, when temporary files have been cleaned up). > > Has anyone run into something like this before? I would be happy to see > OOM errors, because that would be consistent with one understanding of what > might be going on, but I haven't yet. > > Eric > >

Re: What should be the optimal value for spark.sql.shuffle.partition?

2015-09-09 Thread Richard Marscher
ark User List mailing list archive at Nabble.com. > -- *Richard Marscher* Software Engineer L

Re: New to Spark - Paritioning Question

2015-09-09 Thread Richard Marscher
*Mike Wright* > Principal Architect, Software Engineering > > SNL Financial LC > 434-951-7816 *p* > 434-244-4466 *f* > 540-470-0119 *m* > > mwri...@snl.com > > On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher > wrote: > >> That seems like it could work,

Re: New to Spark - Paritioning Question

2015-09-08 Thread Richard Marscher
View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html > Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-08 Thread Richard Marscher
m/Why-is-huge-data-shuffling-in-Spark-when-using-union-coalesce-1-false-on-DataFrame-tp24581.html > Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Richard Marscher
> >> > -- > Patanachai > -- *Richard Marscher* Software Engine

Re: Repartition question

2015-08-04 Thread Richard Marscher
ncrease the nbr of partitions for the RDD to make use of the > cluster. Is calling repartition() after this the only option or can I pass > something in the above method to have more partitions of the RDD. > > Please let me know. > > Thanks. > -- *Richard Marscher* S
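
For reference, a sketch of both options (assumes an existing SparkContext `sc`; the paths and counts are illustrative): many source methods take a partition-count hint directly, otherwise repartition afterward.

    val fromFile = sc.textFile("hdfs://host/path/input", 64)  // minPartitions hint
    val fromSeq  = sc.parallelize(1 to 1000000, 64)           // numSlices hint
    val wider    = fromSeq.repartition(128)                   // reshuffle after the fact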

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher
it > to every node (assuming B is relatively small to fit)? > > > > On Tue, Aug 4, 2015 at 8:23 AM, Richard Marscher > wrote: > > Yes it does, in fact it's probably going to be one of the more expensive > > shuffles you could trigger. > > > > On Mon, Aug 3, 201
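
A sketch of the broadcast alternative when B is small enough to fit in memory (assumes existing RDDs `rddA` and `rddB` and a SparkContext `sc`): ship B to every node once instead of shuffling A against B.

    val smallB: Array[Int] = rddB.collect()  // pull the small side to the driver
    val bcB = sc.broadcast(smallB)           // ship it to each executor once

    val pairs = rddA.flatMap { a => bcB.value.map(b => (a, b)) }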

Re: How to increase parallelism of a Spark cluster?

2015-08-04 Thread Richard Marscher
> myRDD.repartition(10) >>> >>> .mapPartitions(keyValues => getResults(keyValues)) >>>> >>> >>> The mapPartitions does some initialization to the SolrJ client per >>> partition and then hits it for each record in the partition
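
The per-partition initialization pattern the thread describes, sketched with hypothetical `makeClient`/`lookup` stand-ins for the SolrJ calls:

    val results = myRDD.repartition(10).mapPartitions { keyValues =>
      val client = makeClient()               // built once per partition, not per record
      keyValues.map(kv => lookup(client, kv)) // reuse the client for every record
    }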

Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Richard Marscher
-- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/>

Re: user threads in executors

2015-07-21 Thread Richard Marscher
hreds in spark executor ? Is it a > good idea to create user threads in spark map task? > > Thanks > > -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/>

Re: Unit tests of spark application

2015-07-10 Thread Richard Marscher
Is there any guide or link which I can refer. > > Thank you very much. > > -Naveen > > -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/>
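
A minimal local-mode test fixture as a sketch (assumes ScalaTest; the suite and names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class WordCountSpec extends FunSuite with BeforeAndAfterAll {
      private var sc: SparkContext = _

      override def beforeAll(): Unit =
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))

      override def afterAll(): Unit = sc.stop()  // always release the context

      test("counts words") {
        val counts = sc.parallelize(Seq("a b", "a")).flatMap(_.split(" "))
          .map((_, 1)).reduceByKey(_ + _).collectAsMap()
        assert(counts("a") == 2)
      }
    }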

Re: Apache Spark : Custom function for reduceByKey - missing arguments for method

2015-07-10 Thread Richard Marscher
p://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Custom-function-for-reduceByKey-missing-arguments-for-method-tp23756.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/>
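
The usual cause of "missing arguments for method" is passing a method where a function is expected; a sketch of eta-expanding it (assumes an RDD[(String, Int)] named `rdd`):

    def combine(a: Int, b: Int): Int = a + b

    rdd.reduceByKey(combine _)      // explicit eta-expansion to a function value
    rdd.reduceByKey(combine(_, _))  // equivalent placeholder form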

Re: Spark serialization in closure

2015-07-09 Thread Richard Marscher
ct not serializable (class: >>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$, value: >>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$@10b7e824) >>> > - field (class: >>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1, >>> &g

Re: Create RDD from output of unix command

2015-07-08 Thread Richard Marscher
User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/>

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Richard Marscher
te: > My singletons do in fact stick around. They're one per worker, looks > like. So with 4 workers running on the box, we're creating one singleton > per worker process/jvm, which seems OK. > > Still curious about foreachPartition vs. foreachRDD though... > >

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-07 Thread Richard Marscher
Would it be possible to have a wrapper class that just represents a reference to a singleton holding the 3rd party object? It could proxy over calls to the singleton object which will instantiate a private instance of the 3rd party object lazily? I think something like this might work if the worker
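
A sketch of that lazy-singleton wrapper; `ThirdPartyClient` and `process` are stand-ins for the non-serializable 3rd party object:

    object ClientHolder {
      // Instantiated lazily, once per executor JVM, never shipped in a closure.
      lazy val client: ThirdPartyClient = new ThirdPartyClient()
    }

    rdd.mapPartitions { rows =>
      rows.map(row => ClientHolder.client.process(row))
    }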

Re: How to create empty RDD

2015-07-06 Thread Richard Marscher
This should work val output: RDD[(DetailInputRecord, VISummary)] = sc.parallelize(Seq.empty[(DetailInputRecord, VISummary)]) On Mon, Jul 6, 2015 at 5:11 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I need to return an empty RDD of type > > val output: RDD[(DetailInputRecord, VISummary)] > > > > This does not wor

Re: Spark driver hangs on start of job

2015-07-02 Thread Richard Marscher
. On Thu, Jul 2, 2015 at 4:37 AM, Sjoerd Mulder wrote: > Hi Richard, > > I have actually applied the following fix to our 1.4.0 version and this > seem to resolve the zombies :) > > https://github.com/apache/spark/pull/7077/files > > Sjoerd > > 2015-06-26 20:08 GMT+0

Re: Applying functions over certain count of tuples .

2015-06-29 Thread Richard Marscher
Hi, not sure what the context is but I think you can do something similar with mapPartitions: rdd.mapPartitions { iterator => iterator.grouped(5).map { tupleGroup => emitOneRddForGroup(tupleGroup) } } The edge case is when the final grouping doesn't have exactly 5 items, if that matters. On

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
Regards, Richard On Fri, Jun 26, 2015 at 12:08 PM, Sjoerd Mulder wrote: > Hi Richard, > > I would like to see how we can get a workaround to get out of the Zombie > situation since were planning for production :) > > If you could share the workaround or point directions that woul

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
We've seen this issue as well in production. We also aren't sure what causes it, but have just recently shaded some of the Spark code in TaskSchedulerImpl that we use to effectively bubble up an exception from Spark instead of zombie in this situation. If you are interested I can go into more detai

Re: How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Richard Marscher
There should be no difference, assuming you don't use the intermediate rdd values you are storing (rdd1, rdd2) for anything else. In the first example it is still creating these intermediate rdd objects; you are just using them implicitly and not storing the value. It's also worth pointing

Re: Limitations using SparkContext

2015-06-23 Thread Richard Marscher
Hi, can you detail the symptom further? Was it that only 12 requests were serviced and the other 440 timed out? I don't think that Spark is well suited for this kind of workload, or at least the way it is being represented. How long does a single request take Spark to complete? Even with fair sch

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread Richard Marscher
Is spoutLog just a non-spark file writer? If you run that in the map call on a cluster it's going to be writing in the filesystem of the executor it's being run on. I'm not sure if that's what you intended. On Mon, Jun 22, 2015 at 1:35 PM, anshu shukla wrote: > Running perfectly in local system bu

Re: Executor memory allocations

2015-06-18 Thread Richard Marscher
It would be the "40%", although it's probably better to think of it as shuffle vs. data cache, with the remainder going to tasks. The comments for the shuffle memory fraction configuration clarify that it will take memory at the expense of the storage/data cache fraction: spark.shuffle.memory
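
A sketch of rebalancing those 1.x-era fractions (the 0.4/0.4 split is illustrative; defaults were 0.2 for shuffle and 0.6 for storage):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.4")  // grow shuffle headroom...
      .set("spark.storage.memoryFraction", "0.4")  // ...at the cache's expense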

Re: append file on hdfs

2015-06-10 Thread Richard Marscher
Hi, if you now want to write 1 file per partition, that's actually built into Spark as *saveAsTextFile*(*path*): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toSt

Re: spark eventLog and history server

2015-06-09 Thread Richard Marscher
Hi, I don't have a complete answer to your questions but: "Removing the suffix does not solve the problem" -> unfortunately this is true, the master web UI only tries to build out a Spark UI from the event logs once, at the time the context is closed. If the event logs are in-progress at this tim

FileOutputCommitter deadlock 1.3.1

2015-06-08 Thread Richard Marscher
Hi, we've been seeing occasional issues in production with the FileOutCommitter reaching a deadlock situation. We are writing our data to S3 and currently have speculation enabled. What we see is that Spark gets a file not found error trying to access a temporary part file that it wrote (part-#2

Re: Deduping events using Spark

2015-06-04 Thread Richard Marscher
I think if you create a bidirectional mapping from AnalyticsEvent to another type that wraps it and uses the nonce for equality, you could then do something like reduceByKey to group by nonce and map back to AnalyticsEvent after. On Thu, Jun 4, 2015 at 1:10 PM, lbierman wrote: > I'm still
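
A sketch of that nonce-keyed dedup; an `AnalyticsEvent` with a `nonce` field is assumed from the thread, the rest is illustrative:

    val deduped = events
      .map(e => (e.nonce, e))      // key by the dedup nonce
      .reduceByKey((a, _) => a)    // keep one event per nonce
      .values                      // back to plain AnalyticsEvents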

Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Richard Marscher
It is possible to start multiple concurrent drivers, Spark dynamically allocates ports per "spark application" on driver, master, and workers from a port range. When you collect results back to the driver, they do not go through the master. The master is mostly there as a coordinator between the dr

Re: Spark Client

2015-06-03 Thread Richard Marscher
I think the short answer to the question is, no, there is no alternate API that will not use the System.exit calls. You can craft a workaround like is being suggested in this thread. For comparison, we are doing programmatic submission of applications in a long-running client application. To get ar
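
One hedged sketch of such a workaround (an assumption, not necessarily what the thread used): install a SecurityManager that turns System.exit into a catchable exception. It is a blunt, JVM-wide instrument.

    case class ExitTrapped(code: Int) extends RuntimeException

    System.setSecurityManager(new SecurityManager {
      override def checkExit(code: Int): Unit = throw ExitTrapped(code)
      // Permit everything else so the rest of the JVM behaves normally.
      override def checkPermission(perm: java.security.Permission): Unit = ()
    })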

Re: Application is "always" in process when I check out logs of completed application

2015-06-03 Thread Richard Marscher
I had the same issue a couple days ago. It's a bug in 1.3.0 that is fixed in 1.3.1 and up. https://issues.apache.org/jira/browse/SPARK-6036 The event logs are only moved from in-progress to complete after the Master tries to build the history page for the logs. The onl

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Richard Marscher
Are you sure it's memory related? What is the disk utilization and IO performance on the workers? The error you posted looks to be related to shuffle trying to obtain block data from another worker node and failing to do so in a reasonable amount of time. It may still be memory related, but I'm not s

Re: Event Logging to HDFS on Standalone Cluster "In Progress"

2015-06-01 Thread Richard Marscher
Ah, apologies, I found an existing issue and fix has already gone out for this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036. On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher wrote: > It looks like it is possibly a race condition between removing the > IN_PROGRE

Re: Event Logging to HDFS on Standalone Cluster "In Progress"

2015-06-01 Thread Richard Marscher
osed before the dagScheduler? Thanks, Richard On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher wrote: > Hi, > > In Spark 1.3.0 I've enabled event logging to write to an existing HDFS > folder on a Standalone cluster. This is generally working, all the logs are > being written. H

Event Logging to HDFS on Standalone Cluster "In Progress"

2015-06-01 Thread Richard Marscher
Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from the Master Web UI, the vast majority of completed applications are labeled as not having a history: http://xxx.xxx.xxx

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Richard Marscher
mation that FAIR scheduling is > supported for Spark Streaming Apps > > > > *From:* Richard Marscher [mailto:rmarsc...@localytics.com] > *Sent:* Friday, May 15, 2015 7:20 PM > *To:* Evo Eftimov > *Cc:* Tathagata Das; user > *Subject:* Re: Spark Fair Scheduler for Spark St

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Richard Marscher
The doc is a bit confusing IMO, but at least for my application I had to use a fair pool configuration to get my stages to be scheduled with FAIR. On Fri, May 15, 2015 at 2:13 PM, Evo Eftimov wrote: > No pools for the moment – for each of the apps using the straightforward > way with the spark c
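
A sketch of the Scala side of that setup; a fairscheduler.xml defining the pool (the pool name and file path here are illustrative) is also needed:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the named pool.
    sc.setLocalProperty("spark.scheduler.pool", "streaming")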

Re: Looking inside the 'mapPartitions' transformation, some confused observations

2015-05-11 Thread Richard Marscher
I believe the issue in b and c is that you call iter.size, which actually consumes the iterator, so the subsequent attempt to put it into a vector will yield 0 items. You could use an ArrayBuilder for example and not need to rely on knowing the size of the iterator. On Mon, May 11, 2015 at
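
A sketch of buffering a partition without touching iter.size (the element type is illustrative; assumes an RDD[Int] named `rdd`):

    rdd.mapPartitions { iter =>
      val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
      iter.foreach(buf += _)   // single pass; calling iter.size first would empty it
      Iterator(buf.toVector)   // buf.size is now safe to read as well
    }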

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Richard Marscher
I should also add I've recently seen this issue as well when using collect. I believe in my case it was related to heap space on the driver program not being able to handle the returned collection. On Thu, May 7, 2015 at 11:05 AM, Richard Marscher wrote: > By default you would expect

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Richard Marscher
By default you would expect to find the logs files for master and workers in the relative `logs` directory from the root of the Spark installation on each of the respective nodes in the cluster. On Thu, May 7, 2015 at 10:27 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > Ø C

Re: Spark Job triggers second attempt

2015-05-07 Thread Richard Marscher
Hi, I think you may want to use this setting?: spark.task.maxFailures (default: 4): Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1. On Thu, May 7, 2015 at 2:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > How i can stop Spark to

Re: Maximum Core Utilization

2015-05-05 Thread Richard Marscher
Hi, do you have information on how many partitions/tasks the stage/job is running? By default there is 1 core per task, and your number of concurrent tasks may be limiting core utilization. There are a few settings you could play with, assuming your issue is related to the above: spark.default.pa

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Richard Marscher
In regards to the large GC pauses, assuming you allocated all 100GB of memory per worker you may consider running with less memory on your Worker nodes, or splitting up the available memory on the Worker nodes amongst several worker instances. The JVM's garbage collection starts to become very slow

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Richard Marscher
I'm not sure, but I wonder if, because you are using the Spark REPL, it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you set up your same set of comma

Re: Scalability of group by

2015-04-28 Thread Richard Marscher
Hi, I can offer a few ideas to investigate in regards to your issue here. I've run into resource issues doing shuffle operations with a much smaller dataset than 2B. The data is going to be saved to disk by the BlockManager as part of the shuffle and then redistributed across the cluster as releva

Re: Group by order by

2015-04-27 Thread Richard Marscher
exander wrote: > Hi Richard, > > > > There are several values of time per id. Is there a way to perform group > by id and sort by time in Spark? > > > > Best regards, Alexander > > > > *From:* Richard Marscher [mailto:rmarsc...@localytics.com] > *Se

Re: Group by order by

2015-04-27 Thread Richard Marscher
Hi, that error seems to indicate the basic query is not properly expressed. If you group by just ID, then that means it would need to aggregate all the time values into one value per ID, so you can't sort by it. Thus it tries to suggest an aggregate function for time so you can have 1 value per ID
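
A sketch of "group by id, then sort by time within each group" on the RDD API (the `id`/`time` field names are illustrative; assumes an RDD of such records named `records`):

    val grouped = records
      .map(r => (r.id, r))                  // key by id
      .groupByKey()                         // one group per id
      .mapValues(_.toSeq.sortBy(_.time))    // sort each group's rows by time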

Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Richard Marscher
Could you possibly describe what you are trying to learn how to do in more detail? Some basics of submitting programmatically: - Create a SparkContext instance and use that to build your RDDs - You can only have 1 SparkContext per JVM you are running, so if you need to satisfy concurrent job reque
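
A minimal sketch of those points (the master URL and job body are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("programmatic-job")
    val sc = new SparkContext(conf)  // only one SparkContext per JVM
    try {
      val result = sc.parallelize(1 to 100).sum()
      println(result)
    } finally sc.stop()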

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Richard Marscher
If it fails with sbt-assembly but not without it, then there's always the likelihood of a classpath issue. What dependencies are you rolling up into your assembly jar? On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa wrote: > Any ideas ? > > On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Richard Marscher
Hi, I've gotten an application working with sbt-assembly and spark, thought I'd present an option. In my experience, trying to bundle any of the Spark libraries in your uber jar is going to be a major pain. There will be a lot of deduplication to work through and even if you resolve them it can be
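
A sketch of the usual build.sbt fix (version numbers illustrative): mark the Spark artifacts "provided" so sbt-assembly leaves them out of the uber jar.

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    )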