Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread mark
ark though and it might need some more manual tweaks. Thanks Shivaram   On Fri, Jun 26, 2015 at 9:59 AM, Mark Stephenson wrote: Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? I noticed from your browser that it appeared you were using EC2 at that time, so

Saving RDDs in Tachyon

2015-10-22 Thread mark
I have Avro records stored in Parquet files in HDFS. I want to read these out as an RDD and save that RDD in Tachyon for any spark job that wants the data. How do I save the RDD in Tachyon? What format do I use? Which RDD 'saveAs...' method do I want? Thanks
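
A minimal sketch of the two options usually suggested in that era, assuming a Tachyon master at a hypothetical tachyon://tachyon-master:19998 URI:

    import org.apache.spark.storage.StorageLevel

    // Option 1: write the RDD out through the Hadoop-compatible Tachyon
    // filesystem so any later job can read it back (path is hypothetical).
    rdd.saveAsObjectFile("tachyon://tachyon-master:19998/shared/avro-records")

    // Option 2: cache it off-heap for jobs sharing this SparkContext.
    rdd.persist(StorageLevel.OFF_HEAP)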

How to programmatically create, submit and report on Spark jobs?

2015-08-10 Thread mark
Hi All I need to be able to create, submit and report on Spark jobs programmatically in response to events arriving on a Kafka bus. I also need end-users to be able to create data queries that launch Spark jobs 'behind the scenes'. I would expect to use the same API for both, and be able to provi
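
One API available in that timeframe for submitting applications from code is SparkLauncher (Spark 1.4+); a sketch, with hypothetical jar, class, and master values:

    import org.apache.spark.launcher.SparkLauncher

    // Launches the application as a child process.
    val proc = new SparkLauncher()
      .setAppResource("/path/to/query-job.jar")   // hypothetical jar
      .setMainClass("com.example.QueryJob")       // hypothetical class
      .setMaster("spark://master:7077")
      .setConf("spark.executor.memory", "2g")
      .launch()
    val exitCode = proc.waitFor()                 // block until the driver exits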

Using unserializable classes in tasks

2015-08-14 Thread mark
I have a Spark job that computes some values and needs to write those values to a data store. The classes that write to the data store are not serializable (eg, Cassandra session objects etc). I don't want to collect all the results at the driver, I want each worker to write the data - what is the
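
The usual pattern is to construct the client inside foreachPartition so it is created on each executor rather than serialized from the driver; a sketch in which DataStoreClient is a hypothetical stand-in for e.g. a Cassandra session:

    results.foreachPartition { records =>
      val client = new DataStoreClient()      // built on the executor, never shipped
      records.foreach(r => client.write(r))
      client.close()
    }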

Partitions with zero records & variable task times

2015-09-04 Thread mark
I am trying to tune a Spark job and have noticed some strange behavior - tasks in a stage vary in execution time, ranging from 2 seconds to 20 seconds. I assume tasks should all run in roughly the same amount of time in a well tuned job. So I did some investigation - the fast tasks appear to have
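
A small diagnostic sketch for confirming the skew by counting records per partition:

    val sizes = rdd.mapPartitionsWithIndex { (idx, records) =>
      Iterator((idx, records.size))           // one (partition, count) pair each
    }.collect()
    sizes.filter(_._2 == 0).foreach { case (idx, _) =>
      println(s"partition $idx has zero records")
    }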

Re: Partitions with zero records & variable task times

2015-09-08 Thread mark
m partitioner for the keys so that they will get evenly > distributed across tasks > > Thanks > Best Regards > > On Fri, Sep 4, 2015 at 7:19 PM, mark wrote: > >> I am trying to tune a Spark job and have noticed some strange behavior - >> tasks in a stage vary i
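
For reference, the shape of a custom Partitioner (the hashing here is only illustrative; a real fix would key on whatever spreads this data evenly):

    import org.apache.spark.Partitioner

    class EvenPartitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = {
        val h = key.hashCode % numPartitions
        if (h < 0) h + numPartitions else h   // keep the index non-negative
      }
    }

    val repartitioned = pairRdd.partitionBy(new EvenPartitioner(64))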

Re: Spark summit Asia

2015-09-09 Thread mark
http://www.stratahadoopworld.com/singapore/index.html On 8 Sep 2015 8:35 am, "Kevin Jung" wrote: > Is there any plan to hold Spark summit in Asia? > I'm very much looking forward to it. > > Kevin > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spar

Re: Partitions with zero records & variable task times

2015-09-09 Thread mark
is post here has a bit information > http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/ > > Thanks > Best Regards > > On Wed, Sep 9, 2015 at 6:44 AM, mark wrote: > >> As I understand things (maybe naively

Spark DataFrame Creation

2020-07-22 Thread Mark Bidewell
I'm confused by the behavior. My understanding was that load() was lazily executed on the Spark worker. Why would some elements be executing on the driver? Thanks for your help -- Mark Bidewell http://www.linkedin.com/in/markbidewell

dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mark Elliot
Hello, I noticed that the apache/spark-py image for Spark's 3.4.1 release is not available (apache/spark@3.4.1 is available). Would it be possible to get the 3.4.1 release build for the apache/spark-py image published? Thanks, Mark -- This communication, together wit

Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Mark Hamstra
This discussion belongs on the dev list. Please post any replies there. On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park wrote: > Hi, > > I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to > confirm whether these are bugs or not before opening a jira. > > *1)* I can no longer

Re: Spark error "value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]"

2015-06-08 Thread Mark Hamstra
Correct; and PairRDDFunctions#join does still exist in versions of Spark that do have DataFrame, so you don't necessarily have to use DataFrame to do this even then (although there are advantages to using the DataFrame approach.) Your basic problem is that you have an RDD of tuples, where each tup
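
A sketch of that reshaping, with `other` a hypothetical second pair RDD to join against:

    // rdd: RDD[((String, String), String, String)] is a 3-tuple, so the
    // implicit PairRDDFunctions (and thus join) do not apply. Re-pair it:
    val pairs = rdd.map { case (key, c, d) => (key, (c, d)) }
    val joined = pairs.join(other)            // join now resolves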

Re: RDD of RDDs

2015-06-09 Thread Mark Hamstra
That would constitute a major change in Spark's architecture. It's not happening anytime soon. On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar wrote: > Possibly in future, if and when spark architecture allows workers to > launch spark jobs (the functions passed to transformation or action APIs o

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
Correct. Trading away scalability for increased performance is not an option for the standard Spark API. On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > It would be even faster to load the data on the driver and sort it there > without using Spark :).

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
t; Collect() would involve gathering all the data on a single machine as well. > > Thanks, > Raghav > > On Tuesday, June 9, 2015, Mark Hamstra wrote: > >> Correct. Trading away scalability for increased performance is not an >> option for the standard Spark API. >

ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
wondering if there's a more efficient way. Also posted on SO: http://stackoverflow.com/q/30785615/2687324 Thanks, Mark

RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
Makes sense – I suspect what you suggested should work. However, I think the overhead between this and using `String` would be similar enough to warrant just using `String`. Mark From: Sonal Goyal [mailto:sonalgoy...@gmail.com] Sent: June-11-15 12:58 PM To: Mark Tse Cc: user@spark.apache.org
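
One commonly suggested alternative to converting the keys to String is giving them value equality directly, e.g. as immutable Seq[Byte]; a sketch with a hypothetical RDD[(Array[Byte], Long)] input:

    // Array[Byte] uses reference equality/hashCode, so identical byte
    // sequences would land in separate groups; Seq[Byte] compares by value.
    val reduced = raw.map { case (k, v) => (k.toSeq, v) }.reduceByKey(_ + _)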

Re: Fully in-memory shuffles

2015-06-11 Thread Mark Hamstra
> > I would guess in such shuffles the bottleneck is serializing the data > rather than raw IO, so I'm not sure explicitly buffering the data in the > JVM process would yield a large improvement. Good guess! It is very hard to beat the performance of retrieving shuffle outputs from the OS buffe

Unit Testing Spark Transformations/Actions

2015-06-16 Thread Mark Tse
q/30871109/2687324). Has anyone experienced this error before while unit testing? Thanks, Mark

RE: Intermediate stage will be cached automatically ?

2015-06-17 Thread Mark Tse
I think https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence might shed some light on the behaviour you’re seeing. Mark From: canan chen [mailto:ccn...@gmail.com] Sent: June-17-15 5:57 AM To: spark users Subject: Intermediate stage will be cached automatically ? Here

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Mark Stephenson
hed and utilized SparkR 1.4.0 in this way, with > RStudio Server running on top of the master instance? Are we on the right > track, or should we manually launch a cluster and attempt to connect to it > from another instance running R? > > Thank you in advance! > > Mark

Re: Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Mark Hamstra
Do you want to transform the RDD, or just produce some side effect with its contents? If the latter, you want foreachPartition, not mapPartitions. On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > In rdd.mapPartition(…) if I try to iterate through
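
The distinction in a short sketch:

    // Transformation: must return an Iterator; evaluated lazily.
    val lengths = rdd.mapPartitions(records => records.map(_.length))

    // Side effect only: returns Unit and runs eagerly as an action.
    rdd.foreachPartition(records => records.foreach(println))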

Re: Spark shell crumbles after memory is full

2015-06-29 Thread Mark Hamstra
No. He is collecting the results of the SQL query, not the whole dataset. The REPL does retain references to prior results, so it's not really the best tool to be using when you want no-longer-needed results to be automatically garbage collected. On Mon, Jun 29, 2015 at 9:13 AM, ayan guha wrote:

Re: Fair scheduler pool details

2016-03-02 Thread Mark Hamstra
PM, Eugene Morozov wrote: > Mark, > > I'm trying to configure spark cluster to share resources between two pools. > > I can do that by assigning minimal shares (it works fine), but that means > specific amount of cores is going to be wasted by just being ready to run > anyth

Re: Understanding the Web_UI 4040

2016-03-07 Thread Mark Hamstra
There's probably nothing wrong other than a glitch in the reporting of Executor state transitions to the UI -- one of those low-priority items I've been meaning to look at for a while. On Mon, Mar 7, 2016 at 12:15 AM, Sonal Goyal wrote: > Maybe check the worker logs to see what's going wrong w

Re: Spark on RAID

2016-03-08 Thread Mark Hamstra
One issue is that RAID levels providing data replication are not necessary since HDFS already replicates blocks on multiple nodes. On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov wrote: > Parallel disk IO? But the effect should be less noticeable compared to > Hadoop which reads/writes a lot. Much

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
> > For example, if you're looking to scale out to 1000 concurrent requests, > this is 1000 concurrent Spark jobs. This would require a cluster with 1000 > cores. This doesn't make sense. A Spark Job is a driver/DAGScheduler concept without any 1:1 correspondence between Worker cores and Jobs.

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
path. On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly wrote: > you are correct, mark. i misspoke. apologies for the confusion. > > so the problem is even worse given that a typical job requires multiple > tasks/cores. > > i have yet to see this particular architecture work in prod

Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
It's not just if the RDD is explicitly cached, but also if the map outputs for stages have been materialized into shuffle files and are still accessible through the map output tracker. Because of that, explicitly caching RDD actions often gains you little or nothing, since even without a call to c

Re: Spark UI Completed Jobs

2016-03-15 Thread Mark Hamstra
g may not be 100% accurate and bug free. On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph wrote: > Okay, so out of 164 stages, 163 are skipped. And how are 41405 tasks > skipped if the total is only 19788? > > On Wed, Mar 16, 2016 at 6:31 AM, Mark Hamstra > wrote: > >>

Re: Error using collectAsMap() in scala

2016-03-20 Thread Mark Hamstra
You're not getting what Ted is telling you. Your `dict` is an RDD[String] -- i.e. it is a collection of a single value type, String. But `collectAsMap` is only defined for PairRDDs that have key-value pairs for their data elements. Both a key and a value are needed to collect into a Map[K, V].
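
A sketch of deriving a pair first (the key and value chosen here are placeholders; they depend on what the Map is for):

    // dict: RDD[String] has no key/value structure, so pair it up first.
    val asMap: Map[String, Int] =
      dict.map(s => (s, s.length)).collectAsMap().toMap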

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
ipe for how the computation should be done when it is needed. Neither does the "called before any job" comment pose any restriction in this case since no jobs have yet been executed on the RDD. On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu wrote: > See the doc for checkpoint: > >* M

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Mark Hamstra
aveXXX action is the only action being performed on the RDD, the rest of the chain being purely transformations, then checkpointing instead of saving still wouldn't execute any action on the RDD -- it would just mark the point at which checkpointing should be done when an action is eventually run.

Re: No active SparkContext

2016-03-24 Thread Mark Hamstra
You seem to be confusing the concepts of Job and Application. A Spark Application has a SparkContext. A Spark Application is capable of running multiple Jobs, each with its own ID, visible in the webUI. On Thu, Mar 24, 2016 at 6:11 AM, Max Schmidt wrote: > On 24.03.2016 at 10:34, Simon

Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
Yes and no. The idea of n-tier architecture is about 20 years older than Spark and doesn't really apply to Spark as n-tier was original conceived. If the n-tier model helps you make sense of some things related to Spark, then use it; but don't get hung up on trying to force a Spark architecture in

Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
ue, Mar 29, 2016 at 5:44 PM, Mich Talebzadeh wrote: > Hi Mark, > > I beg I agree to differ on the interpretation of N-tier architecture. > Agreed that 3-tier and by extrapolation N-tier have been around since days > of client-server architecture. However, they are as valid today as

Re: Executor shutdown hooks?

2016-04-06 Thread Mark Hamstra
Why would the Executors shutdown when the Job is terminated? Executors are bound to Applications, not Jobs. Furthermore, unless spark.job.interruptOnCancel is set to true, canceling the Job at the Application and DAGScheduler level won't actually interrupt the Tasks running on the Executors. If

Re: Spark GUI, Workers and Executors

2016-04-09 Thread Mark Hamstra
https://spark.apache.org/docs/latest/cluster-overview.html On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar wrote: > On Spark GUI I can see the list of Workers. > > I always understood that workers are used by executors. > > What is the relationship between workers and executors please. Is it one >

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mark Hamstra
That's also available in standalone. On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov wrote: > Spark on Yarn supports dynamic resource allocation > > So, you can run several spark-shells / spark-submits / spark-jobserver / > zeppelin on one cluster without defining upfront how many executor

Re: Apache Flink

2016-04-17 Thread Mark Hamstra
To be fair, the Stratosphere project from which Flink springs was started as a collaborative university research project in Germany about the same time that Spark was first released as Open Source, so they are near contemporaries rather than Flink having been started only well after Spark was an es

EMR Spark Custom Metrics

2016-04-21 Thread Mark Kelly
Hi, So I would like some custom metrics. The environment we use is AWS EMR 4.5.0 with Spark 1.6.1 and Ganglia. The code snippet below shows how we register custom metrics (this worked in EMR 4.2.0 with Spark 1.5.2) package org.apache.spark.metrics.source import com.codahale.metrics._ import o

Re: DeepSpark: where to start

2016-05-05 Thread Mark Vervuurt
Well, you got me fooled as well ;) Had it on my todo list to dive into this new component... Mark > On 5 May 2016 at 07:06, Derek Chan wrote: > > The blog post is an April Fool's joke. Read the last line in the post: > > https://databricks.com/blog/

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
You appear to be misunderstanding the nature of a Stage. Individual transformation steps such as `map` do not define the boundaries of Stages. Rather, a sequence of transformations in which there is only a NarrowDependency between each of the transformations will be pipelined into a single Stage.
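
A sketch of where the boundary actually falls:

    // Narrow dependencies pipeline into a single stage:
    val pipelined = rdd.map(_ * 2).filter(_ > 0).map(_.toString)

    // A shuffle dependency (here, reduceByKey) is what starts a new stage:
    val counts = pipelined.map(s => (s.length, 1)).reduceByKey(_ + _)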

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
irav Patel wrote: > Hi Mark, > > I might have said stage instead of step in my last statement "UI just > says Collect failed but in fact it could be any stage in that lazy chain of > evaluation." > > Anyways even you agree that this visibility of underlaying ste

Re: What is the interpretation of Cores in Spark doc

2016-06-13 Thread Mark Hamstra
I don't know what documentation you were referring to, but this is clearly an erroneous statement: "Threads are virtual cores." At best it is terminology abuse by a hardware manufacturer. Regardless, Spark can't get too concerned about how any particular hardware vendor wants to refer to the spec

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
d >> applications. >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABU

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
‘logical’ processors vs. cores and POSIX threaded >> applications. >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
kedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 16 June 2016 at 19:07, Mark Hamstra wrot

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Hi Mich, sorry to bother you, but did you manage to solve your problem? We have a similar problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an Oracle database. Thanks, Mark > On 12 Feb 2016, at 11:45, Mich Talebzadeh <mailto:m...@peridale.co.uk>> wrote: > > H

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Thanks Mich, we have got it working using the example below ;) Mark > On 11 Jul 2016, at 09:45, Mich Talebzadeh wrote: > > Hi Mark, > > Hm. It should work. This is Spark 1.6.1 on Oracle 12c > > > scala> val HiveContext = new org.apache.spark.sql.hive.Hi

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-15 Thread Mark Hamstra
Nothing has changed in that regard, nor is there likely to be "progress", since more sophisticated or capable resource scheduling at the Application level is really beyond the design goals for standalone mode. If you want more in the way of multi-Application resource scheduling, then you should be

Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Mark Hamstra
Don't use Spark 2.0.0-preview. That was a preview release with known issues, and was intended to be used only for early, pre-release testing purposes. Spark 2.0.0 is now released, and you should be using that. On Thu, Jul 28, 2016 at 3:48 AM, Carlo.Allocca wrote: > and, of course I am using > >

Re: how to order data in descending order in spark dataset

2016-07-30 Thread Mark Wusinich
> ts.groupBy("b").count().orderBy(col("count"), ascending=False) Sent from my iPhone > On Jul 30, 2016, at 2:54 PM, Don Drake wrote: > > Try: > > ts.groupBy("b").count().orderBy(col("count").desc()); > > -Don > >> On Sat, Jul 30, 2016 at 1:30 PM, Tony Lane wrote: >> just to clarify I am try

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
What are you expecting to find? There currently are no releases beyond Spark 2.0.0. On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote: > If we want to use versions of Spark beyond the official 2.0.0 release, > specifically on Maven + Java, what steps should we take to upgrade? I can't > find the

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
imer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damag

Re: RE: Spark assembly in Maven repo?

2015-12-10 Thread Mark Hamstra
No, publishing a spark assembly jar is not fine. See the doc attached to https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a likely goal of Spark 2.0 will be the elimination of assemblies. On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com wrote: > Using maven to download the

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Mark Hamstra
It can be used, and is used in user code, but it isn't always as straightforward as you might think. This is mostly because a Job often isn't a Job -- or rather it is more than one Job. There are several RDD transformations that aren't lazy, so they end up launching "hidden" Jobs that you may not

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-17 Thread Mark Hamstra
ing cancellation, or more extensive reuse functionality as in https://issues.apache.org/jira/browse/SPARK-11838 If you don't want to spend a lot of time looking at Job cancellation issues, best to back away now! :) On Wed, Dec 16, 2015 at 4:26 PM, Jacek Laskowski wrote: > Thanks Mark f

Spark Path Wildcards Question

2015-12-17 Thread Mark Vervuurt
files with “sparkContext.sequenceFile('hdfs:///data/*/*/*/*.seq')” correct? Looking at our results it seems to be working fine and as described above. Thanks, Mark

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
I don't understand. If you're using fair scheduling and don't set a pool, the default pool will be used. On Tue, Jan 5, 2016 at 1:57 AM, Jeff Zhang wrote: > > It seems currently spark.scheduler.pool must be set as localProperties > (associate with thread). Any reason why spark.scheduler.pool ca

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
ant is the default pool is fair > scheduling. But seems if I want to use fair scheduling now, I have to set > spark.scheduler.pool explicitly. > > On Wed, Jan 6, 2016 at 2:03 AM, Mark Hamstra > wrote: > >> I don't understand. If you're using fair scheduling an

Re: Can spark.scheduler.pool be applied globally ?

2016-01-05 Thread Mark Hamstra
I can override the root pool in configuration file, Thanks Mark. > > On Wed, Jan 6, 2016 at 8:45 AM, Mark Hamstra > wrote: > >> Just configure with >> FAIR in fairscheduler.xml (or >> in spark.scheduler.allocation.file if you have over-riden the default name >>
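
Pulling the thread together, a sketch of the configuration under discussion (the file path and pool name are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Route work submitted from this thread into a named pool;
    // threads that set nothing fall back to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")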

Re: spark 1.6 Issue

2016-01-06 Thread Mark Hamstra
It's not a bug, but a larger heap is required with the new UnifiedMemoryManager: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172 On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > Hi All

Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
Same SparkContext means same pool of Workers. It's up to the Scheduler, not the SparkContext, whether the exact same Workers or Executors will be used to calculate simultaneous actions against the same RDD. It is likely that many of the same Workers and Executors will be used as the Scheduler tri

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
-dev What do you mean by JobContext? That is a Hadoop mapreduce concept, not Spark. On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou wrote: > Dear all, > > Is there a way to reuse executor JVM across different JobContexts? Thanks. > > Best Regards, > Jia >

Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
ice surprise to me > > On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra > wrote: > >> Same SparkContext means same pool of Workers. It's up to the Scheduler, >> not the SparkContext, whether the exact same Workers or Executors will be >> used to calculate simultaneou

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
a Zou wrote: > Hi, Mark, sorry, I mean SparkContext. > I mean to change Spark into running all submitted jobs (SparkContexts) in > one executor JVM. > > Best Regards, > Jia > > On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra > wrote: > >> -dev >> >&

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
, Jan 17, 2016 at 1:15 PM, Jia wrote: > Hi, Mark, sorry for the confusion. > > Let me clarify, when an application is submitted, the master will tell > each Spark worker to spawn an executor JVM process. All the task sets of > the application will be executed by the exec

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
Yes, that is one of the basic reasons to use a jobserver/shared-SparkContext. Otherwise, in order share the data in an RDD you have to use an external storage system, such as a distributed filesystem or Tachyon. On Sun, Jan 17, 2016 at 1:52 PM, Jia wrote: > Thanks, Mark. Then, I gu

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Mark Hamstra
What do you think is preventing you from optimizing your own RDD-level transformations and actions? AFAIK, nothing that has been added in Catalyst precludes you from doing that. The fact of the matter is, though, that there is less type and semantic information available to Spark from the raw RDD

Re: Spark 2.0.0 release plan

2016-01-29 Thread Mark Hamstra
https://github.com/apache/spark/pull/10608 On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote: > I'm not an authoritative source but I think it is indeed the plan to > move the default build to 2.11. > > See this discussion for more detail > > http://apache-spark-developers-list.1001551.n3.na

Can't view executor logs in web UI on Windows

2016-02-01 Thread Mark Pavey
I am running Spark on Windows. When I try to view the Executor logs in the UI I get the following error: HTTP ERROR 500 Problem accessing /logPage/. Reason: Server Error Caused by: java.net.URISyntaxException: Illegal character in path at index 1: .\work/app-20160129154716-0038/2/ a

RE: Can't view executor logs in web UI on Windows

2016-02-05 Thread Mark Pavey
We have created JIRA ticket https://issues.apache.org/jira/browse/SPARK-13142 and will submit a pull request next week. Mark -Original Message- From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: 01 February 2016 14:24 To: Mark Pavey Cc: user@spark.apache.org Subject: Re: Can't

RE: Can't view executor logs in web UI on Windows

2016-02-09 Thread Mark Pavey
I have submitted a pull request: https://github.com/apache/spark/pull/11135. Mark -Original Message- From: Mark Pavey [mailto:mark.pa...@thefilter.com] Sent: 05 February 2016 17:09 To: 'Ted Yu' Cc: user@spark.apache.org Subject: RE: Can't view executor logs in web UI

Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Mark Hamstra
[spark-shell startup banner: Spark version 2.0.0-SNAPSHOT] Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_72) Type in expressions to have them evaluated. Type :he

Re: Fair scheduler pool details

2016-02-20 Thread Mark Hamstra
It's 2 -- and it's pretty hard to point to a line of code, a method, or even a class since the scheduling of Tasks involves a pretty complex interaction of several Spark components -- mostly the DAGScheduler, TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as well as the Sche

Re: SQL queries in Spark / YARN

2015-09-28 Thread Mark Hamstra
Yes. On Mon, Sep 28, 2015 at 12:46 PM, Robert Grandl wrote: > Hi guys, > > I was wondering if it's possible to submit SQL queries to Spark SQL, when > Spark is running atop YARN instead of standalone mode. > > Thanks, > Robert >

Re: Worker node timeout exception

2015-10-01 Thread Mark Luk
Here is the log file from the worker node 15/09/30 23:49:37 INFO Worker: Executor app-20150930233113-/8 finished with state EXITED message Command exited with code 1 exitStatus \ 1 15/09/30 23:49:37 INFO Worker: Asked to launch executor app-20150930233113-/9 for PythonPi 15/09/30 23:49:37

Spark-Testing-Base Q/A

2015-10-21 Thread Mark Vervuurt
Hi Everyone, I am currently trying out ‘Spark-Testing-Base <https://github.com/holdenk/spark-testing-base>’. I have the following questions: Can you test Spark Streaming jobs using Java? Can I use Spark-Testing-Base 1.3.0_0.1.1 together with Spark 1.3.1? Thanks. Greetings, Mark

Re: Spark-Testing-Base Q/A

2015-10-21 Thread Mark Vervuurt
Hi Holden, Thanks for the information. I think a Java base class for testing Spark Streaming from Java would be useful for the community. Unfortunately, not all of our customers are willing to use Scala or Python. If I am not wrong it’s 4:00 AM for you in California ;) Regards, Mark

Apache Spark on Raspberry Pi Cluster with Docker

2015-10-28 Thread Mark Bonnekessel
.1", "submissionId" : "driver-20151028115443-0002", "success" : true } The driver is created correctly, but it never starts the application. What am i missing? Regards, Mark

Re: foreachPartition

2015-10-30 Thread Mark Hamstra
The closure is sent to and executed on an Executor, so you need to be looking at the stdout of the Executors, not on the Driver. On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I'm just trying to do some operation inside foreachPartition, but I can't > even

Re: Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Mark Hamstra
Hah! No, that is not a "starter" issue. It touches on some fairly deep Spark architecture, and there have already been a few attempts to resolve the issue -- none entirely satisfactory, but you should definitely search out the work that has already been done. On Mon, Nov 2, 2015 at 5:51 AM, Jace

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In that case, as a > workar
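
A sketch of the flat union (paths are hypothetical):

    val paths = Seq("hdfs:///data/a", "hdfs:///data/b", "hdfs:///data/c")
    // One union over the whole collection keeps the lineage shallow,
    // unlike chaining rdd1.union(rdd2).union(rdd3)...
    val combined = sc.union(paths.map(p => sc.textFile(p)))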

Re: Slow stage?

2015-11-11 Thread Mark Hamstra
Those are from the Application Web UI -- look for the "DAG Visualization" and "Event Timeline" elements on Job and Stage pages. On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky wrote: > Hi Simone, > I'm afraid I don't have an answer to your question. However I noticed the > DAG figures in the att

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Mark Hamstra
> > In the near future, I guess GUI interfaces of Spark will be available > soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all. > They can analyze their data by clicking a few buttons, instead of writing > the programs. : ) That's not in the future. :) On Mon, Nov 23, 201

Re: Spark on yarn vs spark standalone

2015-11-30 Thread Mark Hamstra
Standalone mode also supports running the driver on a cluster node. See "cluster" mode in http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications . Also, http://spark.apache.org/docs/latest/spark-standalone.html#high-availability On Mon, Nov 30, 2015 at 9:47 AM, Ja

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
> > It is not designed for interactive queries. You might want to ask the designers of Spark, Spark SQL, and particularly some things built on top of Spark (such as BlinkDB) about their intent with regard to interactive queries. Interactive queries are not the only designed use of Spark, but it

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
ncy it may > not be a good fit. > > M > > On Dec 1, 2015, at 7:23 PM, Andrés Ivaldi wrote: > > Ok, so latency problem is being generated because I'm using SQL as source? > how about csv, hive, or another source? > > On Tue, Dec 1, 2015 at 9:18 PM, Mark

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
some other solution. On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi wrote: > Ok, so latency problem is being generated because I'm using SQL as source? > how about csv, hive, or another source? > > On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra > wrote: > >> It is no

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Mark Hamstra
Where it could start to make some sense is if you wanted a single application to be able to work with more than one Spark cluster -- but that's a pretty weird or unusual thing to do, and I'm pretty sure it wouldn't work correctly at present. On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust wrote

Re: TCP/IP speedup

2015-08-01 Thread Mark Hamstra
https://spark-summit.org/2015/events/making-sense-of-spark-performance/ On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote: > Hi All! > > How important would be a significant performance improvement to TCP/IP > itself, in terms of > overall job performance improvement. Which part would be most

Re: Set Job Descriptions for Scala application

2015-08-05 Thread Mark Hamstra
SparkContext#setJobDescription or SparkContext#setJobGroup On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica wrote: > Hello, > > My Spark application is written in Scala and submitted to a Spark cluster > in standalone mode. The Spark Jobs for my application are listed in the > Spark UI like this:
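
A sketch of both calls (the strings are arbitrary labels that appear in the web UI):

    sc.setJobGroup("nightly-etl", "Nightly ETL over event logs")
    sc.setJobDescription("Count distinct users")
    rdd.count()   // this job shows up in the UI under the labels above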

What is the Effect of Serialization within Stages?

2015-08-12 Thread Mark Heimann
for piping the result of the first *map* operation as a parameter into the following *map* operation? Any ideas and feedback appreciated, thanks a lot. Best regards, Mark

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Mark Heimann
Thanks a lot guys, that's exactly what I hoped for :-). Cheers, Mark 2015-08-13 6:35 GMT+02:00 Hemant Bhanawat : > A chain of map and flatmap does not cause any > serialization-deserialization. > > > > On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann > wrote: >

[mllib] Random forest maxBins and confidence in training points

2015-08-18 Thread Mark Alen
Hi everyone,  I have two questions regarding the random forest implementation in mllib 1- maxBins: Say the value of a feature is between [0,100]. In my dataset there are a lot of data points between [0,10] and one datapoint at 100 and nothing between (10, 100). I am wondering how does the binning

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Mark Hamstra
There is the Async API ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala), which makes use of FutureAction ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala). You could als
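
A sketch of bounding a laggard action with the async API (the ten-minute budget is arbitrary):

    import scala.concurrent.Await
    import scala.concurrent.duration._

    // countAsync comes from AsyncRDDActions, available implicitly on RDDs.
    val future = rdd.countAsync()               // a FutureAction[Long]
    try {
      println(Await.result(future, 10.minutes))
    } catch {
      case _: java.util.concurrent.TimeoutException =>
        future.cancel()                         // cancels the running job
    }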

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Mark Hamstra
Where do you see a race in the DAGScheduler? On a quick look at your stack trace, this just looks to me like a Job where a Stage failed and then the DAGScheduler aborted the failed Job. On Thu, Sep 24, 2015 at 12:00 PM, robin_up wrote: > Hi > > After upgrade to 1.5, we found a possible racing c

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
One way you can start to make this make more sense, Sean, is if you exploit the code/data duality so that the non-distributed data that you are sending out from the driver is actually playing a role more like code (or at least parameters). What is sent from the driver to an Executor is then used (t
