Re: Want to avoid groupByKey as it's running forever

2015-06-30 Thread Daniel Siegmann
If the number of items is very large, have you considered using probabilistic counting? The HyperLogLogPlus class from stream-lib
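A hedged sketch of that approach, assuming the stream-lib dependency is available and a hypothetical RDD of key/value pairs as input:

    import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

    // Approximate the number of distinct values per key without shuffling every
    // value to one place, as groupByKey would.
    val approxDistinctPerKey = pairs                   // hypothetical RDD[(String, String)]
      .aggregateByKey(new HyperLogLogPlus(14))(
        (hll, value) => { hll.offer(value); hll },     // fold each value into the sketch
        (a, b)       => { a.addAll(b); a }             // merge partial sketches
      )
      .mapValues(_.cardinality())                      // approximate distinct count per key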

Re: Getting started with spark-scala development in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins: http://scala-ide.org/download/current.html Define your project in Maven with Scala plugins configured (you should be able to find documentation online) and import as an existing Maven project. The source code should be in src/ma

Re: Unit tests of spark application

2015-07-10 Thread Daniel Siegmann
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire wrote: > I want to write junit test cases in scala for testing spark application. > Is there any guide or link which I can refer. > https://spark.apache.org/docs/latest/programming-guide.html#unit-testing Typically I create test data using SparkCo
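A minimal sketch of that pattern with ScalaTest (the same idea works with JUnit; all names below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class WordCountSpec extends FunSuite with BeforeAndAfterAll {
      private var sc: SparkContext = _

      override def beforeAll(): Unit = {
        // Local master so the test needs no cluster
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterAll(): Unit = sc.stop()

      test("counts words") {
        val input  = sc.parallelize(Seq("a b", "a"))   // test data built in memory
        val counts = input.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        assert(counts.collect().toMap === Map("a" -> 2, "b" -> 1))
      }
    }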

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-02 Thread Daniel Siegmann
In the past I have seen this happen when I filled up HDFS and some core nodes became unhealthy. There was no longer anywhere to replicate the data. From your command it looks like you should have 1 master and 2 core nodes in your cluster. Can you verify both the core nodes are healthy? On Wed, Ma

Re: Serializing collections in Datasets

2016-03-03 Thread Daniel Siegmann
I have confirmed this is fixed in Spark 1.6.1 RC 1. Thanks. On Tue, Feb 23, 2016 at 1:32 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Yes, I will test once 1.6.1 RC1 is released. Thanks. > > On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust > wrote: > &g

Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
transparently? Any advice would be appreciated. ~Daniel Siegmann

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
e are you using to train the model? If you haven't > tried yet, you should consider the SparseVector > > > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector > > > On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann < > d
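For reference, constructing such a SparseVector (the sizes, indices, and values below are made up):

    import org.apache.spark.mllib.linalg.Vectors

    val numFeatures = 20000000                    // total size of the feature space
    val features = Vectors.sparse(
      numFeatures,
      Array(3, 1024, 19999999),                   // indices of the few active features
      Array(1.0, 0.5, 2.0))                       // their corresponding values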

Re: Spark ML - Scaling logistic regression for many features

2016-03-08 Thread Daniel Siegmann
g.apache.spark.mllib.classification.LogisticRegressionWithLBFGS >> >> Only downside is that you can't use the pipeline framework from spark ml. >> >> Cheers, >> Devin >> >> >> >> On Mon, Mar 7, 2016 at 4:54 PM, Daniel Siegmann < >> dan

Re: Spark ML - Scaling logistic regression for many features

2016-03-10 Thread Daniel Siegmann
for a 20 million > size dense weight vector (which should only be a few 100MB memory), so > perhaps something else is going on. > > Nick > > On Tue, 8 Mar 2016 at 22:55 Daniel Siegmann > wrote: > >> Just for the heck of it I tried the old MLlib implementation, but it had &

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote: > Would you mind letting us know the # training examples in the datasets? > Also, what do your features look like? Are they text, categorical etc? You > mention that most rows only have a few features, and all rows together have > a few 10,00

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
There are potential > solutions to these but they haven't been implemented as yet. > > On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann > wrote: > >> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath > > wrote: >> >>> Would you mind letting us know the # t

Re: cluster randomly re-starting jobs

2016-03-21 Thread Daniel Siegmann
e if there are multiple attempts. You can also see it in the Spark history server (under incomplete applications, if the second attempt is still running). ~Daniel Siegmann On Mon, Mar 21, 2016 at 9:58 AM, Ted Yu wrote: > Can you provide a bit more information ? > > Release of Spark an

[ML] Training with bias

2016-04-11 Thread Daniel Siegmann
at would just be part of the model. ~Daniel Siegmann

Re: [ML] Training with bias

2016-04-12 Thread Daniel Siegmann
: String = fitIntercept: whether to fit an intercept term (default: > true) > > On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann > wrote: > >> I'm trying to understand how I can add a bias when training in Spark. I >> have only a vague familiarity with this subject, so I

Re: Saving Parquet files to S3

2016-06-09 Thread Daniel Siegmann
I don't believe there's any way to output files of a specific size. What you can do is partition your data into a number of partitions such that the amount of data they each contain is around 1 GB. On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain wrote: > Hello Team, > > > > I want to write parquet fil
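A rough sketch of that partitioning idea (the total size must be known or estimated separately; the path is a placeholder):

    // Aim for roughly 1 GB per output file by choosing the partition count
    val totalSizeBytes   = 50L * 1024 * 1024 * 1024                      // assumed ~50 GB of data
    val targetPartitions = math.max(1, (totalSizeBytes / (1024L * 1024 * 1024)).toInt)

    df.repartition(targetPartitions)
      .write
      .parquet("s3a://some-bucket/output/")                              // hypothetical path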

Re: Apache design patterns

2016-06-09 Thread Daniel Siegmann
On Tue, Jun 7, 2016 at 11:43 PM, Francois Le Roux wrote: > 1. Should I use dataframes to ‘pull the source data? If so, do I do > a groupby and order by as part of the SQL query? > Seems reasonable. If you use Scala you might want to define a case class and convert the data frame to a datase
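For illustration, the case-class conversion might look like this (the case class and column names are invented):

    case class Reading(sensorId: String, ts: Long, value: Double)

    import sqlContext.implicits._                      // spark.implicits._ on Spark 2.x
    val cutoff   = 1465430400L                         // placeholder threshold
    val readings = df.as[Reading]                      // Dataset[Reading], assuming matching columns
    val recent   = readings.filter(_.ts > cutoff)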

Re: What are you using Spark for

2016-08-02 Thread Daniel Siegmann
Yes, you can use Spark for ETL, as well as feature engineering, training, and scoring. ~Daniel Siegmann On Tue, Aug 2, 2016 at 3:29 PM, Mich Talebzadeh wrote: > Hi, > > If I may say, if you spend sometime going through this mailing list in > this forum and see the variety of topic

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Daniel Siegmann
DataFrames are a higher level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary. DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to thinkin

Zip data frames

2015-12-29 Thread Daniel Siegmann
RDD has methods to zip with another RDD or with an index, but there's no equivalent for data frames. Anyone know a good way to do this? I thought I could just convert to RDD, do the zip, and then convert back, but ... 1. I don't see a way (outside developer API) to convert RDD[Row] directly
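Sketching the RDD round trip I was considering, with the schema rebuilt by hand for illustration:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Convert to RDD[Row], zip with an index, then rebuild a DataFrame with the extra column
    val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val indexedSchema = StructType(df.schema.fields :+
      StructField("index", LongType, nullable = false))
    val indexedDf = sqlContext.createDataFrame(indexedRdd, indexedSchema)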

Re: Too many tasks killed the scheduler

2016-01-12 Thread Daniel Siegmann
As I understand it, your initial number of partitions will always depend on the initial data. I'm not aware of any way to change this, other than changing the configuration of the underlying data store. Have you tried reading the data in several data frames (e.g. one data frame per day), coalescin

Re: Spark 2.0.0 release plan

2016-01-27 Thread Daniel Siegmann
Will there continue to be monthly releases on the 1.6.x branch during the additional time for bug fixes and such? On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers wrote: > thanks thats all i needed > > On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen wrote: > >> I think it will come significantly late

Re: data type transform when creating an RDD object

2016-02-17 Thread Daniel Siegmann
This should do it (for the implementation of your parse method, Google should easily provide information - SimpleDateFormat is probably what you want): def parseDate(s: String): java.sql.Date = { ... } val people = sc.textFile("examples/src/main/resources/people.txt") .map(_.spli
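One possible implementation of that parse method (the date format string is an assumption):

    import java.text.SimpleDateFormat

    def parseDate(s: String): java.sql.Date = {
      val fmt = new SimpleDateFormat("yyyy-MM-dd")     // adjust to match the input format
      new java.sql.Date(fmt.parse(s).getTime)
    }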

Re: Is this likely to cause any problems?

2016-02-19 Thread Daniel Siegmann
With EMR supporting Spark, I don't see much reason to use the spark-ec2 script unless it is important for you to be able to launch clusters using the bleeding edge version of Spark. EMR does seem to do a pretty decent job of keeping up to date - the latest version (4.3.0) supports the latest Spark

Serializing collections in Datasets

2016-02-22 Thread Daniel Siegmann
e plans to support serializing arbitrary Seq values in datasets, or must everything be converted to Array? ~Daniel Siegmann

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Daniel Siegmann
During testing you will typically be using some finite data. You want the stream to shut down automatically when that data has been consumed so your test shuts down gracefully. Of course once the code is running in production you'll want it to keep waiting for new records. So whether the stream sh
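As a hedged sketch, a test harness along these lines (results and expectedCount are hypothetical test helpers) waits for the finite input to be consumed and then stops gracefully:

    ssc.start()
    // Wait until the finite test input has been fully processed...
    while (results.size < expectedCount) Thread.sleep(100)
    // ...then shut the stream down gracefully instead of waiting for more data
    ssc.stop(stopSparkContext = false, stopGracefully = true)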

Re: Serializing collections in Datasets

2016-02-23 Thread Daniel Siegmann
Yes, I will test once 1.6.1 RC1 is released. Thanks. On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust wrote: > I think this will be fixed in 1.6.1. Can you test when we post the first > RC? (hopefully later today) > > On Mon, Feb 22, 2016 at 1:51 PM, Daniel Siegmann <

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Daniel Siegmann
How many core nodes does your cluster have? On Tue, Mar 1, 2016 at 4:15 AM, Oleg Ruchovets wrote: > Hi , I am installed EMR 4.3.0 with spark. I tries to enter spark shell but > it looks it does't work and throws exceptions. > Please advice: > > [hadoop@ip-172-31-39-37 conf]$ cd /usr/bin/ > [had

Re: is repartition very cost

2015-12-09 Thread Daniel Siegmann
Each node can have any number of partitions. Spark will try to have a node process partitions which are already on the node for best performance (if you look at the list of tasks in the UI, look under the locality level column). As a rule of thumb, you probably want 2-3 times the number of partiti

Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Daniel Siegmann
oders? -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001
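One common workaround when an encoder is missing, sketched here only for illustration, is a Kryo-based fallback (it stores the value as opaque serialized bytes rather than a real date column):

    import org.apache.spark.sql.{Encoder, Encoders}

    // Fallback encoder; the column is stored as binary, not as a DateType
    implicit val localDateEncoder: Encoder[java.time.LocalDate] =
      Encoders.kryo[java.time.LocalDate]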

Access S3 buckets in multiple accounts

2016-09-27 Thread Daniel Siegmann
have access to the S3 bucket in the EMR cluster's AWS account. Is there any way for Spark to access S3 buckets in multiple accounts? If not, is there any best practice for how to work around this? -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floo

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Daniel Siegmann
Thanks for the help everyone. I was able to get permissions configured for my cluster so it now has access to the bucket on the other account. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Wed, Sep 28, 2016 at 10:03 AM

Re: UseCase_Design_Help

2016-10-05 Thread Daniel Siegmann
I think it's fine to read animal types locally because there are only 70 of them. It's just that you want to execute the Spark actions in parallel. The easiest way to do that is to have only a single action. Instead of grabbing the result right away, I would just add a column for the animal type a

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Daniel Siegmann
tions. Personally, I would just use a separate JSON library (e.g. json4s) to parse this metadata into an object, rather than trying to read it in through Spark. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001
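Roughly what that might look like with json4s (the Metadata fields and sample string are invented):

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    case class Metadata(source: String, version: Int)              // hypothetical shape

    implicit val formats: Formats = DefaultFormats
    val metadataJson = """{"source": "s3", "version": 2}"""        // placeholder input
    val metadata     = parse(metadataJson).extract[Metadata]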

Re: CSV to parquet preserving partitioning

2016-11-15 Thread Daniel Siegmann
Did you try unioning the datasets for each CSV into a single dataset? You may need to put the directory name into a column so you can partition by it. On Tue, Nov 15, 2016 at 8:44 AM, benoitdr wrote: > Hello, > > I'm trying to convert a bunch of csv files to parquet, with the interesting > case
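A sketch of that suggestion (paths, directory names, column names, and options are assumptions):

    import org.apache.spark.sql.functions.lit

    val basePath   = "s3a://bucket/input"                          // placeholder
    val dirs       = Seq("2016-11-01", "2016-11-02")               // placeholder directory names
    val outputPath = "s3a://bucket/output"                         // placeholder

    // Read each directory, tag rows with the directory name, union, and
    // partition the parquet output by that column.
    val perDir = dirs.map { dir =>
      spark.read.option("header", "true").csv(s"$basePath/$dir")
        .withColumn("source_dir", lit(dir))
    }
    val all = perDir.reduce(_ union _)
    all.write.partitionBy("source_dir").parquet(outputPath)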

Re: Few questions on reliability of accumulators value.

2016-12-12 Thread Daniel Siegmann
Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation: http://imranrashid.com/posts/Spark-Accumulators/ On Sun, Dec 11, 2016 at 11:27 AM, Sudev A C wrote: > Please help. > Anyone, any thought

Re: Why does Spark 2.0 change number or partitions when reading a parquet file?

2016-12-22 Thread Daniel Siegmann
ny way to disable it. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic wrote: > Hi, > > I write a randomly generated 30,000-row dataframe to parquet. I verify

Re: Spark #cores

2017-01-18 Thread Daniel Siegmann
I am not too familiar with Spark Standalone, so unfortunately I cannot give you any definite answer. I do want to clarify something though. The properties spark.sql.shuffle.partitions and spark.default.parallelism affect how your data is split up, which will determine the *total* number of tasks,
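For reference, those two properties look like this in a SparkConf (the values are arbitrary examples, not recommendations):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.sql.shuffle.partitions", "400")   // partitions produced by DataFrame/SQL shuffles
      .set("spark.default.parallelism", "400")      // default partition count for RDD shuffles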

Re: [Spark] Accumulators or count()

2017-03-01 Thread Daniel Siegmann
As you noted, Accumulators do not guarantee accurate results except in specific situations. I recommend never using them. This article goes into some detail on the problems with accumulators: http://imranrashid.com/posts/Spark-Accumulators/ On Wed, Mar 1, 2017 at 7:26 AM, Charles O. Bajomo < cha

Re: Deploying Spark Applications. Best Practices And Patterns

2017-04-12 Thread Daniel Siegmann
On Wed, Apr 12, 2017 at 4:11 PM, Sam Elamin wrote: > > When it comes to scheduling Spark jobs, you can either submit to an > already running cluster using things like Oozie or bash scripts, or have a > workflow manager like Airflow or Data Pipeline to create new clusters for > you. We went down t

Documentation on "Automatic file coalescing for native data sources"?

2017-05-16 Thread Daniel Siegmann
nd Google was not helpful. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-26 Thread Daniel Siegmann
any reason not to enable it, but I haven't had any problem with it. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Sat, May 20, 2017 at 9:14 PM, Kabeer Ahmed wrote: > Thank you Takeshi. > > As far as I s

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> no matter what you do and how many nodes you start, in case you have a > single text file, it will not use parallelism. > This is not true, unless the file is small or is gzipped (gzipped files cannot be split).

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> Can you kindly explain how Spark uses parallelism for bigger (say 1GB) > text file? Does it use InputFormat do create multiple splits and creates 1 > partition per split? Also, in case of S3 or NFS, how does the input split > work? I understand for HDFS files are already pre-split so Spark can us

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
On Thu, Sep 28, 2017 at 7:23 AM, Gourav Sengupta wrote: > > I will be very surprised if someone tells me that a 1 GB CSV text file is > automatically split and read by multiple executors in SPARK. It does not > matter whether it stays in HDFS, S3 or any other system. > I can't speak to *any* sys

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
-configuration-options I have no idea why it defaults to a fixed 200 (while default parallelism defaults to a number scaled to your number of cores), or why there are two separate configuration properties. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
ons there are, you will need to coalesce or repartition. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com wrote: > Thanks Daniel! > > I've been wondering that f

Re: Filtering keys after map+combine

2015-02-19 Thread Daniel Siegmann
etwork shuffle, in reduceByKey after map + > combine are done, I would like to filter the keys based on some threshold... > > Is there a way to get the key, value after map+combine stages so that I > can run a filter on the keys ? > > Thanks. > Deb > -- Daniel Siegmann,
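A hedged sketch of the pattern being asked about — reduceByKey (which includes the map-side combine) followed by a filter on the combined values (threshold and the input RDD are placeholders):

    val threshold = 10                                  // placeholder
    val filtered  = pairs                               // hypothetical RDD[(String, Int)]
      .reduceByKey(_ + _)                               // map-side combine happens here
      .filter { case (_, total) => total >= threshold } // then drop keys below the threshold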

Re: SparkSQL production readiness

2015-03-02 Thread Daniel Siegmann
I thought removing the alpha tag just meant the API was stable? Speaking of which, aren't there major changes to the API coming in 1.3? Why are you marking the API as stable before these changes have been widely used? On Sat, Feb 28, 2015 at 5:17 PM, Michael Armbrust wrote: > We are planning to

Re: SparkSQL production readiness

2015-03-02 Thread Daniel Siegmann
OK, good to know data frames are still experimental. Thanks Michael. On Mon, Mar 2, 2015 at 12:37 PM, Michael Armbrust wrote: > We have been using Spark SQL in production for our customers at Databricks > for almost a year now. We also know of some very large production > deployments elsewhere.

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread Daniel Siegmann
An RDD is a Resilient *Distributed* Data set. The partitioning and distribution of the data happens in the background. You'll occasionally need to concern yourself with it (especially to get good performance), but from an API perspective it's mostly invisible (some methods do allow you to specify a

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread Daniel Siegmann
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs w
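Stated as a tiny sketch (the predicate and the pair RDDs are placeholders):

    // Filter each side before joining so less data is shuffled across the network
    def keep(n: Int): Boolean = n > 0                    // placeholder predicate
    val joined = left.filter { case (_, v) => keep(v) }  // left, right: hypothetical RDD[(String, Int)]
      .join(right.filter { case (_, w) => keep(w) })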

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-13 Thread Daniel Siegmann
On Thu, Mar 12, 2015 at 1:45 AM, wrote: > > In your response you say “When you call reduce and *similar *methods, > each partition can be reduced in parallel. Then the results of that can be > transferred across the network and reduced to the final result”. By similar > methods do you mean all ac

Re: Setup Spark jobserver for Spark SQL

2015-04-02 Thread Daniel Siegmann
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika wrote: > Hi,

Unit testing with HiveContext

2015-04-08 Thread Daniel Siegmann
I am trying to unit test some code which takes an existing HiveContext and uses it to execute a CREATE TABLE query (among other things). Unfortunately I've run into some hurdles trying to unit test this, and I'm wondering if anyone has a good approach. The metastore DB is automatically created in

Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
tastorePath;create=true") > setConf("hive.metastore.warehouse.dir", warehousePath.toString) > } > > Cheers > > On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann < > daniel.siegm...@teamaol.com> wrote: > >> I am trying to unit test some code which takes
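Expanded into a self-contained sketch of that approach (the temp paths and names here are assumptions, freshly generated per test run):

    import java.io.File
    import java.util.UUID
    import org.apache.spark.sql.hive.HiveContext

    // Throwaway locations so each test run gets a fresh metastore and warehouse
    val tmp           = System.getProperty("java.io.tmpdir")
    val metastorePath = new File(tmp, s"metastore-${UUID.randomUUID()}").getCanonicalPath
    val warehousePath = new File(tmp, s"warehouse-${UUID.randomUUID()}").getCanonicalPath

    val hiveContext = new HiveContext(sc)
    hiveContext.setConf("javax.jdo.option.ConnectionURL",
      s"jdbc:derby:;databaseName=$metastorePath;create=true")
    hiveContext.setConf("hive.metastore.warehouse.dir", warehousePath)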

Re: How can I dispose an Accumulator?

2014-06-04 Thread Daniel Siegmann
e: > > Hi, > > > > > > > > How can I dispose an Accumulator? > > > > It has no method like 'unpersist()' which Broadcast provides. > > > > > > > > Thanks. > > > > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Daniel Siegmann
12:22 AM, Patrick Wendell >> wrote: >> >> We can just add back a flag to make it backwards compatible - it was >> >> just missed during the original PR. >> >> >> >> Adding a *third* set of "clobber" semantics, I'm slightly -1 o

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Daniel Siegmann
> On Thursday, June 12, 2014 at 2:39 PM, Daniel Siegmann wrote: > The old behavior (A) was dangerous, so it's good that (B) is now the > default. But in some cases I really do want to replace the old data, as per > (C). For example, I may rerun a previous computation (perhaps

Re: Not fully cached when there is enough memory

2014-06-12 Thread Daniel Siegmann
; ones out when there is not enough memory. I saw similar glitches but >> the storage info per partition is correct. If you find a way to >> reproduce this error, please create a JIRA. Thanks! -Xiangrui >> > > -- Daniel Siegmann, Software Developer Velos Accelerating M

Re: Question about RDD cache, unpersist, materialization

2014-06-12 Thread Daniel Siegmann
t >> already unpersisted. >> // So, rebuilding all 10 rdds will occur. >> rddUnion.saveAsTextFile(mergedFileName); >> } >> >> If rddUnion can be materialized before the rdd.unpersist() line and >> cache()d, the rdds in the loop will not be needed on >

Re: guidance on simple unit testing with Spark

2014-06-16 Thread Daniel Siegmann
"A GetInfo job" should { > > //* How do I pass "data" define above as input and output > > which GetInfo expects as arguments? ** > > val sc = new SparkContext("local", "GetInfo") > > > >

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Daniel Siegmann
>>>>>> On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas < >>>>>> nicholas.cham...@gmail.com> wrote: >>>>>> >>>>>>> What do you get for rdd1._jrdd.splits().size()? You might think >>>>>>> you’re getting > 100 par

Re: Map with filter on JavaRdd

2014-06-27 Thread Daniel Siegmann
st archive at Nabble.com. > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Control number of tasks per stage

2014-07-07 Thread Daniel Siegmann
t; > > Thank you, > Konstantin Kudryavtsev > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Comparative study

2014-07-07 Thread Daniel Siegmann
ectronic communications with Accenture and its affiliates, > including e-mail and instant messaging (including content), may be scanned > by our systems for the purposes of information security and assessment of > internal compliance with Accenture policy. > > _

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
;> "Scalding". It's built on top of Cascading. If you have a huge dataset or >> if you consider using map/reduce engine for your job, for any reason, you >> can try Scalding. >> > > PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
> > > Daniel, > > Do you mind sharing the size of your cluster and the production data > volumes ? > > Thanks > Soumya > > On Jul 7, 2014, at 3:39 PM, Daniel Siegmann > wrote: > > From a development perspective, I vastly prefer Spark to MapReduce. The > Ma

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
. Each node had 24 cores and > 2 workers each. Each executor got 14 GB of memory. > > -Suren > > > > On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey > wrote: > >> When you say "large data sets", how large? >> Thanks >> >> >> On 07/07/2014 0

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
GB of memory. >> >> -Suren >> >> >> >> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey >> wrote: >> >>> When you say "large data sets", how large? >>> Thanks >>> >>> >>> On 07/07/2014 01:39 PM, Dan

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Daniel Siegmann
From the data > injector and "Streaming" tab of web ui, it's running well. > > However, I see quite a lot of Active stages in web ui even some of them > have all of their tasks completed. > > I attach a screenshot for your reference. > > Do you ever see this k

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Daniel Siegmann
>> Thanks, >> Rahul Kumar Bhojwani >> 3rd year, B.Tech >> Computer Science Engineering >> National Institute Of Technology, Karnataka >> 9945197359 >> > > > > -- > Rahul K Bhojwani > 3rd Year B.Tech > Computer Science and Engineer

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
e(# nodes) seems to just allocate > one task per core, and so runs out of memory on the node. Is there any way > to give the scheduler a hint that the task uses lots of memory and cores so > it spreads it out more evenly? > > Thanks, > > Ravi Pandya > Microsoft Research >

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
he *only* thing you run on the cluster, you could also > configure the Workers to only report one core by manually launching the > spark.deploy.worker.Worker process with that flag (see > http://spark.apache.org/docs/latest/spark-standalone.html). > > Matei > > On Jul 14, 2014,

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Daniel Siegmann
(x,y) => x+y).collect >> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), >> (P(bob),1), (P(abe),1), (P(charly),1)) >> >> In contrast to the expected behavior, that should be equivalent to: >> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-25 Thread Daniel Siegmann
e-tp10617.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Daniel Siegmann
>> > Context for JUnit >> > >> > >> > >> > -- >> > View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Testing-JUnit-with-Spark-tp10861.html >> > Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Daniel Siegmann
'filter' and the default is the total number of cores available. > > I'm fairly new with Spark so maybe I'm just missing or misunderstanding > something fundamental. Any help would be appreciated. > > Thanks. > > Darin. > > -- Daniel Siegmann, Software Devel

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Daniel Siegmann
n what the > documentation states). What would I want that value to be based on my > configuration below? Or, would I leave that alone? > > -- > *From:* Daniel Siegmann > *To:* user@spark.apache.org; Darin McBeath > *Sent:* Wednesday, July 30, 2014 5

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
; > ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3 -t > m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2 > *my-cluster* > > > ------ > *From:* Daniel Siegmann > *To:* Darin McBeath > *Cc:* Daniel Siegm

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
urrent tasks you can execute at one time. If you want more parallelism, > I think you just need more cores in your cluster--that is, bigger nodes, or > more nodes. > > Daniel, > > Have you been able to get around this limit? > > Nick > > > > On Fri, Aug 1, 2014 at 11

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
--- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
st mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Daniel Siegmann, Software Develope

Re: heterogeneous cluster hardware

2014-08-21 Thread Daniel Siegmann
nal commands, e-mail: [hidden email] > >> > > > > > > > > If you reply to this email, your message will be added to the discussion > > below: > > > http://apache-spark-user-list.1001560.n3.nabble.com/heterogeneous-cluste

Re: Development environment issues

2014-08-25 Thread Daniel Siegmann
? > sbt or maven? > eclipse or idea? > jdk7 or 8? > I'm using Java 7 and Scala 2.10.x (not every framework I use supports later versions). SBT because I use the Play Framework, but I miss Maven. I haven't tried IntelliJ's Scala support, but it's probably worth a shot.

Re: Q on downloading spark for standalone cluster

2014-08-28 Thread Daniel Siegmann
or Hadoop is needed or mandatory for using Spark? that's not the > understanding I've. My understanding is that you can use spark with Hadoop > if you like from yarn2 but you could use spark standalone also without > hadoop. > > Please assist. I'm confused ! > > -Sanjeev > > > -

Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann
ate-results-tp13062.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >

Re: Where to save intermediate results?

2014-09-02 Thread Daniel Siegmann
------ > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Filter function problem

2014-09-09 Thread Daniel Siegmann
park User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH A

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
sage from your system. > This message and any attachments may contain information that is > confidential, privileged or exempt from disclosure. Delivery of this > message to any person other than the intended recipient is not intended to > waive any right or privilege. Message transmission is not guaranteed to be > secure or free of software viruses. > > *** > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: mappartitions data size

2014-09-26 Thread Daniel Siegmann
scr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How to do operations on multiple RDD's

2014-09-26 Thread Daniel Siegmann
e an array of maps with values as keys and frequency as > values. > > Essentially I want something like zipPartitions but for arbitrarily many > RDD's, is there any such functionality or how would I approach this problem? > > Cheers, > > Johan > -- Daniel

Re: about partition number

2014-09-29 Thread Daniel Siegmann
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How to get SparckContext inside mapPartitions?

2014-10-01 Thread Daniel Siegmann
, please kindly > reply to the sender indicating this fact and delete all copies of it from > your computer and network server immediately. Your cooperation is highly > appreciated. It is advised that any unauthorized use of confidential > information of Winbond is strictly prohibited; and any i

Re: Spark inside Eclipse

2014-10-02 Thread Daniel Siegmann
> I am running Eclipse Kepler on a Macbook Pro with Mavericks >> Like one can run hadoop map/reduce applications from within Eclipse and >> debug and learn. >> >> thanks >> >> sanjay >> > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Play framework

2014-10-16 Thread Daniel Siegmann
ou have figured out how to build and run a Play app with Spark-submit, > I would appreciate if you could share the steps and the sbt settings for > your Play app. > > > > Thanks, > > Mohammed > > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Unit testing: Mocking out Spark classes

2014-10-16 Thread Daniel Siegmann
nce{ > inAnyOrder{ > (sparkContext.broadcast[DatasetLoader] > _).expects(trainingDatasetLoader).returns(broadcastTrainingDatasetLoader) > } > } > > val sparkInvoker = new SparkJobInvoker(sparkContext, > trainingDatasetLoader) > > when(inputRDD.mapPar

Re: SparkContext.stop() ?

2014-10-31 Thread Daniel Siegmann
> Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org >

Re: Custom persist or cache of RDD?

2014-11-11 Thread Daniel Siegmann
d D as parquet files. > > > > I'm wondering if spark can restore B and D from the parquet files using a > > customized persist and restore procedure? > > > > > > > > > > ----- > To

Re: Is there a way to clone a JavaRDD without persisting it

2014-11-12 Thread Daniel Siegmann
zes without destroying the RDD for sibsequent > processing. persist will do this but these are big and perisist seems > expensive and I am unsure of which StorageLevel is needed, Is there a way > to clone a JavaRDD or does anyong have good ideas on how to do this? > -- Dan

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
; I tried to lookup online but haven't found any pointers so far. > > > Thanks > pala > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io
