Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
ons there are, you will need to coalesce or repartition. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com wrote: > Thanks Daniel! > > I've been wondering that f
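As a minimal sketch of the two options (the dataframe df and the target counts are hypothetical): coalesce only merges existing partitions and avoids a full shuffle, while repartition does a full shuffle to produce exactly the requested number.

    val fewer = df.coalesce(8)        // merge down to at most 8 partitions, no full shuffle
    val exact = df.repartition(200)   // full shuffle, exactly 200 partitions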

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
-configuration-options I have no idea why it defaults to a fixed 200 (while default parallelism defaults to a number scaled to your number of cores), or why there are two separate configuration properties. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New
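A hedged sketch of overriding both properties when building the session (the values are arbitrary): spark.sql.shuffle.partitions governs DataFrame/SQL shuffles, while spark.default.parallelism governs RDD shuffles.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-config")
      .config("spark.sql.shuffle.partitions", "64")   // DataFrame/SQL shuffle partitions
      .config("spark.default.parallelism", "64")      // RDD shuffle partitions
      .getOrCreate()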

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
On Thu, Sep 28, 2017 at 7:23 AM, Gourav Sengupta wrote: > > I will be very surprised if someone tells me that a 1 GB CSV text file is > automatically split and read by multiple executors in SPARK. It does not > matter whether it stays in HDFS, S3 or any other system. > I can't speak to *any* sys

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> Can you kindly explain how Spark uses parallelism for bigger (say 1GB) > text file? Does it use InputFormat do create multiple splits and creates 1 > partition per split? Also, in case of S3 or NFS, how does the input split > work? I understand for HDFS files are already pre-split so Spark can us

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> no matter what you do and how many nodes you start, in case you have a > single text file, it will not use parallelism. > This is not true, unless the file is small or is gzipped (gzipped files cannot be split).
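One quick way to see this for yourself (paths are hypothetical) is to compare the partition counts after reading a plain text file and a gzipped one:

    val plain  = sc.textFile("s3://bucket/data.csv")      // splittable: usually several partitions
    val zipped = sc.textFile("s3://bucket/data.csv.gz")   // gzip is not splittable: a single partition
    println(s"plain=${plain.getNumPartitions}, gz=${zipped.getNumPartitions}")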

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-26 Thread Daniel Siegmann
any reason not to enable it, but I haven't had any problem with it. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Sat, May 20, 2017 at 9:14 PM, Kabeer Ahmed wrote: > Thank you Takeshi. > > As far as I s

Documentation on "Automatic file coalescing for native data sources"?

2017-05-16 Thread Daniel Siegmann
nd Google was not helpful. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001

Re: Deploying Spark Applications. Best Practices And Patterns

2017-04-12 Thread Daniel Siegmann
On Wed, Apr 12, 2017 at 4:11 PM, Sam Elamin wrote: > > When it comes to scheduling Spark jobs, you can either submit to an > already running cluster using things like Oozie or bash scripts, or have a > workflow manager like Airflow or Data Pipeline to create new clusters for > you. We went down t

Re: [Spark] Accumulators or count()

2017-03-01 Thread Daniel Siegmann
As you noted, Accumulators do not guarantee accurate results except in specific situations. I recommend never using them. This article goes into some detail on the problems with accumulators: http://imranrashid.com/posts/Spark-Accumulators/ On Wed, Mar 1, 2017 at 7:26 AM, Charles O. Bajomo < cha
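To make the trade-off concrete, a small sketch (the validity check is hypothetical): an accumulator bumped inside a transformation can over-count when tasks are retried or a stage is recomputed, while a count() on a filtered RDD is an action and gives an exact answer.

    val input = sc.parallelize(Seq("a,1", "b,2", ""))
    def isValid(line: String): Boolean = line.nonEmpty   // hypothetical check

    // Risky: the accumulator may over-count if tasks are re-executed
    val badRecords = sc.longAccumulator("badRecords")
    val cleaned = input.filter { line =>
      if (!isValid(line)) badRecords.add(1)
      isValid(line)
    }

    // Reliable: count() is an action, so the result is exact
    val badCount = input.filter(line => !isValid(line)).count()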

Re: Spark #cores

2017-01-18 Thread Daniel Siegmann
I am not too familiar with Spark Standalone, so unfortunately I cannot give you any definite answer. I do want to clarify something though. The properties spark.sql.shuffle.partitions and spark.default.parallelism affect how your data is split up, which will determine the *total* number of tasks,

Re: Why does Spark 2.0 change number or partitions when reading a parquet file?

2016-12-22 Thread Daniel Siegmann
ny way to disable it. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic wrote: > Hi, > > I write a randomly generated 30,000-row dataframe to parquet. I verify

Re: Few questions on reliability of accumulators value.

2016-12-12 Thread Daniel Siegmann
Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation: http://imranrashid.com/posts/Spark-Accumulators/ On Sun, Dec 11, 2016 at 11:27 AM, Sudev A C wrote: > Please help. > Anyone, any thought

Re: CSV to parquet preserving partitioning

2016-11-15 Thread Daniel Siegmann
Did you try unioning the datasets for each CSV into a single dataset? You may need to put the directory name into a column so you can partition by it. On Tue, Nov 15, 2016 at 8:44 AM, benoitdr wrote: > Hello, > > I'm trying to convert a bunch of csv files to parquet, with the interesting > case
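A rough sketch of that idea (directory names, CSV options, and the output path are all hypothetical):

    import org.apache.spark.sql.functions.lit

    val dirs = Seq("2016-11-01", "2016-11-02")
    val combined = dirs
      .map(d => spark.read.option("header", "true").csv(s"/input/$d").withColumn("dir", lit(d)))
      .reduce(_ union _)
    combined.write.partitionBy("dir").parquet("/output/parquet")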

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Daniel Siegmann
tions. Personally, I would just use a separate JSON library (e.g. json4s) to parse this metadata into an object, rather than trying to read it in through Spark. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001

Re: UseCase_Design_Help

2016-10-05 Thread Daniel Siegmann
I think it's fine to read animal types locally because there are only 70 of them. It's just that you want to execute the Spark actions in parallel. The easiest way to do that is to have only a single action. Instead of grabbing the result right away, I would just add a column for the animal type a

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Daniel Siegmann
Thanks for the help everyone. I was able to get permissions configured for my cluster so it now has access to the bucket on the other account. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Wed, Sep 28, 2016 at 10:03 AM

Access S3 buckets in multiple accounts

2016-09-27 Thread Daniel Siegmann
have access to the S3 bucket in the EMR cluster's AWS account. Is there any way for Spark to access S3 buckets in multiple accounts? If not, is there any best practice for how to work around this? -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floo

Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Daniel Siegmann
oders? -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001

Re: What are using Spark for

2016-08-02 Thread Daniel Siegmann
Yes, you can use Spark for ETL, as well as feature engineering, training, and scoring. ~Daniel Siegmann On Tue, Aug 2, 2016 at 3:29 PM, Mich Talebzadeh wrote: > Hi, > > If I may say, if you spend sometime going through this mailing list in > this forum and see the variety of topic

Re: Apache design patterns

2016-06-09 Thread Daniel Siegmann
On Tue, Jun 7, 2016 at 11:43 PM, Francois Le Roux wrote: > 1. Should I use dataframes to ‘pull the source data? If so, do I do > a groupby and order by as part of the SQL query? > Seems reasonable. If you use Scala you might want to define a case class and convert the data frame to a datase
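For example, a minimal sketch (the case class fields are hypothetical and must line up with the data frame's column names and types; assumes a Spark 2.x SparkSession named spark):

    case class Event(userId: String, eventTime: java.sql.Timestamp, value: Double)

    import spark.implicits._
    val events = df.as[Event]   // typed Dataset; groupBy/orderBy still available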

Re: Saving Parquet files to S3

2016-06-09 Thread Daniel Siegmann
I don't believe there's anyway to output files of a specific size. What you can do is partition your data into a number of partitions such that the amount of data they each contain is around 1 GB. On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain wrote: > Hello Team, > > > > I want to write parquet fil
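A back-of-the-envelope sketch of that approach (the total input size here is hypothetical; in practice you might estimate it from the source files):

    val totalBytes = 50L * 1024 * 1024 * 1024                             // ~50 GB of input (hypothetical)
    val targetPartitions = math.max(1, (totalBytes / (1L << 30)).toInt)   // aim for roughly 1 GB per partition
    df.repartition(targetPartitions).write.parquet("s3://bucket/out/")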

Re: [ML] Training with bias

2016-04-12 Thread Daniel Siegmann
: String = fitIntercept: whether to fit an intercept term (default: > true) > > On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann > wrote: > >> I'm trying to understand how I can add a bias when training in Spark. I >> have only a vague familiarity with this subject, so I

[ML] Training with bias

2016-04-11 Thread Daniel Siegmann
at would just be part of the model. ~Daniel Siegmann

Re: cluster randomly re-starting jobs

2016-03-21 Thread Daniel Siegmann
e if there are multiple attempts. You can also see it in the Spark history server (under incomplete applications, if the second attempt is still running). ~Daniel Siegmann On Mon, Mar 21, 2016 at 9:58 AM, Ted Yu wrote: > Can you provide a bit more information ? > > Release of Spark an

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
There are potential > solutions to these but they haven't been implemented as yet. > > On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann > wrote: > >> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath > > wrote: >> >>> Would you mind letting us know the # t

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote: > Would you mind letting us know the # training examples in the datasets? > Also, what do your features look like? Are they text, categorical etc? You > mention that most rows only have a few features, and all rows together have > a few 10,00

Re: Spark ML - Scaling logistic regression for many features

2016-03-10 Thread Daniel Siegmann
for a 20 million > size dense weight vector (which should only be a few 100MB memory), so > perhaps something else is going on. > > Nick > > On Tue, 8 Mar 2016 at 22:55 Daniel Siegmann > wrote: > >> Just for the heck of it I tried the old MLlib implementation, but it had &

Re: Spark ML - Scaling logistic regression for many features

2016-03-08 Thread Daniel Siegmann
g.apache.spark.mllib.classification.LogisticRegressionWithLBFGS >> >> Only downside is that you can't use the pipeline framework from spark ml. >> >> Cheers, >> Devin >> >> >> >> On Mon, Mar 7, 2016 at 4:54 PM, Daniel Siegmann < >> dan

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
e are you using to train the model? If you haven't > tried yet, you should consider the SparseVector > > > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector > > > On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann < > d

Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
transparently? Any advice would be appreciated. ~Daniel Siegmann

Re: Serializing collections in Datasets

2016-03-03 Thread Daniel Siegmann
I have confirmed this is fixed in Spark 1.6.1 RC 1. Thanks. On Tue, Feb 23, 2016 at 1:32 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Yes, I will test once 1.6.1 RC1 is released. Thanks. > > On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust > wrote: > &g

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-02 Thread Daniel Siegmann
In the past I have seen this happen when I filled up HDFS and some core nodes became unhealthy. There was no longer anywhere to replicate the data. From your command it looks like you should have 1 master and 2 core nodes in your cluster. Can you verify both the core nodes are healthy? On Wed, Ma

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Daniel Siegmann
How many core nodes does your cluster have? On Tue, Mar 1, 2016 at 4:15 AM, Oleg Ruchovets wrote: > Hi, I installed EMR 4.3.0 with Spark. I tried to enter the spark shell but > it looks like it doesn't work and throws exceptions. > Please advise: > > [hadoop@ip-172-31-39-37 conf]$ cd /usr/bin/ > [had

Re: Serializing collections in Datasets

2016-02-23 Thread Daniel Siegmann
Yes, I will test once 1.6.1 RC1 is released. Thanks. On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust wrote: > I think this will be fixed in 1.6.1. Can you test when we post the first > RC? (hopefully later today) > > On Mon, Feb 22, 2016 at 1:51 PM, Daniel Siegmann <

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Daniel Siegmann
During testing you will typically be using some finite data. You want the stream to shut down automatically when that data has been consumed so your test shuts down gracefully. Of course once the code is running in production you'll want it to keep waiting for new records. So whether the stream sh
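One simple way to arrange that in a test (the timeout is arbitrary) is to wait long enough for the finite input to drain and then stop the streaming context gracefully so in-flight batches finish:

    ssc.start()
    ssc.awaitTerminationOrTimeout(60000L)                        // give the finite test input time to drain
    ssc.stop(stopSparkContext = true, stopGracefully = true)     // finish in-flight batches, then shut down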

Serializing collections in Datasets

2016-02-22 Thread Daniel Siegmann
e plans to support serializing arbitrary Seq values in datasets, or must everything be converted to Array? ~Daniel Siegmann

Re: Is this likely to cause any problems?

2016-02-19 Thread Daniel Siegmann
With EMR supporting Spark, I don't see much reason to use the spark-ec2 script unless it is important for you to be able to launch clusters using the bleeding edge version of Spark. EMR does seem to do a pretty decent job of keeping up to date - the latest version (4.3.0) supports the latest Spark

Re: data type transform when creating an RDD object

2016-02-17 Thread Daniel Siegmann
This should do it (for the implementation of your parse method, Google should easily provide information - SimpleDateFormat is probably what you want): def parseDate(s: String): java.sql.Date = { ... } val people = sc.textFile("examples/src/main/resources/people.txt") .map(_.spli
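A hedged completion of that sketch (the date pattern is hypothetical, and SimpleDateFormat's lack of thread safety is ignored here):

    import java.text.SimpleDateFormat

    def parseDate(s: String): java.sql.Date = {
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      new java.sql.Date(fmt.parse(s).getTime)
    }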

Re: Spark 2.0.0 release plan

2016-01-27 Thread Daniel Siegmann
Will there continue to be monthly releases on the 1.6.x branch during the additional time for bug fixes and such? On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers wrote: > thanks thats all i needed > > On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen wrote: > >> I think it will come significantly late

Re: Too many tasks killed the scheduler

2016-01-12 Thread Daniel Siegmann
As I understand it, your initial number of partitions will always depend on the initial data. I'm not aware of any way to change this, other than changing the configuration of the underlying data store. Have you tried reading the data in several data frames (e.g. one data frame per day), coalescin
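Roughly, that idea looks like the following (paths, dates, and the coalesce target are hypothetical; on Spark 1.x the DataFrame method is unionAll):

    val days = Seq("2016-01-10", "2016-01-11", "2016-01-12")
    val combined = days
      .map(d => sqlContext.read.parquet(s"/data/events/day=$d").coalesce(32))
      .reduce(_ unionAll _)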

Zip data frames

2015-12-29 Thread Daniel Siegmann
RDD has methods to zip with another RDD or with an index, but there's no equivalent for data frames. Anyone know a good way to do this? I thought I could just convert to RDD, do the zip, and then convert back, but ... 1. I don't see a way (outside developer API) to convert RDD[Row] directly
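For the zip-with-index case, a sketch of the RDD round trip (the new column name is arbitrary):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val withIdx = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
    val schema = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
    val indexedDf = sqlContext.createDataFrame(withIdx, schema)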

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Daniel Siegmann
DataFrames are a higher level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary. DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to thinkin

Re: is repartition very cost

2015-12-09 Thread Daniel Siegmann
Each node can have any number of partitions. Spark will try to have a node process partitions which are already on the node for best performance (if you look at the list of tasks in the UI, look under the locality level column). As a rule of thumb, you probably want 2-3 times the number of partiti
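As a hedged illustration of that rule of thumb (the input path is hypothetical):

    val rdd = sc.textFile("hdfs:///data/big-input")
    val target = sc.defaultParallelism * 3        // roughly 3 partitions per available core
    val rebalanced = rdd.repartition(target)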

Re: Unit tests of spark application

2015-07-10 Thread Daniel Siegmann
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire wrote: > I want to write junit test cases in scala for testing spark application. > Is there any guide or link which I can refer. > https://spark.apache.org/docs/latest/programming-guide.html#unit-testing Typically I create test data using SparkCo
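A minimal self-contained sketch of that pattern, using plain assertions rather than any particular test framework:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val input = sc.parallelize(Seq(1, 2, 3))
      val doubled = input.map(_ * 2).collect().sorted
      assert(doubled.sameElements(Array(2, 4, 6)))
    } finally {
      sc.stop()
    }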

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins: http://scala-ide.org/download/current.html Define your project in Maven with Scala plugins configured (you should be able to find documentation online) and import as an existing Maven project. The source code should be in src/ma

Re: Want to avoid groupByKey as its running for ever

2015-06-30 Thread Daniel Siegmann
If the number of items is very large, have you considered using probabilistic counting? The HyperLogLogPlus class from stream-lib
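If exact counts aren't required, Spark's built-in approximate distinct counting (which uses a HyperLogLog-style sketch internally) avoids the groupByKey entirely -- a rough sketch with a hypothetical pair RDD:

    val pairs = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("b", "x")))   // (key, item)
    val approxDistinctPerKey = pairs.countApproxDistinctByKey(0.05)       // 0.05 = target relative error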

Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
tastorePath;create=true") > setConf("hive.metastore.warehouse.dir", warehousePath.toString) > } > > Cheers > > On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann < > daniel.siegm...@teamaol.com> wrote: > >> I am trying to unit test some code which takes

Unit testing with HiveContext

2015-04-08 Thread Daniel Siegmann
I am trying to unit test some code which takes an existing HiveContext and uses it to execute a CREATE TABLE query (among other things). Unfortunately I've run into some hurdles trying to unit test this, and I'm wondering if anyone has a good approach. The metastore DB is automatically created in
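Piecing together the configuration quoted in the reply above, a hedged sketch of pointing a throwaway HiveContext at local temp directories (the paths are hypothetical and must not already exist):

    val tmp = java.nio.file.Files.createTempDirectory("hive-test")
    val metastorePath = tmp.resolve("metastore")
    val warehousePath = tmp.resolve("warehouse")

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.setConf("javax.jdo.option.ConnectionURL",
      s"jdbc:derby:;databaseName=$metastorePath;create=true")
    hiveContext.setConf("hive.metastore.warehouse.dir", warehousePath.toString)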

Re: Setup Spark jobserver for Spark SQL

2015-04-02 Thread Daniel Siegmann
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika wrote: > Hi,

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-13 Thread Daniel Siegmann
On Thu, Mar 12, 2015 at 1:45 AM, wrote: > > In your response you say “When you call reduce and *similar *methods, > each partition can be reduced in parallel. Then the results of that can be > transferred across the network and reduced to the final result”. By similar > methods do you mean all ac

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread Daniel Siegmann
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs w
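Schematically (the RDDs and the predicate are hypothetical):

    case class User(name: String, isActive: Boolean)
    val users  = sc.parallelize(Seq((1, User("ann", true)), (2, User("bob", false))))
    val events = sc.parallelize(Seq((1, "login"), (2, "logout")))

    // filter first so less data is shuffled by the join
    val joined = users.filter { case (_, u) => u.isActive }.join(events)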

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread Daniel Siegmann
An RDD is a Resilient *Distributed* Data set. The partitioning and distribution of the data happens in the background. You'll occasionally need to concern yourself with it (especially to get good performance), but from an API perspective it's mostly invisible (some methods do allow you to specify a

Re: SparkSQL production readiness

2015-03-02 Thread Daniel Siegmann
OK, good to know data frames are still experimental. Thanks Michael. On Mon, Mar 2, 2015 at 12:37 PM, Michael Armbrust wrote: > We have been using Spark SQL in production for our customers at Databricks > for almost a year now. We also know of some very large production > deployments elsewhere.

Re: SparkSQL production readiness

2015-03-02 Thread Daniel Siegmann
I thought removing the alpha tag just meant the API was stable? Speaking of which, aren't there major changes to the API coming in 1.3? Why are you marking the API as stable before these changes have been widely used? On Sat, Feb 28, 2015 at 5:17 PM, Michael Armbrust wrote: > We are planning to

Re: Filtering keys after map+combine

2015-02-19 Thread Daniel Siegmann
etwork shuffle, in reduceByKey after map + > combine are done, I would like to filter the keys based on some threshold... > > Is there a way to get the key, value after map+combine stages so that I > can run a filter on the keys ? > > Thanks. > Deb > -- Daniel Siegmann,

Re: Escape commas in file names

2014-12-26 Thread Daniel Siegmann
Thanks for the replies. Hopefully this will not be too difficult to fix. Why not support multiple paths by overloading the parquetFile method to take a collection of strings? That way we don't need an appropriate delimiter. On Thu, Dec 25, 2014 at 3:46 AM, Cheng, Hao wrote: > I’ve created a ji

Escape commas in file names

2014-12-23 Thread Daniel Siegmann
I am trying to load a Parquet file which has a comma in its name. Yes, this is a valid file name in HDFS. However, sqlContext.parquetFile interprets this as a comma-separated list of parquet files. Is there any way to escape the comma so it is treated as part of a single file name? -- Daniel

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
e: >> >>> Say I have two RDDs with the following values >>> >>> x = [(1, 3), (2, 4)] >>> >>> and >>> >>> y = [(3, 5), (4, 7)] >>> >>> and I want to have >>> >>> z = [(1, 3),

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
3), (2, 4)] > > and > > y = [(3, 5), (4, 7)] > > and I want to have > > z = [(1, 3), (2, 4), (3, 5), (4, 7)] > > How can I achieve this. I know you can use outerJoin followed by map to > achieve this, but is there a more direct way for this. > -- Daniel
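Since the keys never overlap, the direct approach is a plain union -- a minimal sketch:

    val x = sc.parallelize(Seq((1, 3), (2, 4)))
    val y = sc.parallelize(Seq((3, 5), (4, 7)))
    val z = x.union(y)   // [(1,3), (2,4), (3,5), (4,7)]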

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
at 7:45 PM, Michael Armbrust wrote: > I think you should also be able to get away with casting it back and forth > in this case using .asInstanceOf. > > On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann > wrote: > >> I have a class which is a subclass of Tupl

PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
uld be to define my own equivalent of PairRDDFunctions which works with my class, does type conversions to Tuple2, and delegates to PairRDDFunctions. Does anyone know a better way? Anyone know if there will be a significant performance penalty with that approach? -- Daniel Siegmann, Software Dev
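A sketch of the cast-based workaround suggested in the reply above (the subclass and its key/value types are hypothetical; the cast is safe at runtime because of erasure):

    import org.apache.spark.rdd.RDD

    class Score(user: String, points: Int) extends Tuple2[String, Int](user, points)

    val scores = sc.parallelize(Seq(new Score("ann", 3), new Score("bob", 5), new Score("ann", 2)))
    val totals = scores.asInstanceOf[RDD[(String, Int)]].reduceByKey(_ + _)   // PairRDDFunctions now applies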

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
inalRow) > as map/reduce tasks such that rows with same original string key get same > numeric consecutive key? > > Any hints? > > best, > /Shahab > > ​ > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io

Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
Is there a mechanism similar to MR where we can ensure each > partition is assigned some amount of data by size, by setting some block > size parameter? > > > > On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann > wrote: > >> On Thu, Nov 13, 2014 at 3:24 PM, Pala

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
ulables seem to have some extra complications and overhead. > > > > So… > > > > What’s the real difference between an accumulator/accumulable and > aggregating an RDD? When is one method of aggregation preferred over the > other? > > > > Thanks, > >

Re: How do you force a Spark Application to run in multiple tasks

2014-11-17 Thread Daniel Siegmann
I've never used Mesos, sorry. On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis wrote: > The cluster runs Mesos and I can see the tasks in the Mesos UI but most > are not doing much - any hints about that UI > > On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann < > daniel.s

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Daniel Siegmann
g MacAddress to determine which machine is running the > code. > As far as I can tell a simple word count is running in one thread on one > machine and the remainder of the cluster does nothing, > This is consistent with tests where I write to sdout from functions and > see little

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
very RDD:

    val rdds = paths.map { path => sc.textFile(path).map(myFunc) }
    val completeRdd = sc.union(rdds)

Does that make any sense? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io

Re: Accessing RDD within another RDD map

2014-11-13 Thread Daniel Siegmann
) > > The same goes for any other action I am trying to perform inside the map > statement. I am failing to understand what I am doing wrong. > Can anyone help with this? > > Thanks, > Simone Franzini, PhD > > http://www.linkedin.com/in/simonefranzini > -- Daniel Siegm

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
hu, Nov 13, 2014 at 10:11 AM, Rishi Yadav wrote: > If your data is in hdfs and you are reading as textFile and each file is > less than block size, my understanding is it would always have one > partition per file. > > > On Thursday, November 13, 2014, Daniel Siegmann > wrote: &

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
; I tried to lookup online but haven't found any pointers so far. > > > Thanks > pala > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io

Re: Is there a way to clone a JavaRDD without persisting it

2014-11-12 Thread Daniel Siegmann
zes without destroying the RDD for sibsequent > processing. persist will do this but these are big and perisist seems > expensive and I am unsure of which StorageLevel is needed, Is there a way > to clone a JavaRDD or does anyong have good ideas on how to do this? > -- Dan

Re: Custom persist or cache of RDD?

2014-11-11 Thread Daniel Siegmann
d D as parquet files. > > I'm wondering if spark can restore B and D from the parquet files using a > customized persist and restore procedure?

Re: SparkContext.stop() ?

2014-10-31 Thread Daniel Siegmann

Re: Unit testing: Mocking out Spark classes

2014-10-16 Thread Daniel Siegmann
nce{ > inAnyOrder{ > (sparkContext.broadcast[DatasetLoader] > _).expects(trainingDatasetLoader).returns(broadcastTrainingDatasetLoader) > } > } > > val sparkInvoker = new SparkJobInvoker(sparkContext, > trainingDatasetLoader) > > when(inputRDD.mapPar

Re: Play framework

2014-10-16 Thread Daniel Siegmann
ou have figured out how to build and run a Play app with Spark-submit, > I would appreciate if you could share the steps and the sbt settings for > your Play app. > > > > Thanks, > > Mohammed > > > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Spark inside Eclipse

2014-10-02 Thread Daniel Siegmann
> I am running Eclipse Kepler on a Macbook Pro with Mavericks >> Like one can run hadoop map/reduce applications from within Eclipse and >> debug and learn. >> >> thanks >> >> sanjay >> > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How to get SparckContext inside mapPartitions?

2014-10-01 Thread Daniel Siegmann

Re: about partition number

2014-09-29 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How to do operations on multiple RDD's

2014-09-26 Thread Daniel Siegmann
e an array of maps with values as keys and frequency as > values. > > Essentially I want something like zipPartitions but for arbitrarily many > RDD's, is there any such functionality or how would I approach this problem? > > Cheers, > > Johan > -- Daniel

Re: mappartitions data size

2014-09-26 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Filter function problem

2014-09-09 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH A

Re: Where to save intermediate results?

2014-09-02 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann

Re: Q on downloading spark for standalone cluster

2014-08-28 Thread Daniel Siegmann
or Hadoop is needed or mandatory for using Spark? that's not the > understanding I've. My understanding is that you can use spark with Hadoop > if you like from yarn2 but you could use spark standalone also without > hadoop. > > Please assist. I'm confused ! > > -Sanjeev > > > -

Re: Development environment issues

2014-08-25 Thread Daniel Siegmann
? > sbt or maven? > eclipse or idea? > jdk7 or 8? > I'm using Java 7 and Scala 2.10.x (not every framework I use supports later versions). SBT because I use the Play Framework, but I miss Maven. I haven't tried IntelliJ's Scala support, but it's probably worth a shot.

Re: heterogeneous cluster hardware

2014-08-21 Thread Daniel Siegmann

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
urrent tasks you can execute at one time. If you want more parallelism, > I think you just need more cores in your cluster--that is, bigger nodes, or > more nodes. > > Daniel, > > Have you been able to get around this limit? > > Nick > > > > On Fri, Aug 1, 2014 at 11

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
; > ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3 -t > m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2 > *my-cluster* > > > ------ > *From:* Daniel Siegmann > *To:* Darin McBeath > *Cc:* Daniel Siegm

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Daniel Siegmann
n what the > documentation states). What would I want that value to be based on my > configuration below? Or, would I leave that alone? > > -- > *From:* Daniel Siegmann > *To:* user@spark.apache.org; Darin McBeath > *Sent:* Wednesday, July 30, 2014 5

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Daniel Siegmann
'filter' and the default is the total number of cores available. > > I'm fairly new with Spark so maybe I'm just missing or misunderstanding > something fundamental. Any help would be appreciated. > > Thanks. > > Darin. > > -- Daniel Siegmann, Software Devel

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-25 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Daniel Siegmann
(x,y) => x+y).collect >> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), >> (P(bob),1), (P(abe),1), (P(charly),1)) >> >> In contrast to the expected behavior, that should be equivalent to: >> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
he *only* thing you run on the cluster, you could also > configure the Workers to only report one core by manually launching the > spark.deploy.worker.Worker process with that flag (see > http://spark.apache.org/docs/latest/spark-standalone.html). > > Matei > > On Jul 14, 2014,

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
e(# nodes) seems to just allocate > one task per core, and so runs out of memory on the node. Is there any way > to give the scheduler a hint that the task uses lots of memory and cores so > it spreads it out more evenly? > > Thanks, > > Ravi Pandya > Microsoft Research >

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Daniel Siegmann
>> Thanks, >> Rahul Kumar Bhojwani >> 3rd year, B.Tech >> Computer Science Engineering >> National Institute Of Technology, Karnataka >> 9945197359 >> > > > > -- > Rahul K Bhojwani > 3rd Year B.Tech > Computer Science and Engineer

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Daniel Siegmann
From the data > injector and "Streaming" tab of web ui, it's running well. > > However, I see quite a lot of Active stages in web ui even some of them > have all of their tasks completed. > > I attach a screenshot for your reference. > > Do you ever see this k
