-08:00 Ted Yu :
> bq. that solved some problems
>
> Is there any problem that was not solved by the tweak ?
>
> Thanks
>
> On Thu, Mar 3, 2016 at 4:11 PM, Eugen Cepoi wrote:
>
>> You can limit the amount of memory spark will use for shuffle even in 1.6.
>
You can limit the amount of memory spark will use for shuffle even in 1.6.
You can do that by tweaking spark.memory.fraction and
spark.memory.storageFraction. For example if you want to have no shuffle cache at
all you can set the storage fraction to 1 or something close to it, to leave only a
small place for
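For example, something like this when building the conf (a rough sketch for the 1.6 unified memory manager; the values here are just placeholders):

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.fraction", "0.6")          // share of the heap used for execution + storage
  .set("spark.memory.storageFraction", "0.9")   // portion of the above reserved for storage/cache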
I had similar problems with multi part uploads. In my case the real error
was something else which was being masked by this issue
https://issues.apache.org/jira/browse/SPARK-6560. In the end this bad
digest exception was a side effect and not the original issue. For me it
was some library version c
Do you have a large number of tasks? This can happen if you have a large
number of tasks and a small driver, or if you use accumulators of list-like
data structures.
2015-12-11 11:17 GMT-08:00 Zhan Zhang :
> I think you are fetching too many results to the driver. Typically, it is
> not recommende
> to
> estimate the importance of each feature.
>
> 2015-10-28 18:29 GMT+08:00 Eugen Cepoi :
>
>> Hey,
>>
>> Is there some kind of "explain" feature implemented in mllib for the
>> algorithms based on tree ensembles?
>> Some method to which you
Hey,
Is there some kind of "explain" feature implemented in MLlib for the
algorithms based on tree ensembles?
Some method to which you would feed a single feature vector and it would
return/print what features contributed to the decision or how much each
feature contributed "negatively" and "positively"
ports are accessible within the cluster.
>
> Thanks
> Best Regards
>
> On Thu, Oct 22, 2015 at 8:53 PM, Eugen Cepoi
> wrote:
>
>> Huh indeed this worked, thanks. Do you know why this happens, is that
>> some known issue?
>>
>> Thanks,
>> Eugen
t 19, 2015 at 6:21 PM, Eugen Cepoi
> wrote:
>
>> Hi,
>>
>> I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN.
>> The job is reading data from Kinesis and the batch size is of 30s (I used
>> the same value for the kinesis checkpointing).
Hi,
I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN.
The job is reading data from Kinesis and the batch size is 30s (I used
the same value for the kinesis checkpointing).
In the executor logs I can see every 5 seconds a sequence of stacktraces
indicating that the block replication
this is
the issue, need to find a way to confirm that now...
2015-10-15 16:12 GMT+07:00 Eugen Cepoi :
> Hey,
>
> A quick update on other things that have been tested.
>
> When looking at the compiled code of the spark-streaming-kinesis-asl jar
> everything looks normal (the
Hey,
A quick update on other things that have been tested.
When looking at the compiled code of the spark-streaming-kinesis-asl jar
everything looks normal (there is a class that implements SyncMap and it is
used inside the receiver).
Starting a spark shell and using introspection to instantiate
>
>
> --
> Alexandre Rodrigues
>
> On Thu, Jul 2, 2015 at 5:37 PM, Eugen Cepoi wrote:
>
>>
>>
>> *"The thing is that foreach forces materialization of the RDD and it
>> seems to be executed on the driver program"*
>> What makes you
*"The thing is that foreach forces materialization of the RDD and it seems
to be executed on the driver program"*
What makes you think that? No, foreach runs on the executors
(distributed), not on the driver.
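A quick way to see it from the shell (a small sketch):

val rdd = sc.parallelize(1 to 10)       // any RDD, just for illustration
rdd.foreach(x => println(x))            // runs on the executors, output lands in the executor logs
rdd.collect().foreach(x => println(x))  // collect() first brings the data to the driver, then prints there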
2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues <
alex.jose.rodrig...@gmail.com>:
>
Are you using YARN?
If yes, increase the YARN memory overhead option. YARN is probably killing
your executors.
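Something like this (a sketch; the value is just an example, in MB):

val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra off-heap headroom per executor, in MB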
On Jun 26, 2015 20:43, "XianXing Zhang" wrote:
> Do we have any update on this thread? Has anyone met and solved similar
> problems before?
>
> Any pointers will be greatly appreciated
You can comma separate them or use globbing patterns
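For example, from the shell (a sketch, the paths are made up):

val a = sc.textFile("/data/2015/06/25,/data/2015/06/26")  // comma separated list (needs 1.4+, see below)
val b = sc.textFile("/data/2015/06/*/part-*")             // globbing pattern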
2015-06-26 18:54 GMT+02:00 Ted Yu :
> See this related thread:
> http://search-hadoop.com/m/q3RTtiYm8wgHego1
>
> On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain wrote:
>
>>
>> Hi,
>> How do we read files from multiple directories using newApiHa
Comma separated paths work only with spark 1.4 and up
2015-06-26 18:56 GMT+02:00 Eugen Cepoi :
> You can comma separate them or use globbing patterns
>
> 2015-06-26 18:54 GMT+02:00 Ted Yu :
>
>> See this related thread:
>> http://search-hadoop.com/m/q3RTtiYm8wgHego1
that the threads are being started at the beginning and will last until the
end of the JVM.
2015-06-18 15:32 GMT+02:00 Eugen Cepoi :
>
>
> 2015-06-18 15:17 GMT+02:00 Guillaume Pitel :
>
>> I was thinking exactly the same. I'm going to try it, It doesn't really
>> m
2015-06-18 15:17 GMT+02:00 Guillaume Pitel :
> I was thinking exactly the same. I'm going to try it, It doesn't really
> matter if I lose an executor, since its sketch will be lost, but then
> reexecuted somewhere else.
>
>
I mean that between the action that will update the sketches and the acti
Yeah, that's the problem. There is probably some "perfect" number of partitions
that provides the best balance between partition size, memory and merge
overhead. Though it's not an ideal solution :(
There could be another way, but it is very hacky... for example if you store one
sketch in a singleton per j
Hey,
I am not 100% sure, but from my understanding accumulators are per partition
(so per task, as it's the same) and are sent back to the driver with the task
result and merged. When a task needs to be run n times (multiple rdds
depend on this one, some partition loss later in the chain, etc.) then th
It looks like it is a wrapper around
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.
https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html
2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda :
>
Cache is more general. ReduceByKey involves a shuffle step where the data
will be in memory and on disk (for what doesn't fit in memory). The
shuffle files will remain around until the end of the job. The blocks in
memory will be dropped if memory is needed for other things. This is an
optimisation
Or launch the spark-shell with --conf spark.kryo.registrator=foo.bar.MyClass
2015-06-11 14:30 GMT+02:00 Igor Berman :
> Another option would be to close sc and open new context with your custom
> configuration
> On Jun 11, 2015 01:17, "bhomass" wrote:
>
>> you need to register using spark-defaul
Hi
2015-06-04 15:29 GMT+02:00 James Aley :
> Hi,
>
> We have a load of Avro data coming into our data systems in the form of
> relatively small files, which we're merging into larger Parquet files with
> Spark. I've been following the docs and the approach I'm taking seemed
> fairly obvious, and
/B/C/D/D/2015/05/22/out-r-*.avro")
>
> }
>
>
> This is my method, can you show me where I should modify it to use
> FileInputFormat? If you add the path there, what should you pass while
> invoking newAPIHadoopFile
>
> On Wed, May 27, 2015 at 2:20 PM, Eugen Cepoi
> wrote:
>
You can do that using FileInputFormat.addInputPath
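Roughly like this (a sketch using the plain text input format so it stays self contained; with your Avro data you would swap in the Avro input format and key/value classes, and the paths are made up):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = Job.getInstance()
FileInputFormat.addInputPath(job, new Path("/A/B/C/2015/05/22"))
FileInputFormat.addInputPath(job, new Path("/A/B/C/2015/05/23"))

val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])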
2015-05-27 10:41 GMT+02:00 ayan guha :
> What about /blah/*/blah/out*.avro?
> On 27 May 2015 18:08, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote:
>
>> I am doing that now.
>> Is there no other way ?
>>
>> On Wed, May 27, 2015 at 12:40 PM, Akhil Das
>> wrote:
>>
>>> H
Yes that's it. If a partition is lost, to recompute it, some steps will
need to be re-executed. Perhaps the map function in which you update the
accumulator.
I think you can do it more safely in a transformation near the action,
where it is less likely that an error will occur (not always true...)
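Something in this spirit (a sketch; input, parse and the output path are made up, and this is the 1.x accumulator API):

val processed = sc.accumulator(0L)                        // driver-side accumulator
val prepared  = input.map(parse)                          // upstream work that may get recomputed
val counted   = prepared.map { x => processed += 1L; x }  // update as late as possible, next to the action
counted.saveAsTextFile("/tmp/out")                        // the action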
using a plain TextOutputFormat, the multi
part upload works, this confirms that the lzo compression is probably the
problem... but it is not a solution :(
2015-04-13 18:46 GMT+02:00 Eugen Cepoi :
> Hi,
>
> I am not sure my problem is relevant to spark, but perhaps someone else
>
Hi,
I am not sure my problem is relevant to spark, but perhaps someone else had
the same error. When I try to write files that need multipart upload to S3
from a job on EMR I always get this error:
com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you
specified did not match what
ble to work
> around by forcefully committing one of the rdds right before the union
> into cache, and forcing that by executing take(1). Nothing else ever
> helped.
>
> Seems like yet-undiscovered 1.2.x thing.
>
> On Tue, Mar 17, 2015 at 4:21 PM, Eugen Cepoi
> wrote:
2015-03-13 19:18 GMT+01:00 Eugen Cepoi :
> Hum increased it to 1024 but doesn't help still have the same problem :(
>
> 2015-03-13 18:28 GMT+01:00 Eugen Cepoi :
>
>> The one by default 0.07 of executor memory. I'll try increasing it and
>> post back the result.
>
Hmm, I increased it to 1024 but it doesn't help, I still have the same problem :(
2015-03-13 18:28 GMT+01:00 Eugen Cepoi :
> The one by default 0.07 of executor memory. I'll try increasing it and
> post back the result.
>
> Thanks
>
> 2015-03-13 18:09 GMT+01:00 Ted Yu :
>
The default one, 0.07 of the executor memory. I'll try increasing it and post
back the result.
Thanks
2015-03-13 18:09 GMT+01:00 Ted Yu :
> Might be related: what's the value for spark.yarn.executor.memoryOverhead ?
>
> See SPARK-6085
>
> Cheers
>
> On Fri, Mar 1
Hi,
I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. The strange
thing is that the exact same code does work (after the upgrade) in the spark-shell.
But this information might be misleading as it works with 1.1.1...
*The job takes as input two data sets:*
- rdd A of +170gb (with less it is h
Yes, you can submit multiple actions from different threads to the same
SparkContext. It is safe.
Indeed, what you want to achieve is quite common: exposing some operations
over a SparkContext through HTTP.
I have used spray for this and it just worked fine.
At the bootstrap of your web app, start a spark
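A bare-bones sketch of the idea, without the spray plumbing (the RDDs here are toy placeholders for whatever you build and cache at startup):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val rddA = sc.parallelize(1 to 1000000).cache()        // built and cached at bootstrap
val rddB = sc.parallelize(Seq("a", "b", "c")).cache()

// each HTTP request can be translated into an action, run from its own thread/future
val countF  = Future { rddA.count() }
val sampleF = Future { rddB.take(2) }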
Hi,
You can achieve it by running a spray service for example that has access
to the RDD in question. When starting the app you first build your RDD and
cache it. In your spray "endpoints" you will translate the HTTP requests to
operations on that RDD.
2014-08-17 17:27 GMT+02:00 Zhanfeng Huo :
Do you have a list/array in your avro record? If yes, this could cause the
problem. I experienced this kind of problem and solved it by providing a
custom kryo ser/de for avro lists. Also be careful: spark reuses records,
so if you just read them and don't copy/transform them you would end up
with the
Yeah, I agree with Koert, it would be the lightest solution. I have
used it quite successfully and it just works.
There is not much that is spark specific here, you can follow this example
https://github.com/jacobus/s4 on how to build your spray service.
Then the easy solution would be to have a SparkCont
me a little more about "ADD_JARS". In order to ensure
> my spark_shell has all required jars, I added the jars to the "$CLASSPATH"
> in the compute_classpath.sh script. is there another way of doing it?
>
> Shivani
>
>
> On Fri, Jun 20, 2014 at 9:47 AM, Eugen C
2014-06-20 17:15 GMT+02:00 Shivani Rao :
> Hello Abhi, I did try that and it did not work
>
> And Eugene, Yes I am assembling the argonaut libraries in the fat jar. So
> how did you overcome this problem?
>
> Shivani
>
>
> On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi
> w
On Jun 20, 2014 01:46, "Shivani Rao" wrote:
>
> Hello Andrew,
>
> i wish I could share the code, but for proprietary reasons I can't. But I
can give some idea though of what I am trying to do. The job reads a file
and processes each line of that file. I am not doing
anythin
t. If you opened a JIRA for
> that I'm sure someone would pick it up.
>
> On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi wrote:
> > Is it on purpose that when setting SPARK_CONF_DIR spark submit still
> loads
> > the properties file from SPARK_HOME/conf/spark-defauls.conf
Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads
the properties file from SPARK_HOME/conf/spark-defaults.conf ?
IMO it would be more natural to override what is defined in SPARK_HOME/conf
by SPARK_CONF_DIR when it is defined (and SPARK_CONF_DIR being overridden by
command line arguments)
2014-05-19 10:35 GMT+02:00 Laurent T :
> Hi Eugen,
>
> Thanks for your help. I'm not familiar with the shaded plugin and i was
> wondering: does it replace the assembly plugin ?
Nope it doesn't replace it. It allows you to make "fat jars" and other nice
things such as relocating classes to some
Hi,
I have some strange behaviour when using textFile to read some data from
HDFS in spark 0.9.1.
I get UnknownHost exceptions, where hadoop client tries to resolve the
dfs.nameservices and fails.
So far:
- this has been tested inside the shell
- the exact same code works with spark-0.8.1
- t
Laurent, the problem is that the reference.conf that is embedded in the akka
jars is being overridden by some other conf. This happens when multiple
files have the same name.
I am using Spark with maven. In order to build the fat jar I use the shade
plugin and it works pretty well. The trick here is to u
HADOOP_CONF_DIR is not shared with the workers when set
only on the driver (it was not defined in spark-env)?
Also wouldn't it be more natural to create the conf on driver side and then
share it with the workers?
2014-05-09 10:51 GMT+02:00 Eugen Cepoi :
> Hi,
>
> I have some strange
I have a similar issue (but with spark 0.9.1) when a shell is active.
Multiple jobs run fine, but when the shell is active (even if at the moment
it is not using any CPU) I encounter the exact same behaviour.
At the moment I don't know what happens or how to solve it, but I was
planning to have a lo
Depending on the size of the rdd, you could also do a collect + broadcast and
then compute the product in a map function over the other rdd. If this is
the same rdd you might also want to cache it. This pattern worked quite
well for me.
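Roughly like this (a sketch with toy data):

val small = sc.parallelize(Seq(1, 2, 3))
val big   = sc.parallelize(1 to 1000000)

val bc = sc.broadcast(small.collect())                     // ship the small side to every executor once
val product = big.flatMap(x => bc.value.map(y => (x, y)))  // the product is computed map side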
On Apr 25, 2014 18:33, "Alex Boisvert" wrote:
> You might wan
It depends, personally I have the opposite opinion.
IMO expressing pipelines in a functional language feels natural, you just
have to get used to the language (scala).
Testing spark jobs is easy, whereas testing a Pig script is much harder and
less natural.
If you want a more high-level language t
I had a similar need; the solution I used (a rough sketch follows below) is:
- Define a base implementation of KryoRegistrator (that will register all
common classes/custom ser/deser)
- make the registerClasses method final, so subclasses don't override it
- Define another method that would be overridden by subclasses that need to
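A rough sketch of that layout (the class names and the registered classes are placeholders):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class MyRecord(id: Long, name: String)  // placeholder for a job-specific class

abstract class BaseKryoRegistrator extends KryoRegistrator {
  // common classes registered once for every job; subclasses cannot override this
  final override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Byte]])
    kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
    registerJobClasses(kryo)
  }
  // subclasses override this one instead
  protected def registerJobClasses(kryo: Kryo): Unit = {}
}

class MyJobRegistrator extends BaseKryoRegistrator {
  override protected def registerJobClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}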
/201312.mbox/%3CCAPud8Tq7fK5j2Up9dDdRQ=y1efwidjnmqc55o9jm5dh7rpd...@mail.gmail.com%3E
> .
>
>
> On Fri, Apr 18, 2014 at 10:31 AM, Eugen Cepoi wrote:
>
>> Because it happens to reference something outside the closures scope that
>> will reference some other objects (that you don
014-04-17 23:28 GMT+02:00 Flavio Pompermaier :
> Thanks again Eugen! I don't get the point... why do you prefer to avoid kryo
> ser for closures? Is there any problem with that?
> On Apr 17, 2014 11:10 PM, "Eugen Cepoi" wrote:
>
>> You have two kind of ser : data and
Functions. Am I wrong or
> this is a limit of Spark?
> On Apr 15, 2014 1:36 PM, "Flavio Pompermaier"
> wrote:
>
>> Ok thanks for the help!
>>
>> Best,
>> Flavio
>>
>>
>> On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi wrote:
>>
>>
it reduce the data at the granularity of point rather than the
> partition results (which is the collection of points). So is there a way to
> reduce the data at the granularity of partitions?
>
> Thanks,
>
> Yanzhe
>
> On Wednesday, April 16, 2014 at 2:24 AM, Eugen Cepoi wrote:
It depends on your algorithm but I guess that you probably should use
reduce (the code probably doesn't compile but it shows the idea).
val result = data.reduce { case (left, right) =>
  skyline(left ++ right)
}
Or in the case you want to merge the result of a partition with another one
you c
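And a sketch of the partition-level variant (Point and the skyline function here are simplified placeholders, just to make it runnable):

case class Point(x: Double, y: Double)

// toy skyline: keep the points that are not dominated by any other point
def skyline(points: Seq[Point]): Seq[Point] =
  points.filterNot(p => points.exists(q => q.x <= p.x && q.y <= p.y && (q.x < p.x || q.y < p.y)))

val data = sc.parallelize(Seq(Point(1, 3), Point(2, 2), Point(3, 1), Point(2, 3)))

// compute a local skyline per partition, then merge the partial results
val partial = data.mapPartitions(it => Iterator(skyline(it.toSeq)))
val result  = partial.reduce((left, right) => skyline(left ++ right))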
l be
transferred (collect, shuffle, maybe persist to disk - but I am not sure about
this one).
2014-04-15 0:34 GMT+02:00 Flavio Pompermaier :
> Ok, that's fair enough. But why do things work up to the collect? During map
> and filter, objects are not serialized?
> On Apr 15, 2014 1
permaier :
> Thanks Eugen for the reply. Could you explain to me why I have the
> problem? Why doesn't my serialization work?
> On Apr 14, 2014 6:40 PM, "Eugen Cepoi" wrote:
>
>> Hi,
>>
>> as a easy workaround you can enable Kryo serialization
>> http:
Hi,
as an easy workaround you can enable Kryo serialization
http://spark.apache.org/docs/latest/configuration.html
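For example (a small sketch; the same properties can also be passed with --conf, and the registrator class name is made up):

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "foo.bar.MyRegistrator")  // optional, a custom registrator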
Eugen
2014-04-14 18:21 GMT+02:00 Flavio Pompermaier :
> Hi to all,
>
> in my application I read objects that are not serializable because I
> cannot modify the sources.
> So I trie
Yes, it is doing it twice; try to cache the initial RDD.
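Something like this, based on the signature in your snippet (a sketch; hbaseRdd stands for the RDD you read from HBase):

val fields = readFields(hbaseRdd).cache()          // hbaseRdd: RDD[(ImmutableBytesWritable, Result)]
val total     = fields.count()                     // first action materializes and caches the data
val withField = fields.filter(_.nonEmpty).count()  // second action reuses the cached blocks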
2014-02-25 8:14 GMT+01:00 Soumitra Kumar :
> I have a code which reads an HBase table, and counts number of rows
> containing a field.
>
> def readFields(rdd : RDD[(ImmutableBytesWritable, Result)]) :
> RDD[List[Array[Byte]]] = {
>