-08:00 Ted Yu :
> bq. that solved some problems
>
> Is there any problem that was not solved by the tweak ?
>
> Thanks
>
> On Thu, Mar 3, 2016 at 4:11 PM, Eugen Cepoi wrote:
>
>> You can limit the amount of memory spark will use for shuffle even in 1.6.
>
You can limit the amount of memory spark will use for shuffle even in 1.6.
You can do that by tweaking spark.memory.fraction and
spark.memory.storageFraction. For example if you want to have no shuffle cache at
all you can set the storage fraction to 1 or something close to it, to leave only a
small place for
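For example, something like this when building the conf (a rough sketch for the 1.6 unified memory manager; the values here are just placeholders):

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.fraction", "0.6")          // share of the heap used for execution + storage
  .set("spark.memory.storageFraction", "0.9")   // portion of the above reserved for storage/cache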
I had similar problems with multi part uploads. In my case the real error
was something else which was being masked by this issue
https://issues.apache.org/jira/browse/SPARK-6560. In the end this bad
digest exception was a side effect and not the original issue. For me it
was some library version c
Do you have a large number of tasks? This can happen if you have a large
number of tasks and a small driver, or if you use accumulators of list-like
data structures.
2015-12-11 11:17 GMT-08:00 Zhan Zhang :
> I think you are fetching too many results to the driver. Typically, it is
> not recommende
> to
> estimate the importance of each feature.
>
> 2015-10-28 18:29 GMT+08:00 Eugen Cepoi :
>
>> Hey,
>>
>> Is there some kind of "explain" feature implemented in mllib for the
>> algorithms based on tree ensembles?
>> Some method to which you
Hey,
Is there some kind of "explain" feature implemented in MLlib for the
algorithms based on tree ensembles?
Some method to which you would feed a single feature vector and it would
return/print what features contributed to the decision or how much each
feature contributed "negatively" and "positively"
ports are accessible within the cluster.
>
> Thanks
> Best Regards
>
> On Thu, Oct 22, 2015 at 8:53 PM, Eugen Cepoi
> wrote:
>
>> Huh indeed this worked, thanks. Do you know why this happens, is that
>> some known issue?
>>
>> Thanks,
>> Eugen
t 19, 2015 at 6:21 PM, Eugen Cepoi
> wrote:
>
>> Hi,
>>
>> I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN.
>> The job is reading data from Kinesis and the batch size is of 30s (I used
>> the same value for the kinesis checkpointing).
Hi,
I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN.
The job is reading data from Kinesis and the batch size is 30s (I used
the same value for the kinesis checkpointing).
In the executor logs I can see every 5 seconds a sequence of stacktraces
indicating that the block replication
this is
the issue, need to find a way to confirm that now...
2015-10-15 16:12 GMT+07:00 Eugen Cepoi :
> Hey,
>
> A quick update on other things that have been tested.
>
> When looking at the compiled code of the spark-streaming-kinesis-asl jar
> everything looks normal (the
Hey,
A quick update on other things that have been tested.
When looking at the compiled code of the spark-streaming-kinesis-asl jar
everything looks normal (there is a class that implements SyncMap and it is
used inside the receiver).
Starting a spark shell and using introspection to instantiate
>
>
> --
> Alexandre Rodrigues
>
> On Thu, Jul 2, 2015 at 5:37 PM, Eugen Cepoi wrote:
>
>>
>>
>> *"The thing is that foreach forces materialization of the RDD and it
>> seems to be executed on the driver program"*
>> What makes you
*"The thing is that foreach forces materialization of the RDD and it seems
to be executed on the driver program"*
What makes you think that? No, foreach runs on the executors
(distributed), not on the driver.
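A quick way to see it from the shell (a small sketch):

val rdd = sc.parallelize(1 to 10)       // any RDD, just for illustration
rdd.foreach(x => println(x))            // runs on the executors, output lands in the executor logs
rdd.collect().foreach(x => println(x))  // collect() first brings the data to the driver, then prints there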
2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues <
alex.jose.rodrig...@gmail.com>:
>
Are you using YARN?
If yes, increase the YARN memory overhead option. YARN is probably killing
your executors.
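Something like this (a sketch; the value is just an example, in MB):

val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra off-heap headroom per executor, in MB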
On Jun 26, 2015 20:43, "XianXing Zhang" wrote:
> Do we have any update on this thread? Has anyone met and solved similar
> problems before?
>
> Any pointers will be greatly appreciated
You can comma separate them or use globbing patterns
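For example, from the shell (a sketch, the paths are made up):

val a = sc.textFile("/data/2015/06/25,/data/2015/06/26")  // comma separated list (needs 1.4+, see below)
val b = sc.textFile("/data/2015/06/*/part-*")             // globbing pattern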
2015-06-26 18:54 GMT+02:00 Ted Yu :
> See this related thread:
> http://search-hadoop.com/m/q3RTtiYm8wgHego1
>
> On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain wrote:
>
>>
>> Hi,
>> How do we read files from multiple directories using newApiHa
Comma separated paths work only with spark 1.4 and up
2015-06-26 18:56 GMT+02:00 Eugen Cepoi :
> You can comma separate them or use globbing patterns
>
> 2015-06-26 18:54 GMT+02:00 Ted Yu :
>
>> See this related thread:
>> http://search-hadoop.com/m/q3RTtiYm8wgHego1
that the threads are being started at the beginning and will last until the
end of the JVM.
2015-06-18 15:32 GMT+02:00 Eugen Cepoi :
>
>
> 2015-06-18 15:17 GMT+02:00 Guillaume Pitel :
>
>> I was thinking exactly the same. I'm going to try it, It doesn't really
>> m
2015-06-18 15:17 GMT+02:00 Guillaume Pitel :
> I was thinking exactly the same. I'm going to try it, It doesn't really
> matter if I lose an executor, since its sketch will be lost, but then
> reexecuted somewhere else.
>
>
I mean that between the action that will update the sketches and the acti
Yeah, that's the problem. There is probably some "perfect" number of partitions
that provides the best balance between partition size, memory and merge
overhead. Though it's not an ideal solution :(
There could be another way, but it is very hacky... for example if you store one
sketch in a singleton per j
Hey,
I am not 100% sure, but from my understanding accumulators are per partition
(so per task, as it's the same) and are sent back to the driver with the task
result and merged. When a task needs to be run n times (multiple rdds
depend on this one, some partition loss later in the chain, etc.) then th
It looks like it is a wrapper around
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.
https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html
2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda :
>
Cache is more general. ReduceByKey involves a shuffle step where the data
will be in memory and on disk (for what doesn't fit in memory). The
shuffle files will remain around until the end of the job. The blocks in
memory will be dropped if memory is needed for other things. This is an
optimisation
Or launch the spark-shell with --conf spark.kryo.registrator=foo.bar.MyClass
2015-06-11 14:30 GMT+02:00 Igor Berman :
> Another option would be to close sc and open new context with your custom
> configuration
> On Jun 11, 2015 01:17, "bhomass" wrote:
>
>> you need to register using spark-defaul
Hi
2015-06-04 15:29 GMT+02:00 James Aley :
> Hi,
>
> We have a load of Avro data coming into our data systems in the form of
> relatively small files, which we're merging into larger Parquet files with
> Spark. I've been following the docs and the approach I'm taking seemed
> fairly obvious, and
/B/C/D/D/2015/05/22/out-r-*.avro")
>
> }
>
>
> This is my method, can you show me where I should modify it to use
> FileInputFormat? If you add the path there, what should you pass while
> invoking newAPIHadoopFile
>
> On Wed, May 27, 2015 at 2:20 PM, Eugen Cepoi
> wrote:
>
You can do that using FileInputFormat.addInputPath
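Roughly like this (a sketch using the plain text input format so it stays self contained; with your Avro data you would swap in the Avro input format and key/value classes, and the paths are made up):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = Job.getInstance()
FileInputFormat.addInputPath(job, new Path("/A/B/C/2015/05/22"))
FileInputFormat.addInputPath(job, new Path("/A/B/C/2015/05/23"))

val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])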
2015-05-27 10:41 GMT+02:00 ayan guha :
> What about /blah/*/blah/out*.avro?
> On 27 May 2015 18:08, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote:
>
>> I am doing that now.
>> Is there no other way ?
>>
>> On Wed, May 27, 2015 at 12:40 PM, Akhil Das
>> wrote:
>>
>>> H
Yes that's it. If a partition is lost, to recompute it, some steps will
need to be re-executed. Perhaps the map function in which you update the
accumulator.
I think you can do it more safely in a transformation near the action,
where it is less likely that an error will occur (not always true...)
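Something in this spirit (a sketch; input, parse and the output path are made up, and this is the 1.x accumulator API):

val processed = sc.accumulator(0L)                        // driver-side accumulator
val prepared  = input.map(parse)                          // upstream work that may get recomputed
val counted   = prepared.map { x => processed += 1L; x }  // update as late as possible, next to the action
counted.saveAsTextFile("/tmp/out")                        // the action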
using a plain TextOutputFormat, the multi
part upload works, this confirms that the lzo compression is probably the
problem... but it is not a solution :(
2015-04-13 18:46 GMT+02:00 Eugen Cepoi :
> Hi,
>
> I am not sure my problem is relevant to spark, but perhaps someone else
>
Hi,
I am not sure my problem is relevant to spark, but perhaps someone else had
the same error. When I try to write files that need multipart upload to S3
from a job on EMR I always get this error:
com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you
specified did not match what
ble to work
> around by forcefully committing one of the rdds right before the union
> into cache, and forcing that by executing take(1). Nothing else ever
> helped.
>
> Seems like yet-undiscovered 1.2.x thing.
>
> On Tue, Mar 17, 2015 at 4:21 PM, Eugen Cepoi
> wrote:
2015-03-13 19:18 GMT+01:00 Eugen Cepoi :
> Hum increased it to 1024 but doesn't help still have the same problem :(
>
> 2015-03-13 18:28 GMT+01:00 Eugen Cepoi :
>
>> The one by default 0.07 of executor memory. I'll try increasing it and
>> post back the result.
>
Hmm, I increased it to 1024 but it doesn't help, I still have the same problem :(
2015-03-13 18:28 GMT+01:00 Eugen Cepoi :
> The one by default 0.07 of executor memory. I'll try increasing it and
> post back the result.
>
> Thanks
>
> 2015-03-13 18:09 GMT+01:00 Ted Yu :
>
The default one, 0.07 of the executor memory. I'll try increasing it and post
back the result.
Thanks
2015-03-13 18:09 GMT+01:00 Ted Yu :
> Might be related: what's the value for spark.yarn.executor.memoryOverhead ?
>
> See SPARK-6085
>
> Cheers
>
> On Fri, Mar 1
Hi,
I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. The strange
thing is that the exact same code does work (after the upgrade) in the spark-shell.
But this information might be misleading as it works with 1.1.1...
*The job takes as input two data sets:*
- rdd A of +170gb (with less it is h
Yes, you can submit multiple actions from different threads to the same
SparkContext. It is safe.
Indeed, what you want to achieve is quite common: exposing some operations
over a SparkContext through HTTP.
I have used spray for this and it just worked fine.
At the bootstrap of your web app, start a spark
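A bare-bones sketch of the idea, without the spray plumbing (the RDDs here are toy placeholders for whatever you build and cache at startup):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val rddA = sc.parallelize(1 to 1000000).cache()        // built and cached at bootstrap
val rddB = sc.parallelize(Seq("a", "b", "c")).cache()

// each HTTP request can be translated into an action, run from its own thread/future
val countF  = Future { rddA.count() }
val sampleF = Future { rddB.take(2) }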
Hi,
You can achieve it by running a spray service for example that has access
to the RDD in question. When starting the app you first build your RDD and
cache it. In your spray "endpoints" you will translate the HTTP requests to
operations on that RDD.
2014-08-17 17:27 GMT+02:00 Zhanfeng Huo :
Do you have a list/array in your avro record? If yes, this could cause the
problem. I experienced this kind of problem and solved it by providing a
custom kryo ser/de for avro lists. Also be careful: spark reuses records,
so if you just read them and don't copy/transform them you would end up
with the
Yeah, I agree with Koert, it would be the lightest solution. I have
used it quite successfully and it just works.
There is not much that is spark specific here, you can follow this example
https://github.com/jacobus/s4 on how to build your spray service.
Then the easy solution would be to have a SparkCont
me a little more about "ADD_JARS". In order to ensure
> my spark_shell has all required jars, I added the jars to the "$CLASSPATH"
> in the compute_classpath.sh script. is there another way of doing it?
>
> Shivani
>
>
> On Fri, Jun 20, 2014 at 9:47 AM, Eugen C
2014-06-20 17:15 GMT+02:00 Shivani Rao :
> Hello Abhi, I did try that and it did not work
>
> And Eugene, Yes I am assembling the argonaut libraries in the fat jar. So
> how did you overcome this problem?
>
> Shivani
>
>
> On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi
> w
On Jun 20, 2014 01:46, "Shivani Rao" wrote:
>
> Hello Andrew,
>
> i wish I could share the code, but for proprietary reasons I can't. But I
can give some idea though of what I am trying to do. The job reads a file
and processes each line of that file. I am not doing
anythin
t. If you opened a JIRA for
> that I'm sure someone would pick it up.
>
> On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi wrote:
> > Is it on purpose that when setting SPARK_CONF_DIR spark submit still
> loads
> > the properties file from SPARK_HOME/conf/spark-defauls.conf
Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads
the properties file from SPARK_HOME/conf/spark-defaults.conf ?
IMO it would be more natural to override what is defined in SPARK_HOME/conf
by SPARK_CONF_DIR when it is defined (and SPARK_CONF_DIR being overridden by
command line arguments)
2014-05-19 10:35 GMT+02:00 Laurent T :
> Hi Eugen,
>
> Thanks for your help. I'm not familiar with the shaded plugin and i was
> wondering: does it replace the assembly plugin ?
Nope it doesn't replace it. It allows you to make "fat jars" and other nice
things such as relocating classes to some
Hi,
I have some strange behaviour when using textFile to read some data from
HDFS in spark 0.9.1.
I get UnknownHost exceptions, where hadoop client tries to resolve the
dfs.nameservices and fails.
So far:
- this has been tested inside the shell
- the exact same code works with spark-0.8.1
- t
Laurent, the problem is that the reference.conf that is embedded in the akka
jars is being overridden by some other conf. This happens when multiple
files have the same name.
I am using Spark with maven. In order to build the fat jar I use the shade
plugin and it works pretty well. The trick here is to u
HADOOP_CONF_DIR is not shared with the workers when set
only on the driver (it was not defined in spark-env)?
Also wouldn't it be more natural to create the conf on driver side and then
share it with the workers?
2014-05-09 10:51 GMT+02:00 Eugen Cepoi :
> Hi,
>
> I have some strange
I have a similar issue (but with spark 0.9.1) when a shell is active.
Multiple jobs run fine, but when the shell is active (even if at the moment
it is not using any CPU) I encounter the exact same behaviour.
At the moment I don't know what happens or how to solve it, but I was
planning to have a lo
Depending on the size of the rdd, you could also do a collect + broadcast and
then compute the product in a map function over the other rdd. If this is
the same rdd you might also want to cache it. This pattern worked quite
well for me.
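Roughly like this (a sketch with toy data):

val small = sc.parallelize(Seq(1, 2, 3))
val big   = sc.parallelize(1 to 1000000)

val bc = sc.broadcast(small.collect())                     // ship the small side to every executor once
val product = big.flatMap(x => bc.value.map(y => (x, y)))  // the product is computed map side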
On Apr 25, 2014 18:33, "Alex Boisvert" wrote:
> You might wan
It depends, personally I have the opposite opinion.
IMO expressing pipelines in a functional language feels natural, you just
have to get used to the language (scala).
Testing spark jobs is easy, whereas testing a Pig script is much harder and
less natural.
If you want a more high-level language t
I had a similar need; the solution I used (a rough sketch follows below) is:
- Define a base implementation of KryoRegistrator (that will register all
common classes/custom ser/deser)
- make the registerClasses method final, so subclasses don't override it
- Define another method that would be overridden by subclasses that need to
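A rough sketch of that layout (the class names and the registered classes are placeholders):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class MyRecord(id: Long, name: String)  // placeholder for a job-specific class

abstract class BaseKryoRegistrator extends KryoRegistrator {
  // common classes registered once for every job; subclasses cannot override this
  final override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Byte]])
    kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
    registerJobClasses(kryo)
  }
  // subclasses override this one instead
  protected def registerJobClasses(kryo: Kryo): Unit = {}
}

class MyJobRegistrator extends BaseKryoRegistrator {
  override protected def registerJobClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}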
/201312.mbox/%3CCAPud8Tq7fK5j2Up9dDdRQ=y1efwidjnmqc55o9jm5dh7rpd...@mail.gmail.com%3E
> .
>
>
> On Fri, Apr 18, 2014 at 10:31 AM, Eugen Cepoi wrote:
>
>> Because it happens to reference something outside the closures scope that
>> will reference some other objects (that you don
014-04-17 23:28 GMT+02:00 Flavio Pompermaier :
> Thanks again Eugen! I don't get the point... why do you prefer to avoid kryo
> ser for closures? Is there any problem with that?
> On Apr 17, 2014 11:10 PM, "Eugen Cepoi" wrote:
>
>> You have two kind of ser : data and
Functions. Am I wrong or
> this is a limit of Spark?
> On Apr 15, 2014 1:36 PM, "Flavio Pompermaier"
> wrote:
>
>> Ok thanks for the help!
>>
>> Best,
>> Flavio
>>
>>
>> On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi wrote:
>>
>>
it reduce the data at the granularity of point rather than the
> partition results (which is the collection of points). So is there a way to
> reduce the data at the granularity of partitions?
>
> Thanks,
>
> Yanzhe
>
> On Wednesday, April 16, 2014 at 2:24 AM, Eugen Cepoi wrote:
It depends on your algorithm but I guess that you probably should use
reduce (the code probably doesn't compile but it shows the idea).
val result = data.reduce { case (left, right) =>
  skyline(left ++ right)
}
Or in the case you want to merge the result of a partition with another one
you c
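And a sketch of the partition-level variant (Point and the skyline function here are simplified placeholders, just to make it runnable):

case class Point(x: Double, y: Double)

// toy skyline: keep the points that are not dominated by any other point
def skyline(points: Seq[Point]): Seq[Point] =
  points.filterNot(p => points.exists(q => q.x <= p.x && q.y <= p.y && (q.x < p.x || q.y < p.y)))

val data = sc.parallelize(Seq(Point(1, 3), Point(2, 2), Point(3, 1), Point(2, 3)))

// compute a local skyline per partition, then merge the partial results
val partial = data.mapPartitions(it => Iterator(skyline(it.toSeq)))
val result  = partial.reduce((left, right) => skyline(left ++ right))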
l be
transferred (collect, shuffle, maybe persist to disk - but I am not sure about
this one).
2014-04-15 0:34 GMT+02:00 Flavio Pompermaier :
> Ok, that's fair enough. But why do things work up to the collect? During map
> and filter, objects are not serialized?
> On Apr 15, 2014 1
permaier :
> Thanks Eugen for the reply. Could you explain to me why I have the
> problem? Why doesn't my serialization work?
> On Apr 14, 2014 6:40 PM, "Eugen Cepoi" wrote:
>
>> Hi,
>>
>> as a easy workaround you can enable Kryo serialization
>> http:
Hi,
as an easy workaround you can enable Kryo serialization
http://spark.apache.org/docs/latest/configuration.html
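For example (a small sketch; the same properties can also be passed with --conf, and the registrator class name is made up):

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "foo.bar.MyRegistrator")  // optional, a custom registrator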
Eugen
2014-04-14 18:21 GMT+02:00 Flavio Pompermaier :
> Hi to all,
>
> in my application I read objects that are not serializable because I
> cannot modify the sources.
> So I trie
Yes, it is doing it twice; try to cache the initial RDD.
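Something like this, based on the signature in your snippet (a sketch; hbaseRdd stands for the RDD you read from HBase):

val fields = readFields(hbaseRdd).cache()          // hbaseRdd: RDD[(ImmutableBytesWritable, Result)]
val total     = fields.count()                     // first action materializes and caches the data
val withField = fields.filter(_.nonEmpty).count()  // second action reuses the cached blocks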
2014-02-25 8:14 GMT+01:00 Soumitra Kumar :
> I have a code which reads an HBase table, and counts number of rows
> containing a field.
>
> def readFields(rdd : RDD[(ImmutableBytesWritable, Result)]) :
> RDD[List[Array[Byte]]] = {
>