heap fragmentation in G1

2024-07-12 Thread aka.fe2s
Hi, we occasionally encounter OutOfMemoryErrors when running Spark 3.1 with Java 17, the G1 garbage collector (region size = 32MB), and a 200GB heap. The OOM happens in ShuffleExternalSorter when it attempts to allocate a 1GB array for the pointer array, despite having about 80GB of heap ava
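
The arithmetic behind this kind of failure can be sketched as follows. This is only an illustration in plain Python, not JVM code: G1 places an object larger than half a region (a "humongous" object) in a run of contiguous regions, so a 1GB array with 32MB regions needs about 32 adjacent free regions. The free-region layout below is hypothetical, chosen so that 80GB is free in total, and the array is assumed to be exactly 1GB with its header ignored. Whether fragmentation is the actual cause in the reported case is not established by the post.

```python
MB = 1024 * 1024
GB = 1024 * MB

def regions_needed(object_bytes, region_bytes):
    """Contiguous regions a humongous object occupies (ceiling division)."""
    return -(-object_bytes // region_bytes)

def longest_free_run(free):
    """Length of the longest run of adjacent free regions."""
    best = run = 0
    for is_free in free:
        run = run + 1 if is_free else 0
        best = max(best, run)
    return best

region = 32 * MB                        # region size from the post
heap_regions = 200 * GB // region       # 200GB heap from the post
need = regions_needed(1 * GB, region)   # assumption: array is exactly 1GB

# Hypothetical layout: in every block of 5 regions, 2 are free and 3 are used.
free = [i % 5 < 2 for i in range(heap_regions)]
free_bytes = sum(free) * region

# Plenty of free heap in total, but no run long enough for the array.
can_allocate = longest_free_run(free) >= need
print(heap_regions, need, free_bytes // GB, longest_free_run(free), can_allocate)
```

In this toy layout the heap has 80GB free, yet the longest contiguous run is 2 regions (64MB), far short of what the array needs.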

static dataframe to streaming

2019-11-05 Thread aka.fe2s
Hi all, what is the most efficient way of converting a static dataframe to a streaming one (structured streaming)? I have a custom sink implemented for structured streaming, and I would like to use it to write a static dataframe. I know that I can write a dataframe to files and then source them to a create

off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread aka.fe2s
Hi folks, what has happened with Tachyon / Alluxio in Spark 2? The docs no longer mention it. -- Oleksiy Dyagilev

Re: How to write data into CouchBase using Spark & Scala?

2016-09-07 Thread aka.fe2s
Most likely you are missing an import statement that enables some Scala implicits. I haven't used this connector, but it looks like you need "import com.couchbase.spark._" -- Oleksiy Dyagilev On Wed, Sep 7, 2016 at 9:42 AM, Devi P.V wrote: > I am a newbie in CouchBase. I am trying to write data into

Re: LabeledPoint creation

2016-09-07 Thread aka.fe2s
It has 4 categories: a = 1 0 0, b = 0 0 0, c = 0 1 0, d = 0 0 1 -- Oleksiy Dyagilev On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > Any help on above mail use case ? > > Regards, > Rajesh > > On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar
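
The encoding in this reply (four categories in three columns, with b as all zeros) can be reproduced with a small sketch. This is plain Python, not the Spark API used in the thread; `dummy_encode` and the choice of b as the reference category are assumptions made to match the vectors quoted above.

```python
def dummy_encode(value, categories, reference):
    """Encode value as 0/1 indicators, one per non-reference category.

    The reference category gets no column of its own, so it is
    represented by the all-zeros vector.
    """
    columns = [c for c in categories if c != reference]
    return [1 if value == c else 0 for c in columns]

categories = ["a", "b", "c", "d"]
codes = {c: dummy_encode(c, categories, reference="b") for c in categories}

for name, vector in codes.items():
    print(name, "=", " ".join(str(bit) for bit in vector))
```

Dropping one column this way avoids a redundant indicator, because the fourth category is implied when the other three are all zero.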

Re: ml and mllib persistence

2016-07-12 Thread aka.fe2s
Okay, I think I found an answer to my question. Some models (for instance org.apache.spark.mllib.recommendation.MatrixFactorizationModel) hold RDDs, so just serializing these objects will not work. -- Oleksiy Dyagilev On Tue, Jul 12, 2016 at 5:40 PM, aka.fe2s wrote: > What is the reason Sp

Re: location of a partition in the cluster/ how parallelize method distribute the RDD partitions over the cluster.

2016-07-12 Thread aka.fe2s
The local collection is distributed into the cluster when you run any action (http://spark.apache.org/docs/latest/programming-guide.html#actions) because RDDs are evaluated lazily. If you want to control how the collection is split into partitions, you can create your own RDD implementation and implement this

Re: RDD for loop vs foreach

2016-07-12 Thread aka.fe2s
Correct. It's desugared into rdd.foreach() by the Scala compiler. -- Oleksiy Dyagilev On Tue, Jul 12, 2016 at 6:58 PM, philipghu wrote: > Hi, > > I'm new to Spark and Scala as well. I understand that we can use foreach to > apply a function to each element of an RDD, like rdd.foreach > (x=>println(

ml and mllib persistence

2016-07-12 Thread aka.fe2s
What is the reason Spark has individual implementations of read/write routines for every model in mllib and ml (the Saveable and MLWritable trait impls)? Wouldn't a generic implementation via the Java serialization mechanism work? I would like to use it to store the models to a custom storage. -- Olek

Re: Reading Back a Cached RDD

2016-03-28 Thread aka.fe2s
Nick, what is your use case? On Thu, Mar 24, 2016 at 11:55 PM, Marco Colombo wrote: > You can persist off-heap, for example with Tachyon, now called Alluxio. > Take a look at off-heap persistence > > Regards > > > On Thursday, March 24, 2016, Holden Karau wrote: > >> Even checkpoint() is mayb

Re: HdfsWordCount only counts some of the words

2014-09-23 Thread aka.fe2s
I guess it's because this example is stateless, so it outputs counts only for the given RDD. Take a look at the stateful word counter, StatefulNetworkWordCount.scala. On Wed, Sep 24, 2014 at 4:29 AM, SK wrote: > > I execute it as follows: > > $SPARK_HOME/bin/spark-submit --master --class > org.apache.spark
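
The difference between the two styles of counter can be sketched without Spark. In this plain-Python illustration each string stands in for one batch (one RDD of the stream); the function names are hypothetical and nothing here is the Spark Streaming API.

```python
from collections import Counter

def stateless_counts(batches):
    """Count words in each batch independently, forgetting earlier batches."""
    return [Counter(batch.split()) for batch in batches]

def stateful_counts(batches):
    """Carry a running total across batches and emit it after each one."""
    total = Counter()
    snapshots = []
    for batch in batches:
        total.update(batch.split())
        snapshots.append(Counter(total))  # copy so later updates don't leak in
    return snapshots

batches = ["a b a", "b c"]
per_batch = stateless_counts(batches)
running = stateful_counts(batches)

print(per_batch[1])  # counts for the second batch only
print(running[1])    # counts accumulated over both batches
```

With the stateless version, the second output contains only the words that arrived in the second batch, which is why the totals can look like "some of the words" are missing.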

MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread aka.fe2s
Hi, I'm looking for available online ML algorithms (ones that update the model as new streaming data arrives). The only one I found is linear regression. Is there anything else implemented as part of MLlib? Thanks, Oleksiy.