>
> regards,
>
> 2018-05-23 22:30 GMT+02:00 Corey Nolet :
>
>> Please forgive me if this question has been asked already.
>>
>> I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if
>> anyone know
Please forgive me if this question has been asked already.
I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if
anyone knows of any efforts to implement the PySpark API on top of Apache
Arrow directly. In my case, I'm doing data science on a machine with 288
cores and 1TB of r
tions until the
other user gets worked into the model.
On Mon, Nov 27, 2017 at 3:08 PM, Corey Nolet wrote:
> I'm trying to use the MatrixFactorizationModel to, for instance, determine
> the latent factors of a user or item that were not used in the training
> data of the model. I
I'm trying to use the MatrixFactorizationModel to, for instance, determine
the latent factors of a user or item that were not used in the training
data of the model. I'm not as concerned about the rating as I am with the
latent factors for the user/item.
Thanks!
you don't care what each individual
> event/tuple does, e.g. if you push different event types to separate Kafka
> topics and all you care is to do a count, what is the need for single event
> processing.
>
> On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet wrote:
>
>> i ha
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On
One thing I've noticed about Flink in my following of the project has been
that it has established, in a few cases, some novel ideas and improvements
over Spark. The problem with it, however, is that both the development team
and the community around it are very small and many of those novel
improv
Nevermind, a look @ the ExternalSorter class tells me that the iterator for
each key that's only partially ordered ends up being merge sorted by
equality after the fact. Wanted to post my finding on here for others who
may have the same questions.
On Tue, Mar 1, 2016 at 3:05 PM, Corey
How can this be assumed if the object used for the key, for instance, in
the case where a HashPartitioner is used, cannot assume ordering and
therefore cannot assume a comparator can be used?
On Tue, Mar 1, 2016 at 2:56 PM, Corey Nolet wrote:
> So if I'm using reduceByKey() with a HashPa
So if I'm using reduceByKey() with a HashPartitioner, I understand that the
hashCode() of my key is used to create the underlying shuffle files.
Is anything other than hashCode() used in the shuffle files when the data
is pulled into the reducers and run through the reduce function? The reason
I'm
ople will say.
> Corey do you have presentation available online?
>
> On 8 February 2016 at 05:16, Corey Nolet wrote:
>
>> Charles,
>>
>> Thank you for chiming in and I'm glad someone else is experiencing this
>> too and not just me. I know very well how the Spark
:
>>"The dataset is 100gb at most, the spills can up to 10T-100T", Are
>> your input files lzo format, and you use sc.text() ? If memory is not
>> enough, spark will spill 3-4x of input data to disk.
>>
>>
>> -- Original Message
ot of children and doesn't even run concurrently with any other stages
so I ruled out the concurrency of the stages as a culprit for the
shuffliing problem we're seeing.
On Sun, Feb 7, 2016 at 7:49 AM, Corey Nolet wrote:
> Igor,
>
> I don't think the question is "why can
n
> if map side is ok, and you just reducing by key or something it should be
> ok, so some detail is missing...skewed data? aggregate by key?
>
> On 6 February 2016 at 20:13, Corey Nolet wrote:
>
>> Igor,
>>
>> Thank you for the response but unfortunately, the pro
The whole purpose of Apache mailing lists is that the messages get indexed
all over the web so that discussions and questions/solutions can be
searched easily by google and other engines.
For this reason, and the messages being sent via email as Steve pointed
out, it's just not possible to retract
o have more partitions
> play with shuffle memory fraction
>
> in spark 1.6 cache vs shuffle memory fractions are adjusted automatically
>
> On 5 February 2016 at 23:07, Corey Nolet wrote:
>
>> I just recently had a discovery that my jobs were taking several hours to
>> compl
I just recently had a discovery that my jobs were taking several hours to
complete because of excess shuffle spills. What I found was that when I
hit the high point where I didn't have enough memory for the shuffles to
store all of their file consolidations at once, it could spill so many
times t
David,
Thank you very much for announcing this! It looks like it could be very
useful. Would you mind providing a link to the github?
On Tue, Jan 12, 2016 at 10:03 AM, David
wrote:
> Hi all,
>
> I'd like to share news of the recent release of a new Spark package, ROSE.
>
>
> ROSE is a Scala lib
Unfortunately, MongoDB does not directly expose its locality via its client
API so the problem with trying to schedule Spark tasks against it is that
the tasks themselves cannot be scheduled locally on nodes containing query
results- which means you can only assume most results will be sent over th
Usually more information as to the cause of this will be found down in your
logs. I generally see this happen when an out of memory exception has
occurred for one reason or another on an executor. It's possible your
memory settings are too small per executor or the concurrent number of
tasks you ar
1) Spark only needs to shuffle when data needs to be partitioned around the
workers in an all-to-all fashion.
2) Multi-stage jobs that would normally require several MapReduce jobs, with
data dumped to disk between them, can instead keep intermediate results
cached in memory (a small sketch follows below).
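A minimal sketch of point 2, under assumed input/output paths, showing how a
cached intermediate RDD lets two downstream jobs reuse the same in-memory
result instead of re-reading or re-writing it to disk between stages:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("cache-example"))

  // Parse once and keep the intermediate result in memory
  val parsed = sc.textFile("hdfs:///data/events")   // hypothetical input path
    .map(_.split(","))
    .cache()

  // Both jobs below reuse the cached copy rather than recomputing from the input
  parsed.map(f => (f(0), 1L)).reduceByKey(_ + _).saveAsTextFile("hdfs:///out/by-user")
  parsed.map(f => (f(1), 1L)).reduceByKey(_ + _).saveAsTextFile("hdfs:///out/by-hour")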
I've been using SparkConf on my project for quite some time now to store
configuration information for its various components. This has worked very
well thus far in situations where I have control over the creation of the
SparkContext & the SparkConf.
I have run into a bit of a problem trying to i
I can only give you a general overview of how the Yarn integration
works from the Scala point of view.
Hope this helps.
> Yarn related logs can be found in RM ,NM, DN, NN log files in detail.
>
> Thanks again.
>
> On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet wrote:
>
>>
Elkhan,
What does the ResourceManager say about the final status of the job? Spark
jobs that run as Yarn applications can fail but still successfully clean up
their resources and give them back to the Yarn cluster. Because of this,
there's a difference between your code throwing an exception in a
we don't support union types). JSON doesn't have
>> differentiated data structures so we go with the one that gives you more
>> information when doing inference by default. If you pass in a schema to
>> JSON however, you can override this and have a JSON object parsed as a map.
I notice JSON objects are all parsed as Map[String,Any] in Jackson but for
some reason, the "inferSchema" tools in Spark SQL extracts the schema of
nested JSON objects as StructTypes.
This makes it really confusing when trying to rectify the object hierarchy
when I have maps because the Catalyst c
e chunking
of the data in the partition (fetching more than 1 record @ a time).
On Thu, Jun 25, 2015 at 12:19 PM, Corey Nolet wrote:
> I don't know exactly what's going on under the hood but I would not assume
> that just because a whole partition is not being pulled into memory @
I don't know exactly what's going on under the hood, but I would not assume
that just because a whole partition is not being pulled into memory @ one
time, each record is being pulled 1 at a time. That's the
beauty of exposing Iterators & Iterables in an API rather than collections-
the
I've seen a few places where it's been mentioned that after a shuffle each
reducer needs to pull its partition into memory in its entirety. Is this
true? I'd assume the merge sort that needs to be done (in the cases where
sortByKey() is not used) wouldn't need to pull all of the data into memory
at
If you use rdd.mapPartitions(), you'll be able to get a hold of the
iterators for each partiton. Then you should be able to do
iterator.grouped(size) on each of the partitions. I think it may mean you
have 1 element at the end of each partition that may have less than "size"
elements. If that's oka
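A minimal sketch of that approach, assuming a hypothetical batch size and
per-batch function; grouped() is the standard Scala Iterator method, so only
the last group in each partition can come up short:

  val batchSize = 30   // hypothetical
  val batched = rdd.mapPartitions { iter =>
    // grouped() yields an Iterator[Seq[T]] without materializing the whole partition
    iter.grouped(batchSize).map(group => processBatch(group))   // processBatch is a placeholder
  }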
I'm confused about this. The comment on the function seems to indicate
that there is absolutely no shuffle or network IO but it also states that
it assigns an even number of parent partitions to each final partition
group. I'm having trouble seeing how this can be guaranteed without some
data pass
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341
On Thu, Jun 18, 2015 at 7:55 PM, Du Li wrote:
> repartition() means coalesce(shuffle=false)
>
>
>
> On Thursday, June 18, 2015 4:07 PM, Corey Nolet
> wrote:
>
>
> Doesn't repart
Doesn't repartition call coalesce(shuffle=true)?
On Jun 18, 2015 6:53 PM, "Du Li" wrote:
> I got the same problem with rdd.repartition() in my streaming app, which
> generated a few huge partitions and many tiny partitions. The resulting
> high data skew makes the processing time of a batch unpre
aster("local")
>
>.setConf(SparkLauncher.DRIVER_MEMORY, "2g")
>
> .launch();
>
> spark.waitFor();
>
>}
>
> }
>
> }
>
>
>
> On Wed, Jun 17, 2015 at 5:51 PM, Corey Nolet wrote:
>
>> An example of being able to do thi
An example of being able to do this is provided in the Spark Jetty Server
project [1]
[1] https://github.com/calrissian/spark-jetty-server
On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov
wrote:
> Hi all,
>
> Is there any way running Spark job in programmatic way on Yarn cluster
> without using
So I've seen in the documentation that (after the overhead memory is
subtracted), the memory allocations of each executor are as follows (assume
default settings):
60% for cache
40% for tasks to process data
Reading about how Spark implements shuffling, I've also seen it say "20% of
executor mem
I've become accustomed to being able to use system properties to override
properties in the Hadoop Configuration objects. I just recently noticed
that when Spark creates the Hadoop Configuration in the SparkContext, it
cycles through any properties prefixed with spark.hadoop. and adds those
properti
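A small sketch of that prefixing behavior (the Hadoop property below is just
an illustrative choice): anything set on the SparkConf as spark.hadoop.* is
copied, minus the prefix, into the Hadoop Configuration the SparkContext builds:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("hadoop-conf-example")
    .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", "134217728")

  val sc = new SparkContext(conf)

  // The spark.hadoop. prefix is stripped when the property lands in the Hadoop conf
  println(sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.minsize"))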
fer cache and
> not ever touch spinning disk if it is a size that is less than memory
> on the machine.
>
> - Patrick
>
> On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote:
> > So with this... to help my understanding of Spark under the hood-
> >
> > Is this sta
b.com/apache/spark/pull/5403
>
>
>
> On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet wrote:
>
>> Is it possible to configure Spark to do all of its shuffling FULLY in
>> memory (given that I have enough memory to store all the data)?
>>
>>
>>
>>
>
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
yza wrote:
> Hi Corey,
>
> As of this PR https://github.com/apache/spark/pull/5297/files, this can
> be controlled with spark.yarn.submit.waitAppCompletion.
>
> -Sandy
>
> On Thu, May 28, 2015 at 11:48 AM, Corey Nolet wrote:
>
>> I am submitting jobs to my yar
I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm
noticing the jvm that fires up to allocate the resources, etc... is not
going away after the application master and executors have been allocated.
Instead, it just sits there printing 1 second status updates to the
console. I
It does look like the function that's executed is in the driver, so doing an
Await.result() on a thread AFTER I've executed an action should work. Just
updating this here in case anyone has this question in the future.
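A rough sketch of what that looks like inside foreachRDD, with the output
path and cleanup helper left as placeholders:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._

  dstream.foreachRDD { rdd =>
    rdd.saveAsTextFile(outputPath)   // the action; runs on the cluster

    // Run the directory cleanup on another thread, then block this driver-side
    // function until it completes before the next micro-batch is handled
    val cleanup = Future { cleanupDirectories(outputPath) }   // hypothetical helper
    Await.result(cleanup, 10.minutes)
  }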
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD cal
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD call. After the input format runs, I want to do some directory
cleanup and I want to block while I'm doing that. Is that something I can
do inside of this function? If not, where would I accomplish this on every
micr
A tad off topic, but could still be relevant.
Accumulo's design is a tad different in the realm of being able to shard
and perform set intersections/unions server-side (through seeks). I've got
an adapter for Spark SQL on top of a document store implementation in
Accumulo that accepts the push-dow
Giovanni,
The DAG can be walked by calling the "dependencies()" function on any RDD.
It returns a Seq containing the parent RDDs. If you start at the leaves
and walk through the parents until dependencies() returns an empty Seq, you
ultimately have your DAG.
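A minimal sketch of that walk; dependencies() returns a Seq[Dependency[_]],
and each dependency exposes its parent through the rdd field:

  import org.apache.spark.rdd.RDD

  // Recursively collect every ancestor RDD reachable from `leaf`
  def walk(leaf: RDD[_]): Seq[RDD[_]] = {
    val parents = leaf.dependencies.map(_.rdd)
    parents ++ parents.flatMap(walk)
  }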
On Sat, Apr 25, 2015 at 1:28 PM, Akhi
If you return an iterable, you are not tying the API to a CompactBuffer.
Someday, the data could be fetched lazily and the API would not have to
change.
On Apr 23, 2015 6:59 PM, "Dean Wampler" wrote:
> I wasn't involved in this decision ("I just make the fries"), but
> CompactBuffer is designed fo
this with Spark Streaming but imagine it would also
> work. Have you tried this?
>
> Within a window you would probably take the first x% as training and
> the rest as test. I don't think there's a question of looking across
> windows.
>
> On Thu, Apr 2, 2015 at 12:
How hard would it be to expose this in some way? I ask because the current
textFile and objectFile functions are obviously at some point calling out
to a FileInputFormat and configuring it.
Could we get a way to configure any arbitrary inputformat / outputformat?
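In the meantime, one way to plug in an arbitrary format with an explicit
Configuration is sketched below (the input path is hypothetical; swap
TextInputFormat for any custom InputFormat and its key/value classes):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val hadoopConf = new Configuration(sc.hadoopConfiguration)
  hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/in")   // hypothetical

  // The format class and its key/value classes are passed explicitly
  val rdd = sc.newAPIHadoopRDD(
    hadoopConf,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text])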
used Scalation for ARIMA models?
On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet wrote:
> Taking out the complexity of the ARIMA models to simplify things- I can't
> seem to find a good way to represent even standard moving averages in spark
> streaming. Perhaps it's my ignorance w
Taking out the complexity of the ARIMA models to simplify things- I can't
seem to find a good way to represent even standard moving averages in spark
streaming. Perhaps it's my ignorance with the micro-batched style of the
DStreams API.
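For what it's worth, a rough sketch of a plain moving average over a DStream
of doubles (window and slide durations are arbitrary); carrying (sum, count)
pairs keeps the reduce associative:

  import org.apache.spark.streaming.{Minutes, Seconds}

  // metrics: DStream[Double], a hypothetical input stream
  val movingAvg = metrics
    .map(v => (v, 1L))
    .reduceByWindow(
      (a, b) => (a._1 + b._1, a._2 + b._2),   // associative (sum, count) merge
      Minutes(5),                             // window length
      Seconds(30))                            // slide interval
    .map { case (sum, n) => sum / n }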
On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet w
I want to use ARIMA for a predictive model so that I can take time series
data (metrics) and perform a light anomaly detection. The time series data
is going to be bucketed to different time units (several minutes within
several hours, several hours within several days, several days within
several
Spark uses a SerializableWritable [1] to java serialize writable objects.
I've noticed (at least in Spark 1.2.1) that it breaks down with some
objects when Kryo is used instead of regular java serialization. Though it
is wrapping the actual AccumuloInputFormat (another example of something
you may
I would do a sum of squares. This would allow you to keep an ongoing value as an
associative operation (in an aggregator) and then calculate the variance &
std deviation after the fact.
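A small sketch of that idea over an RDD[Double] (a plain aggregate rather
than a DataFrame aggregator): track (count, sum, sum of squares)
associatively, then derive variance and std deviation at the end:

  // values: RDD[Double]
  val (n, sum, sumSq) = values.aggregate((0L, 0.0, 0.0))(
    (acc, v) => (acc._1 + 1, acc._2 + v, acc._3 + v * v),
    (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

  val mean     = sum / n
  val variance = sumSq / n - mean * mean   // population variance
  val stddev   = math.sqrt(variance)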
On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang wrote:
> Hi,
>
>
>
> I have a DataFrame object and I want to do types
Given the following scenario:
dstream.map(...).filter(...).window(...).foreachRDD()
When would the onBatchCompleted fire?
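For context, onBatchCompleted comes from a StreamingListener registered on
the StreamingContext; a minimal sketch of hooking it up is below. My
understanding is that it fires once all of a batch's output operations (the
foreachRDD at the end of that chain) have finished:

  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      // batchInfo carries the batch time plus scheduling/processing delays
      println(s"Batch ${batch.batchInfo.batchTime} done, " +
        s"processing delay: ${batch.batchInfo.processingDelay.getOrElse(-1L)} ms")
    }
  })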
Thanks for taking this on Ted!
On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu wrote:
> I have created SPARK-6085 with pull request:
> https://github.com/apache/spark/pull/4836
>
> Cheers
>
> On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet wrote:
>
>> +1 to a better def
me-consuming jobs. Imagine if there was an
> automatic partition reconfiguration function that automagically did that...
>
>
> On Tue, Feb 24, 2015 at 3:20 AM, Corey Nolet wrote:
>
>> I *think* this may have been related to the default memory overhead
>> setting being too lo
+1 to a better default as well.
We were working fine until we ran against a real dataset which was much
larger than the test dataset we were using locally. It took me a couple
days and digging through many logs to figure out this value was what was
causing the problem.
On Sat, Feb 28, 2015 at 11:
tively be listening to a
> partition.
>
> Yes, my understanding is that multiple receivers in one group are the
> way to consume a topic's partitions in parallel.
>
> On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet wrote:
> > Looking @ [1], it seems to recommend pull f
Looking @ [1], it seems to recommend pull from multiple Kafka topics in
order to parallelize data received from Kafka over multiple nodes. I notice
in [2], however, that one of the createConsumer() functions takes a
groupId. So am I understanding correctly that creating multiple DStreams
with the s
:31 AM, Zhan Zhang
> wrote:
> > Currently in spark, it looks like there is no easy way to know the
> > dependencies. It is solved at run time.
> >
> > Thanks.
> >
> > Zhan Zhang
> >
> > On Feb 26, 2015, at 4:20 PM, Corey Nolet wrote:
> >
> > Ted. That one I know. It was the dependency part I was curious about
>
xt has this method:
>* Return information about what RDDs are cached, if they are in mem or
> on disk, how much space
>* they take, etc.
>*/
> @DeveloperApi
> def getRDDStorageInfo: Array[RDDInfo] = {
>
> Cheers
>
> On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet
.map().()
> rdd1.count
> future { rdd1.saveAsHadoopFile(...) }
> future { rdd2.saveAsHadoopFile(…) }
>
> In this way, rdd1 will be calculated once, and two saveAsHadoopFile will
> happen concurrently.
>
> Thanks.
>
> Zhan Zhang
>
>
>
> On Feb 26, 2015
d be the behavior and myself and all my coworkers
expected.
On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet wrote:
> I should probably mention that my example case is much over simplified-
> Let's say I've got a tree, a fairly complex one where I begin a series of
> jobs at the ro
partition of rdd1 even when the rest is ready.
>
> That is probably usually a good idea in almost all cases. That much, I
> don't know how hard it is to implement. But I speculate that it's
> easier to deal with it at that level than as a function of the
> dependency gr
and trigger the execution
> if there is no shuffle dependencies in between RDDs.
>
> Thanks.
>
> Zhan Zhang
> On Feb 26, 2015, at 1:28 PM, Corey Nolet wrote:
>
> > Let's say I'm given 2 RDDs and told to store them in a sequence file and
> they have the fo
I see the "rdd.dependencies()" function, does that include ALL the
dependencies of an RDD? Is it safe to assume I can say
"rdd2.dependencies.contains(rdd1)"?
On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet wrote:
> Let's say I'm given 2 RDDs and told to store t
Let's say I'm given 2 RDDs and told to store them in a sequence file and
they have the following dependency:
val rdd1 = sparkContext.sequenceFile().cache()
val rdd2 = rdd1.map()
How would I tell programmatically without being the one who built rdd1 and
rdd2 whether or not rdd2 depend
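A small sketch of one way to check this programmatically: dependencies()
only exposes the immediate parents, so the check has to walk them
transitively and compare by reference:

  import org.apache.spark.rdd.RDD

  // True if `ancestor` appears anywhere in `rdd`'s lineage
  def dependsOn(rdd: RDD[_], ancestor: RDD[_]): Boolean = {
    val parents = rdd.dependencies.map(_.rdd)
    parents.exists(p => (p eq ancestor) || dependsOn(p, ancestor))
  }

  dependsOn(rdd2, rdd1)   // true when rdd2 = rdd1.map(...)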
ll see tomorrow-
but i have a suspicion this may have been the cause of the executors being
killed by the application master.
On Feb 23, 2015 5:25 PM, "Corey Nolet" wrote:
> I've got the opposite problem with regards to partitioning. I've got over
> 6000 partitions for s
t; too few in the beginning, the problems seems to decrease. Also, increasing
> spark.akka.askTimeout and spark.core.connection.ack.wait.timeout
> significantly (~700 secs), the problems seems to almost disappear. Don't
> wont to celebrate yet, still long way left before the job complet
t;> fraction of the Executor heap will be used for your user code vs the
>> shuffle vs RDD caching with the spark.storage.memoryFraction setting.
>>
>> On Sat, Feb 21, 2015 at 2:58 PM, Petar Zecevic
>> wrote:
>>
>>>
>>> Could you try to
I'm experiencing the same issue. Upon closer inspection I'm noticing that
executors are being lost as well. Thing is, I can't figure out how they are
dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory
allocated for the application. I was thinking perhaps it was possible that
a s
We've been using commons configuration to pull our properties out of
properties files and system properties (prioritizing system properties over
others) and we add those properties to our spark conf explicitly and we use
ArgoPartser to get the command line argument for which property file to
load.
Nevermind- I think I may have had a schema-related issue (sometimes
booleans were represented as strings and sometimes as raw booleans, but when
I populated the schema one or the other was chosen).
On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet wrote:
> Here are the results of a few different
Here are the results of a few different SQL strings (let's assume the
schemas are valid for the data types used):
SELECT * from myTable where key1 = true -> no filters are pushed to my
PrunedFilteredScan
SELECT * from myTable where key1 = true and key2 = 5 -> 1 filter (key2) is
pushed to my Prune
This doesn't seem to have helped.
On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust
wrote:
> Try using `backticks` to escape non-standard characters.
>
> On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet wrote:
>
>> I don't remember Oracle ever enforcing that I couldn't
I don't remember Oracle ever enforcing that I couldn't include a $ in a
column name, but I also don't think I've ever tried.
When using sqlContext.sql(...), I have a "SELECT * from myTable WHERE
locations_$homeAddress = '123 Elm St'"
It's telling me $ is invalid. Is this a bug?
Ok. I just verified that this is the case with a little test:
WHERE (a = 'v1' and b = 'v2') -> PrunedFilteredScan passes down 2 filters
WHERE (a = 'v1' and b = 'v2') or (a = 'v3') -> PrunedFilteredScan passes down
0 filters
On Fri, Feb 13, 2015
tDate).toDate
> }.getOrElse()
> val end = filters.find {
> case LessThan("end", endDate: String) => DateTime.parse(endDate).toDate
> }.getOrElse()
>
> ...
>
> Filters are advisory, so you can ignore ones that aren't start/end.
>
> Michael
>
> On
I have a temporal data set in which I'd like to be able to query using
Spark SQL. The dataset is actually in Accumulo and I've already written a
CatalystScan implementation and RelationProvider[1] to register with the
SQLContext so that I can apply my SQL statements.
With my current implementation
ng all the
data to a single partition (no matter what window I set) and it seems to
lock up my jobs. I waited for 15 minutes for a stage that usually takes
about 15 seconds and I finally just killed the job in yarn.
On Thu, Feb 12, 2015 at 4:40 PM, Corey Nolet wrote:
> So I tried this:
>
>
group should need to fit.
>
> On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet wrote:
>
>> Doesn't iter still need to fit entirely into memory?
>>
>> On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra
>> wrote:
>>
>>> rdd.mapPartitions { iter =
I was able to get this working by extending KryoRegistrator and setting the
"spark.kryo.registrator" property.
On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet wrote:
> I'm trying to register a custom class that extends Kryo's Serializer
> interface. I can
I'm trying to register a custom class that extends Kryo's Serializer
interface. I can't tell exactly what Class the registerKryoClasses()
function on the SparkConf is looking for.
How do I register the Serializer class?
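A minimal sketch of the KryoRegistrator approach mentioned above (the event
class and its serializer are placeholders): registerKryoClasses() only takes
plain classes, so a custom Serializer is wired in through a KryoRegistrator
instead:

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      // Pair the class with its custom Kryo Serializer (both are hypothetical here)
      kryo.register(classOf[MyEvent], new MyEventSerializer)
    }
  }

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", classOf[MyRegistrator].getName)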
Doesn't iter still need to fit entirely into memory?
On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra
wrote:
> rdd.mapPartitions { iter =>
> val grouped = iter.grouped(batchSize)
> for (group <- grouped) { ... }
> }
>
> On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet
I think the word "partition" here is a tad different than the term
"partition" that we use in Spark. Basically, I want something similar to
Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
want to run an algorithm that can be optimized by working on 30 people at a
time, I'd
I am able to get around the problem by doing a map and getting the Event
out of the EventWritable before I do my collect. I think I'll do that for
now.
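A sketch of that workaround, assuming the pair RDD produced by the input
format and a hypothetical get() accessor on the writable; copying the value
out on the executors side-steps the Writable object reuse before results are
serialized back to the driver:

  // rdd: RDD[(Key, EventWritable)] straight from the input format
  val events = rdd
    .map { case (_, writable) => writable.get() }   // materialize a plain Event per record
    .collect()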
On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet wrote:
> I am using an input format to load data from Accumulo [1] in to a Spark
> RD
I am using an input format to load data from Accumulo [1] in to a Spark
RDD. It looks like something is happening in the serialization of my output
writable between the time it is emitted from the InputFormat and the time
it reaches its destination on the driver.
What's happening is that the resul
Here's another lightweight example of running a SparkContext in a common
java servlet container: https://github.com/calrissian/spark-jetty-server
On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke
wrote:
> If you want to design something like Spark shell have a look at:
>
> http://zeppelin-project.
My mistake Marcelo, I was looking at the wrong message. That reply was
meant for Bo Yang.
On Feb 4, 2015 4:02 PM, "Marcelo Vanzin" wrote:
> Hi Corey,
>
> On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet wrote:
> >> Another suggestion is to build Spark by yourself.
>
ith Spark 1.1 and earlier you'd get
>> Guava 14 from Spark, so still a problem for you).
>>
>> Right now, the option Markus mentioned
>> (spark.yarn.user.classpath.first) can be a workaround for you, since
>> it will place your app's jars before Yarn's on the classpath.
>>
>
.org/jira/browse/SPARK-2996 - only works for YARN).
>> Also thread at
>> http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html
>> .
>>
>> HTH,
>> Markus
>>
>> On 02/03/2015 11:20 PM, Corey Nolet wrot
I'm having a really bad dependency conflict right now with Guava versions
between my Spark application in Yarn and (I believe) Hadoop's version.
The problem is, my driver has the version of Guava which my application is
expecting (15.0) while it appears the Spark executors that are working on
my R
We have a series of spark jobs which run in succession over various cached
datasets, do small groups and transforms, and then call
saveAsSequenceFile() on them.
Each call to save as a sequence file appears to have done its work, the
task says it completed in "xxx.x seconds" but then it pauses
e/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala
On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet wrote:
> I'm looking @ the ShuffledRDD code and it looks like there is a method
> setKeyOrdering()- is this guaranteed to order everything in the partition?
> I'm on S
I'm looking @ the ShuffledRDD code and it looks like there is a method
setKeyOrdering()- is this guaranteed to order everything in the partition?
I'm on Spark 1.2.0
On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet wrote:
> In all of the solutions I've found thus far, sorting h
y-spark-one-spark-job
>
> On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet wrote:
>
>> I need to be able to take an input RDD[Map[String,Any]] and split it into
>> several different RDDs based on some partitionable piece of the key
>> (groups) and then send each partition to
51 AM, Corey Nolet wrote:
> I need to be able to take an input RDD[Map[String,Any]] and split it into
> several different RDDs based on some partitionable piece of the key
> (groups) and then send each partition to a separate set of files in
> different folders in HDFS.
>
> 1
I need to be able to take an input RDD[Map[String,Any]] and split it into
several different RDDs based on some partitionable piece of the key
(groups) and then send each partition to a separate set of files in
different folders in HDFS.
1) Would running the RDD through a custom partitioner be the
I've read that this is supposed to be a rather significant optimization to
the shuffle system in 1.1.0 but I'm not seeing much documentation on
enabling this in Yarn. I see github classes for it in 1.2.0 and a property
"spark.shuffle.service.enabled" in the spark-defaults.conf.
The code mentions t