The in-process JNI only works when the R process comes up first and we launch a JVM inside it. In many deploy modes like YARN (or really anything using spark-submit), the JVM comes up first and we launch R after that. Using an inter-process solution lets us cover both use cases.
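To make that concrete, here is a minimal, hypothetical sketch of the inter-process pattern (not SparkR's actual RBackend code): the JVM side opens a plain socket server and then launches the R process itself, handing it the port, so it no longer matters which runtime the cluster manager starts first. The script name and environment variable below are made up for illustration.

import java.net.ServerSocket

// Hypothetical sketch of a socket-based backend on the JVM side.
object ToyRBackend {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(0)          // bind to any free port
    val port = server.getLocalPort
    // Launch R and tell it where to connect.
    val rProc = new ProcessBuilder("Rscript", "backend_client.R")
    rProc.environment().put("TOY_BACKEND_PORT", port.toString)
    rProc.inheritIO().start()
    val conn = server.accept()                // wait for the R side to dial in
    // ... exchange serialized commands and results over conn's streams ...
    conn.close()
    server.close()
  }
}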
Thanks
Shi
Why did SparkR eventually choose an inter-process socket solution on the driver side instead of the in-process JNI shown in one of its docs below (around page 20)?
https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf
Any reason why you need exactly a certain number of partitions?
One way we can make that work is for RangePartitioner to return a bunch of
empty partitions if the number of distinct elements is small. That would
require changing Spark.
If you want a quick workaround, you can also append some ran
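For reference, a small hypothetical sketch of the current behavior described above: with only a few distinct keys, asking RangePartitioner for many partitions yields far fewer than requested.

import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object RangePartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("range-demo").setMaster("local[*]"))
    // 300 records but only 3 distinct keys, while 10 partitions are requested.
    val pairs = sc.parallelize(Seq("a", "b", "c").flatMap(k => Seq.fill(100)(k -> 1)))
    val partitioner = new RangePartitioner(10, pairs)
    // Far fewer than 10, since the range bounds collapse onto the few distinct keys.
    println(partitioner.numPartitions)
    println(pairs.partitionBy(partitioner).partitions.length)
    sc.stop()
  }
}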
Hi,
I'm looking for open source workflow tools/engines that allow us to schedule Spark jobs on a Cassandra cluster. Since there are tons of alternatives out there like Oozie, Azkaban, Luigi, Chronos, etc., I wanted to check with people here to see what they are using today.
Some of the requiremen
We could do that after 1.5 is released; it will have the same release cycle as Spark in the future.
On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
wrote:
> +1 (once again :) )
>
> 2015-07-28 14:51 GMT+02:00 Justin Uang :
>>
>> // ping
>>
>> do we have any signoff from the pyspark devs to submit a PR
Hey all.
I was trying to understand Spark internals by looking into (and hacking) the code. I was basically trying to explore the buckets which are generated when we partition the output of each map task and then let the reduce side fetch them on the basis of partitionId. I went into the write() m
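In case it helps anyone following along, here is a small hypothetical snippet (not taken from the shuffle write path itself) showing how a record's key maps to the bucket / partitionId that the reduce side later fetches:

import org.apache.spark.HashPartitioner

// Each map-side record lands in the bucket whose index is getPartition(key);
// reducer i later fetches every map output written for bucket i.
val partitioner = new HashPartitioner(4)
Seq("apple", "banana", "cherry", "date").foreach { key =>
  println(s"key=$key -> partitionId=${partitioner.getPartition(key)}")
}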
Eron,
Thanks for sending out this list! We can make some of the critical ones
public for 1.5, but they will be marked DeveloperApi since they may require
changes in the future. Just made the JIRA: https://issues.apache.org/jira/browse/SPARK-9704 and I'll send a PR soon.
Joseph
On Mon, Aug 3
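For anyone unfamiliar with the marker mentioned above, a hypothetical example of what a DeveloperApi-annotated class looks like (the class itself is made up):

import org.apache.spark.annotation.DeveloperApi

// Public, but flagged as a developer API: it may change between releases
// without the usual compatibility guarantees.
@DeveloperApi
class MyExperimentalModel(val weights: Array[Double]) {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}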
PythonRDD.scala has a number of PySpark-specific conventions (for example, worker reuse, exceptions, etc.) and PySpark-specific protocols (e.g. for communicating accumulators and broadcasts between the JVM and Python). While it might be possible to refactor the two classes
to share some more code I
Hello!
I think I found a nice, performant solution based on take's source code:
def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = {
  if (num == 0) {
    true
  } else {
    var count: Int = 0
    val totalParts: Int = rdd.partitions.length
    var partsScanned: Int = 0
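The snippet above is cut off in the digest; one possible completion, assuming the remainder follows the same incremental partition scanning that RDD.take uses (the batch of 4 partitions per pass is an arbitrary choice for illustration):

import org.apache.spark.rdd.RDD

def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = {
  if (num == 0) {
    true
  } else {
    var count: Int = 0
    val totalParts: Int = rdd.partitions.length
    var partsScanned: Int = 0
    // Scan a few partitions per job until enough qualifying elements are found,
    // instead of counting over the whole RDD.
    while (count < num && partsScanned < totalParts) {
      val parts = partsScanned until math.min(partsScanned + 4, totalParts)
      count += rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => it.count(qualif), parts).sum
      partsScanned = parts.end
    }
    count >= num
  }
}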
What is the JIRA number, if a JIRA has been logged for this?
Thanks
> On Jan 20, 2015, at 11:30 AM, Cheng Lian wrote:
>
> Hey Yi,
>
> I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like
> to investigate this issue later. Would you please open a JIRA for it? Thanks
On behalf of Renyi Xiong -
When reading the Spark codebase, it looks to me like PythonRDD.scala is reusable, so I wonder why SparkR chose to implement its own RRDD.scala?
thanks
Daniel