Hello,
Please add a link on the Spark Community page
(https://spark.apache.org/community.html)
to the Israel Spark Meetup (https://www.meetup.com/israel-spark-users/).
We're an active group that brings together the local Spark user community and
holds regular meetups.
Thanks!
Romi K.
a/org/apache/spark/deploy/master/MasterSource.scala
What is the meaning of "waitingApps"?
And if the only place it's used is "startExecutorsOnWorkers", where apps are
filtered by "app.coresLeft > 0", shouldn't that filter also be applied to the
reported metric?
Take the executor memory times spark.shuffle.memoryFraction,
and divide the data so that each partition is smaller than that.
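A rough sizing sketch with made-up numbers (assuming the pre-1.6 spark.shuffle.memoryFraction model):

    // hypothetical numbers: 4 GB executor memory, spark.shuffle.memoryFraction = 0.2
    long executorMemoryBytes = 4L * 1024 * 1024 * 1024;
    double shuffleMemoryFraction = 0.2;
    long maxBytesPerPartition = (long) (executorMemoryBytes * shuffleMemoryFraction); // ~800 MB
    // for ~100 GB of shuffled data that means at least ~125 partitions
    long inputSizeBytes = 100L * 1024 * 1024 * 1024;
    int numPartitions = (int) Math.ceil((double) inputSizeBytes / maxBytesPerPartition);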
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Wed, Nov 18, 2015 at 2:09 PM, Tom Arnfeld wrote:
> Hi Romi,
>
> Thanks! Could you give me an indi
I had many issues with shuffles (but not this one exactly), and what
eventually solved it was to repartition the input into more parts. Have you
tried that?
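For example (the input RDD name and partition count are made up; pick the count based on your data size):

    JavaRDD<Row> repartitioned = inputRdd.repartition(500);
    // or directly on a DataFrame in Spark 1.3+: df.repartition(500)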
P.S. not sure if related, but there's a memory leak in the shuffle mechanism
https://issues.apache.org/jira/browse/SPARK-11293
ay be a network timeout etc)
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Nov 9, 2015 at 6:00 PM, Akhil Das
wrote:
> Did you find anything regarding the OOM in the executor logs?
>
> Thanks
> Best Regards
>
> On Mon, Nov 9, 2015 at 8:44 PM, Romi Kun
If they have a problem managing memory, shouldn't there be an OOM?
Why does AppClient throw an NPE?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das
wrote:
> Is that all you have in the executor logs? I suspect some of those jo
Have you read this?
https://spark.apache.org/docs/latest/monitoring.html
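If it helps, a minimal sketch is to enable the built-in JMX sink in conf/metrics.properties (assuming the default metrics config location):

    # conf/metrics.properties
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

and then attach JConsole to the driver/executor JVMs (for remote access you'd also need the usual com.sun.management.jmxremote JVM options).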
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Thu, Nov 5, 2015 at 2:08 PM, Yogesh Vyas wrote:
> Hi,
> How we can use JMX and JConsole to monitor our Spark applic
uce on dataFrame without causing it to load all data to
> driver program ?
>
> On Nov 4, 2015, at 12:34 PM, Romi Kuntsman wrote:
>
> I noticed that toJavaRDD causes a computation on the DataFrame, so is it
> considered an action, even though logically it's a transformat
I noticed that toJavaRDD causes a computation on the DataFrame, so is it
considered an action, even though logically it's a transformation?
On Nov 4, 2015 6:51 PM, "Aliaksei Tsyvunchyk"
wrote:
> Hello folks,
>
> Recently I have noticed unexpectedly big network traffic between Driver
> Program and
except "spark.master", do you have "spark://" anywhere in your code or
config files?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Nov 2, 2015 at 11:27 AM, Balachandar R.A.
wrote:
>
> -- Forwarded message --
> From: "Bala
(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
Th
thrown from PixelObject?
Are you running Spark with master=local, so it's running inside your IDE
and you can see the errors from the driver and worker?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Thu, Oct 29, 2015 at 10:04 AM, Zhang, Jingyu
wrote:
> Thanks Romi,
>
&g
Did you try to cache a DataFrame with just a single row?
Do your rows have any columns with null values?
Can you post a code snippet here on how you load/generate the dataframe?
Does dataframe.rdd.cache work?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Thu, Oct 29, 2015 at 4:33
Hi,
If I understand correctly:
rdd1 contains keys (of type StringDate)
rdd2 contains keys and values
and rdd3 should contain all the keys, with the values from rdd2?
I think you should make rdd1 and rdd2 into PairRDDs, and then use an outer join.
Does that make sense?
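A minimal Java sketch of what I mean (type names like StringDate/Value are placeholders, Java 8 lambdas for brevity, and assuming rdd2 is already a JavaPairRDD<StringDate, Value>):

    // key rdd1 with a dummy value so it becomes a pair RDD
    JavaPairRDD<StringDate, Boolean> keyed =
        rdd1.mapToPair(k -> new Tuple2<>(k, Boolean.TRUE));
    // left outer join keeps every key from rdd1; rdd2's value is wrapped in an
    // Optional (com.google.common.base.Optional in Spark 1.x)
    JavaPairRDD<StringDate, Tuple2<Boolean, Optional<Value>>> rdd3 =
        keyed.leftOuterJoin(rdd2);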
On Mon, Sep 21, 2015 at 8:37 PM Zhiliang Zhu
, Sep 21, 2015 at 5:31 PM Cody Koeninger wrote:
> That isn't accurate, I think you're confused about foreach.
>
> Look at
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
>
>
> On Mon, Sep
foreach is something that runs on the driver, not the workers.
If you want to perform some function on each record from Cassandra, you
need to do cassandraRdd.map(func), which will run distributed on the Spark
workers.
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Sep 21
sparkContext is available on the driver, not on executors.
To read from Cassandra, you can use something like this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
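A minimal Java sketch with the connector (keyspace/table/column names are made up):

    // import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    // import com.datastax.spark.connector.japi.CassandraRow;
    JavaRDD<CassandraRow> rows = javaFunctions(sc).cassandraTable("my_keyspace", "my_table");
    // per-row processing then runs distributed on the executors
    JavaRDD<String> names = rows.map(row -> row.getString("name"));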
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Sep 21, 2015 at 2:27 PM
An RDD is a set of data rows (in your case numbers); there is no inherent
meaning to the order of the items.
What exactly are you trying to accomplish?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Sep 21, 2015 at 2:29 PM, Zhiliang Zhu
wrote:
> Dear ,
>
> I have took lots o
Hi all,
The number of partitions greatly affects the speed and efficiency of the
calculation, in my case with DataFrames/SparkSQL on Spark 1.4.0.
Too few partitions with large data cause OOM exceptions.
Too many partitions on small data cause a delay due to overhead.
How do you programmatically determin
://mesos.apache.org/documentation/latest/app-framework-development-guide/
>
> Thanks
> Best Regards
>
> On Mon, Aug 24, 2015 at 12:11 PM, Romi Kuntsman wrote:
>
>> Hi,
>> I have a spark standalone cluster with 100s of applications per day, and
>> it changes size (more or
Hello,
We had the same problem. I've written a blog post with the detailed
explanation and workaround:
http://labs.totango.com/spark-read-file-with-colon/
Greetings,
Romi K.
On Tue, Aug 25, 2015 at 2:47 PM Gourav Sengupta
wrote:
> I am not quite sure about this but should the notation not be
Hi,
I have a Spark standalone cluster with 100s of applications per day, and it
changes size (more or fewer workers) at various hours. The driver runs on a
separate machine outside the Spark cluster.
When a job is running and its worker is killed (because at that hour the
number of workers is redu
tory of a particular partition
> help? For directory structure, check this out...
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>
>
> On Wed, Aug 19, 2015 at 8:18 PM, Romi Kuntsman wrote:
>
>> Hello,
>>
>> I have a Da
If you create a PairRDD from the DataFrame, using
dataFrame.toRDD().mapToPair(), then you can call
partitionBy(someCustomPartitioner) which will partition the RDD by the key
(of the pair).
Then the operations on it (like joining with another RDD) will consider
this partitioning.
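Something along these lines (the key column and partition count are made up, and HashPartitioner is just for illustration; a custom Partitioner plugs in the same way):

    JavaPairRDD<String, Row> byKey = dataFrame.javaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), row));
    JavaPairRDD<String, Row> partitioned = byKey.partitionBy(new HashPartitioner(16));
    // joins with another pair RDD using the same partitioner avoid re-shuffling this side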
I'm not sure that D
I had the exact same issue, and overcame it by overriding
NativeS3FileSystem with my own class, where I replaced the implementation
of globStatus. It's a hack but it works.
Then I set the hadoop config fs.myschema.impl to my class name, and
accessed the files through myschema:// instead of s3n://
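Roughly like this (class and scheme names are made up, and the override is heavily simplified; real code would still need to handle actual wildcards):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.s3native.NativeS3FileSystem;

    public class ColonSafeS3FileSystem extends NativeS3FileSystem {
      @Override
      public FileStatus[] globStatus(Path pathPattern) throws IOException {
        // treat the path as a literal name instead of expanding it as a glob,
        // so colons in the key don't break glob parsing
        return new FileStatus[] { getFileStatus(pathPattern) };
      }
    }

    // hadoopConf.set("fs.myschema.impl", ColonSafeS3FileSystem.class.getName());
    // then read via myschema://bucket/path instead of s3n://bucket/path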
Hello,
I have a DataFrame, with a date column which I want to use as a partition.
Each day I want to write the data for the same date in Parquet, and then
read a dataframe for a date range.
I'm using:
myDataframe.write().partitionBy("date").mode(SaveMode.Overwrite).parquet(parquetDir);
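and for reading back a date range, something like this (the dates are made up):

    // assuming: import static org.apache.spark.sql.functions.col;
    DataFrame range = sqlContext.read().parquet(parquetDir)
        .filter(col("date").geq("2015-08-01").and(col("date").lt("2015-08-08")));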
If I use
spark RDD not fit for this requirement?
>
> On Mon, Jul 27, 2015 at 1:08 PM, Romi Kuntsman wrote:
>
>> What the throughput of processing and for how long do you need to
>> remember duplicates?
>>
>> You can take all the events, put them in an RDD, group by the key
What is the throughput of processing, and for how long do you need to remember
duplicates?
You can take all the events, put them in an RDD, group by the key, and then
process each key only once.
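A minimal Java sketch of that batch approach (the Event type and getKey() are made up, Java 8 lambdas for brevity):

    JavaPairRDD<String, Event> deduped = events
        .mapToPair(e -> new Tuple2<>(e.getKey(), e))
        .reduceByKey((a, b) -> a); // keep one arbitrary representative per key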
But if you have a long running application where you want to check that you
didn't see the same value befor
Are you running the Spark cluster in standalone or YARN?
In standalone, the application gets the available resources when it starts.
With YARN, you can try to turn on the setting
*spark.dynamicAllocation.enabled*
See https://spark.apache.org/docs/latest/configuration.html
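A minimal sketch of the relevant settings (the executor counts are made up; note dynamic allocation also needs the external shuffle service):

    spark.dynamicAllocation.enabled=true
    spark.shuffle.service.enabled=true
    spark.dynamicAllocation.minExecutors=2
    spark.dynamicAllocation.maxExecutors=20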
On Wed, Jul 22, 2015 at 2
Hi,
I tried to enable the Master metrics source (to get the number of
running/waiting applications etc.), and connected it to Graphite.
However, when these are enabled, application metrics are also sent.
Is it possible to separate them, and send only master metrics without
applications?
I see that Master
Hi Tal,
I'm not sure there is currently a built-in function for it, but you can
easily define a UDF (user defined function) by extending
org.apache.spark.sql.api.java.UDF1, registering it
(sqlContext.udf().register(...)), and then use it inside your query.
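For example, a minimal sketch of a made-up string-length UDF (Java 7 style, since UDF1 predates lambdas there):

    sqlContext.udf().register("strLen", new UDF1<String, Integer>() {
      @Override
      public Integer call(String s) {
        return s == null ? 0 : s.length();
      }
    }, DataTypes.IntegerType);
    // then in SQL: SELECT strLen(name) FROM people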
RK.
On Tue, Jul 21, 2015 at 7:04 PM
Hello,
*TL;DR: task crashes with OOM, but application gets stuck in infinite loop
retrying the task over and over again instead of failing fast.*
Using Spark 1.4.0, standalone, with DataFrames on Java 7.
I have an application that does some aggregations. I played around with
shuffling settings, w
Actually there is already someone on Hadoop-Common-Dev taking care of
removing the old Guava dependency
http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201501.mbox/browser
https://issues.apache.org/jira/browse/HADOOP-11470
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
I have recently encountered a similar problem with Guava version collision
with Hadoop.
Isn't it more correct to upgrade Hadoop to use the latest Guava? Why are
they staying on version 11, does anyone know?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Wed, Jan 7, 2015 at
About version compatibility and upgrade path - can the Java application
dependencies and the Spark server be upgraded separately (i.e. will 1.1.0
library work with 1.1.1 server, and vice versa), or do they need to be
upgraded together?
Thanks!
*Romi Kuntsman*, *Big Data Engineer*
http
mory
map of 12 MB to disk (36 times so far)
14/11/24 13:13:45 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory
map of 11 MB to disk (37 times so far)
14/11/24 13:13:56 INFO FileOutputCommitter: Saved output of task
'attempt_201411241250__m_00_90' to s3n://mybucket/mydir/outp
Hello,
Currently in Spark standalone console, I can only see how long the entire
job took.
How can I know how long it was in WAITING and how long in RUNNING, and also,
while running, how long each of the jobs inside took?
Thanks,
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
? Missing feature? How do you deal with build-up of
temp files?
Thanks,
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
Let's say that I run Spark on Mesos in fine-grained mode, and I have 12
cores and 64GB memory.
I run application A on Spark, and some time after that (but before A
finishes) I start application B.
How many CPUs will each of them get?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
O
I have a single Spark cluster, not multiple frameworks and not multiple
versions. Is it relevant for my use-case?
Where can I find information about exactly how to make Mesos tell Spark how
many resources of the cluster to use? (instead of the default take-all)
*Romi Kuntsman*, *Big Data Engineer
How can I configure Mesos allocation policy to share resources between all
current Spark applications? I can't seem to find it in the architecture
docs.
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Tue, Nov 4, 2014 at 9:11 AM, Akhil Das
wrote:
> Yes. i believe Meso
tart app A - it needs just 2 cores (as you said it will get even
when there are 12 available), but gets nothing
4 - Until I stop app B, app A is stuck waiting, instead of app B freeing 2
cores and dropping to 10 cores.
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Nov 3,
ssible to divide the resources between them, according to how many
are trying to run at the same time?
So for example, if I have 12 cores: if one job is scheduled, it will get all
12 cores, but if 3 are scheduled, then each one will get 4 cores and they
will all start.
Thanks!
*Romi Kuntsman*, *Big
y jobs
run together, and together lets them use all the available resources?
- How do you divide resources between applications on your usecase?
P.S. I started reading about Mesos but couldn't figure out if/how it could
solve the described issue.
Thanks!
*Romi Kuntsman*, *Big Data
master. Using Spark 1.1.0.
If the master server is restarted, should the worker retry to register with it?
Greetings,
--
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
​Join the Customer Success Manifesto <http://youtu.be/XvFi2Wh6wgU>