I am using a dataframe that has a structure like this:
root
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: double (nullable = true)
 |    |    |-- id: string (nullable = true)
 |-- user: string (nullable = true)
 |-- language: string (nullable = true)
I am just using the above example to understand how Spark handles partitions.
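For reference, a minimal Scala sketch (the case classes, sample rows and values are made up for illustration) showing how a DataFrame with this nested schema could be built locally and how its partitioning can be inspected:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

// Hypothetical case classes mirroring the printed schema
case class Order(amount: Double, id: String)
case class Record(orders: Seq[Order], user: String, language: String)

val df = Seq(
  Record(Seq(Order(10.5, "o1"), Order(3.0, "o2")), "alice", "en"),
  Record(Seq(Order(7.25, "o3")), "bob", "fr")
).toDF()

df.printSchema()                    // same shape as the schema above
println(df.rdd.getNumPartitions)    // how many partitions Spark created for this data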
Consider an example where I have a cluster with 5 nodes, each with 64 cores and
244 GB of memory. I decide to run 3 executors on each node and set
executor-cores to 21 and executor memory to 80 GB, so that each executor can
execute 21 tasks in parallel. Now consider that 315 (63 * 5) partitions
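To make the arithmetic explicit: 3 executors per node * 21 cores = 63 concurrent tasks per node, and 63 * 5 nodes = 315 tasks running in parallel, so 315 partitions would be exactly one wave of tasks. A hedged sketch of how that sizing might be expressed (these settings are normally passed to spark-submit; the app name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-sizing-example")         // placeholder name
  .config("spark.executor.instances", "15")    // 3 executors per node * 5 nodes
  .config("spark.executor.cores", "21")        // 21 tasks can run concurrently per executor
  .config("spark.executor.memory", "80g")      // 3 * 80 GB = 240 GB, under the 244 GB per node
  .getOrCreate()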
Hello,
I have set spark.sql.autoBroadcastJoinThreshold=1GB and I am running the
Spark job. However, my application is failing with:
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at
So what I discovered was that if I write the table being joined to disk
and then read it again, Spark correctly broadcasts it. I think this is because
when Spark estimates the size of the smaller table it estimates it incorrectly
to be much bigger than what it is, and hence decides to do a SortMergeJoin.
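A hedged sketch of the workaround described above, i.e. persisting the small side and reading it back so Spark's size estimate comes from the files on disk (smallDF, largeDF, the path and the join key are placeholders):

// Write the small table out, then read it back so its size statistic is taken from disk
smallDF.write.mode("overwrite").parquet("/tmp/small_table")   // placeholder path
val smallFromDisk = spark.read.parquet("/tmp/small_table")

// The re-read table is now estimated below the threshold and gets broadcast
val joined = largeDF.join(smallFromDisk, Seq("id"))           // placeholder join key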
I have a small table well below 50 MB that I want to broadcast join with a
larger table. However, if I set spark.sql.autoBroadcastJoinThreshold to 100
MB, Spark still decides to do a SortMergeJoin instead of a broadcast join. I
have to set an explicit broadcast hint on the table for it to do the
broadcast join.
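For reference, a minimal sketch of the explicit hint being described, using the broadcast function from org.apache.spark.sql.functions (table and column names are placeholders):

import org.apache.spark.sql.functions.broadcast

// Forces a broadcast join regardless of the planner's size estimate
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
joined.explain()   // should now show BroadcastHashJoin rather than SortMergeJoin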
Hello. I am using Zeppelin on an Amazon EMR cluster while developing Apache
Spark programs in Scala. The problem is that once that cluster is destroyed
I lose all the notebooks on it. So over a period of time I have a lot of
notebooks that have to be manually exported to my local disk and from
t
I am not using PySpark. The job is written in Scala.
I am using spark.sparkContext.broadcast() to broadcast. Is it really true that
if the memory on our machines is 244 GB, a 70 GB variable can't be broadcast
even with high network speed?
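For context, a minimal sketch of the broadcast-variable pattern in question (the lookup map and its use in the closure are made-up placeholders; the question is whether this is viable when the value is ~70 GB):

// Ship a (hypothetical) lookup table to every executor once,
// instead of serializing it into each task's closure
val lookup: Map[String, Double] = Map("a" -> 1.0, "b" -> 2.0)   // placeholder data
val bc = spark.sparkContext.broadcast(lookup)

val enriched = spark.sparkContext
  .parallelize(Seq("a", "b", "c"))
  .map(k => k -> bc.value.getOrElse(k, 0.0))   // tasks read the broadcast copy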
Hello,
I have a 110 node cluster with each executor having 50 GB of memory, and I
want to broadcast a variable of 70 GB, with each machine having 244 GB of
memory. I am having difficulty doing that. I was wondering at what size it
becomes unwise to broadcast a variable. Is there a general rule of thumb?
Yes, each of the executors has 60 GB.
Hello,
I have set the value of spark.sql.autoBroadcastJoinThreshold to a very
high value of 20 GB. I am joining a table that I am sure is below this
threshold, however Spark is doing a SortMergeJoin. If I set a broadcast hint
then Spark does a broadcast join and the job finishes much faster. However,
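A hedged sketch of the configuration and the plan check being described (the threshold is in bytes; largeDF, smallDF and the join key are placeholders):

// Raise the auto-broadcast threshold (the value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20L * 1024 * 1024 * 1024)

// Without a hint the planner may still pick SortMergeJoin if it
// over-estimates the size of the smaller side
largeDF.join(smallDF, Seq("id")).explain()

// With an explicit hint the broadcast is forced
import org.apache.spark.sql.functions.broadcast
largeDF.join(broadcast(smallDF), Seq("id")).explain()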
I find that if the input is a Set then Spark doesn't try to find an encoder
for the Set, but at the same time if the output of a method is a Set it does
try to find an encoder and errors out if none is found. My understanding is that
even the input Set has to be transferred to the executors?
The first
Hello,
I am using Spark 2.2.2 with Scala 2.11.8. I wrote a short program:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[4]").getOrCreate()
case class TestCC(i: Int, ss: Set[String])   // the Set[String] field is what needs an encoder
import spark.implicits._
import spark.sqlContext.implicits._
val testCCDS = Seq(TestCC(1,Set("SS","Salil")),
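For what it's worth, a hedged sketch of one workaround for the missing Set encoder in this Spark version: keep the field as a Seq in the Dataset and convert to a Set only where set semantics are needed (this assumes the SparkSession and imports from the snippet above are in scope; the names are otherwise illustrative):

// Spark's built-in encoders handle Seq fields, so store a Seq in the Dataset...
case class TestCCSeq(i: Int, ss: Seq[String])

val seqDS = Seq(
  TestCCSeq(1, Seq("SS", "Salil")),
  TestCCSeq(2, Seq("XX"))
).toDS()

// ...and convert to a Set inside the closure, returning an encodable type
val sizes = seqDS.map(r => (r.i, r.ss.toSet.size))
sizes.show()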