I'm using the PySpark DataFrame API to sort by a specific column and then save
the dataframe as a parquet file. But the resulting parquet file doesn't seem
to be sorted.
Applying the sort and doing a head() on the result shows the rows correctly
sorted by the 'value' column in descending order, as shown below:
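The head() output itself is cut off above; purely as a rough sketch of the sort-then-write pattern being described (shown in Scala rather than PySpark, in a spark-shell where sqlContext is predefined, and with hypothetical paths):

import org.apache.spark.sql.functions.desc

// Paths are placeholders, not the poster's actual locations.
val df = sqlContext.read.parquet("/path/to/input")
val sorted = df.orderBy(desc("value"))

// Each write task emits its own part-file, so the output directory contains
// several files; ordering is only guaranteed within the range each task wrote.
sorted.write.parquet("/path/to/output_sorted")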
Hi,
I have an RDD built during a Spark Streaming job and I'd like to join it to a
DataFrame (E/S input) to enrich it.
It seems that I can't join the RDD and the DF without first converting the RDD
to a DF (tell me if I'm wrong). Here are the schemas of both DFs:
scala> df
res32: org.apache.spark
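The schemas above are cut off; as a rough illustration of the convert-then-join approach in spark-shell (the case class, column names, join key and sample values are all assumptions, not the poster's real schemas):

import sqlContext.implicits._

// Hypothetical record type for the streaming RDD
case class Event(id: String, payload: String)

// Hypothetical enrichment DataFrame (in the thread this comes from an E/S input)
val refDf = Seq(("a", "extra-a"), ("b", "extra-b")).toDF("id", "enrichment")

// An RDD such as one handed to foreachRDD in the streaming job
val rdd = sc.parallelize(Seq(Event("a", "p1"), Event("b", "p2")))

// Convert the RDD to a DataFrame first, then join on the shared key column
val enriched = rdd.toDF().join(refDf, Seq("id"))
enriched.show()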
Hi,
I'm trying to rename an ORC table (in either Hive or Spark, it makes no
difference). After that, all the content of the table becomes invisible in
Spark while it is still available in Hive. The problem can always be
recreated by very simple steps:
spark shell output--
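The actual shell output is truncated above. Purely as an illustration of the kind of steps described (table names and the source table are hypothetical, and this is not the poster's exact session):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Create and populate a small ORC table (source table name is made up)
hc.sql("CREATE TABLE orc_demo (id INT, name STRING) STORED AS ORC")
hc.sql("INSERT INTO TABLE orc_demo SELECT id, name FROM some_existing_table")
hc.sql("SELECT COUNT(*) FROM orc_demo").show()           // rows are visible

// Rename the table (the same can be done from the Hive CLI)
hc.sql("ALTER TABLE orc_demo RENAME TO orc_demo_renamed")

// According to the report above, this now comes back empty from Spark,
// while the same query in Hive still returns the data.
hc.sql("SELECT COUNT(*) FROM orc_demo_renamed").show()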
I have 20 executors, 6 cores each. Total 5 stages. It fails on the 5th one. All
of them have 6474 tasks. The 5th stage is a count operation, and it also
performs aggregateByKey as part of its lazy evaluation.
I am setting:
spark.driver.memory=10G, spark.yarn.am.memory=2G and
spark.driver.maxResultSize=9G
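A rough sketch of the kind of pipeline described (the data, key type and aggregation logic here are made up, since the real code is not shown in the thread):

// Hypothetical key/value data standing in for the real records.
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1L))

// aggregateByKey is lazy: nothing runs until an action such as count() is
// called, which is why the aggregation shows up as part of the count stage.
val aggregated = pairs.aggregateByKey(0L)(_ + _, _ + _)

// count() only ships a single Long per task back to the driver, so
// spark.driver.maxResultSize is rarely the limiting factor for it; driver
// memory goes mostly to scheduling and bookkeeping for the tasks and stages.
println(aggregated.count())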
The driver maintains the complete metadata of the application (scheduling of
executors and maintaining the messaging to control the execution).
This code seems to be failing in that code path only. That said, there is JVM
overhead based on the number of executors, stages and tasks in your app.
Do you know yo
How big is your file, and can you also share the code snippet?
On Saturday, May 7, 2016, Johnny W. wrote:
> hi spark-user,
>
> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
> dataframe from a parquet data source with a single parquet file, it yields
> a stage with lots of sma
hi spark-user,
I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
dataframe from a parquet data source with a single parquet file, it yields
a stage with lots of small tasks. It seems the number of tasks depends on
how many executors I have instead of how many parquet files/partit
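A quick way to confirm the task count is to look at the number of partitions of the scan, using the same sqlCtx handle as above (the path is a placeholder):

// Path is a hypothetical placeholder for the single-file parquet source.
val df = sqlCtx.read.parquet("/data/single-file.parquet")

// The number of tasks in the scan stage equals the number of RDD partitions.
println(df.rdd.partitions.length)

// If the partitions are very small they can be merged before heavier processing.
val fewer = df.coalesce(8)
println(fewer.rdd.partitions.length)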
Hi Simon,
Thanks. I did actually have "SPARK_WORKER_CORES=8" in spark-env.sh - it's
commented as 'to set the number of cores to use on this machine'.
Not sure how this would interplay with SPARK_EXECUTOR_INSTANCES and
SPARK_EXECUTOR_CORES, but I removed it and still see no scaleup with
increasing
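For what it's worth, the same knobs can also be set as configuration properties rather than spark-env.sh variables; a small sketch with arbitrary values (app name is hypothetical):

import org.apache.spark.SparkConf

// Equivalent conf-key settings. These apply when running against a cluster
// manager such as YARN or standalone, not in local mode.
val conf = new SparkConf()
  .setAppName("scaling-test")
  .set("spark.executor.instances", "1")
  .set("spark.executor.cores", "8")
  .set("spark.executor.memory", "8g")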
Right, but these logs are from the Spark driver, and the Spark driver seems to use Akka.
ERROR [sparkDriver-akka.actor.default-dispatcher-17]
akka.actor.ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down
ActorSystem [sparkDriver]
I saw the following:
bq. at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
It was Akka which uses JavaSerializer
Cheers
On Sat, May 7, 2016 at 11:13 AM, Nirav Patel wrote:
> Hi,
>
> I thought I was using kryo serializer for shuffle. I could verify it from
> spark UI - Environment tab that
> sp
Hi,
I thought I was using the Kryo serializer for shuffle. I could verify it from
the Spark UI - Environment tab that
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator com.myapp.spark.jobs.conf.SparkSerializerRegistrator
But when I see the following error in the driver logs it
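For reference, a minimal sketch of how those two properties are typically set in code (app name is hypothetical; the serializer and registrator values are copied from the snippet above). As Ted notes in this thread, in Spark 1.x the Akka control-plane messages in the error are serialized by Akka itself, not by this setting:

import org.apache.spark.SparkConf

// Shuffle/task data serialization settings
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.myapp.spark.jobs.conf.SparkSerializerRegistrator")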
Hi Ted
The following is my use case.
I have a prediction algorithm where I need to update some records to
predict the target.
For example, I have an equation Y = mX + c.
I need to change the value of Xi for some records and calculate sum(Yi); if the
value of the prediction is not close to the target value, then repeat the pro
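Purely as an illustration of the iterate-until-close loop being described, a spark-shell sketch where the slope, intercept, data, target, tolerance and update rule are all made-up placeholders:

val m = 2.0
val c = 1.0
val target = 10000.0
val tolerance = 100.0

var xs = sc.parallelize(Seq.fill(1000)(1.0))
var sumY = xs.map(x => m * x + c).sum()

// Nudge the X values and recompute sum(Y) until it is close enough to the target.
while (math.abs(sumY - target) > tolerance) {
  val step = if (sumY < target) 0.1 else -0.1
  xs = xs.map(x => x + step)          // placeholder update rule
  sumY = xs.map(x => m * x + c).sum()
}
println(s"sum(Y) = $sumY")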
Hi,
What is the easiest way of finding max(price) in the code below?
object CEP_AVG {
  def main(args: Array[String]) {
    // Create a local StreamingContext with two working threads and a batch
    // interval of 10 seconds.
    val sparkConf = new SparkConf().
      setAppName("CEP_AVG").
      setMast
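One way to get a per-batch max(price), sketched with a made-up input (a socket source of "symbol,price" lines), since the rest of the code above is cut off:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sketchConf = new SparkConf().setAppName("CEP_AVG_sketch").setMaster("local[2]")
val ssc = new StreamingContext(sketchConf, Seconds(10))

// Placeholder source and parsing; the real job's input format is not shown.
val lines = ssc.socketTextStream("localhost", 9999)
val prices = lines.map(_.split(",")).map(a => (a(0), a(1).toDouble))

// Maximum price seen in each batch
val maxPrice = prices.map(_._2).reduce(math.max(_, _))
maxPrice.print()

ssc.start()
ssc.awaitTermination()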
Thanks Cody. It turns out that there was an even simpler explanation (the
flaw you pointed out was accurate too). I had mutable.Map instances being
passed where KafkaUtils wants immutable ones.
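For anyone hitting the same thing, the fix is just converting before the call; a small sketch with made-up broker and topic values, assuming an existing StreamingContext named ssc:

import scala.collection.mutable
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Params built up mutably elsewhere in the job (values are hypothetical)
val mutableParams = mutable.Map("metadata.broker.list" -> "broker1:9092")

// createDirectStream wants immutable Map / Set, so convert before the call
val kafkaParams: Map[String, String] = mutableParams.toMap
val topics: Set[String] = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)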
On Fri, May 6, 2016 at 8:32 AM, Cody Koeninger wrote:
> Look carefully at the error message, the typ
Check how much free memory you have on your host:
/usr/bin/free
As heuristic values, start with these:
export SPARK_EXECUTOR_CORES=4 ## Number of cores for the workers (Default: 1)
export SPARK_EXECUTOR_MEMORY=8G ## Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
export SPARK_DRIVER_MEMOR
Hello,
Is there a way to instruct treeReduce() to reduce RDD partitions on the
same node locally?
In my case, I'm using treeReduce() to reduce map results in parallel. My
reduce function is just arithmetically adding map results (i.e. no notion
of aggregation by key). As far as I understand, a sh
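For context, a minimal treeReduce sketch; the depth parameter controls how many levels of partial aggregation happen on the executors before the result reaches the driver (as far as I know it does not let you pin the combining to node-local partitions). The data here is a hypothetical stand-in for the map results:

// Hypothetical numeric RDD standing in for the map results described above.
val mapResults = sc.parallelize(1 to 1000000, numSlices = 64).map(_.toDouble)

// Partial sums are combined in `depth` levels on the executors before the
// final value is returned to the driver.
val total = mapResults.treeReduce(_ + _, depth = 3)
println(total)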
Hi,
I'm running Spark 1.6.1 on a single machine, initially a small one (8 cores,
16 GB RAM), using "--master local[*]" with spark-submit, and I'm trying to see
scaling with increasing cores, unsuccessfully.
Initially I'm setting SPARK_EXECUTOR_INSTANCES=1 and increasing the cores for
each executor. Th
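In local mode it is generally the N in local[N], not the executor env variables, that sets the parallelism; a hedged sketch of scaling cores that way (app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Re-run the same job with local[2], local[4], local[8], ... to measure scaling;
// this replaces local[*] on the spark-submit line.
val conf = new SparkConf()
  .setAppName("core-scaling-test")   // hypothetical app name
  .setMaster("local[8]")
val sc = new SparkContext(conf)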
Hi Divya,
I haven't actually used the package yet, but maybe you should check out the
Gitter room where the creator is quite active. You can find it at
https://gitter.im/FRosner/drunken-data-quality .
There you should be able to get the information you need.
Best,
Rick
On 6 May 2016 12:34, "Div