How big is the metrics_moveing_detection_cube table?
On Thu, Oct 13, 2016 at 8:51 PM -0700, "Lantao Jin" <jinlan...@gmail.com> wrote:
sqlContext <- sparkRHive.init(sc)
sqlString<-
"SELECT
key_id,
rtl_week_beg_dt rawdate,
gmv_plan_rate_amt value
FROM
metrics_moveing_detection_cube
"
df <- sql(sqlString)
Hi there,
I'm a newbie to Spark, recently beginning my learning journey with
Spark 2.0.1. I hit an issue that is maybe totally simple. When trying to run the
SparkPi example in Scala with the following command, an exception was thrown. Is
this the expected behavior, or is something wrong with my command?
# bin/spark-submit
It sounds like mapPartitionsWithIndex will give you the information you
want over flatMap.
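Roughly something like this (an untested Scala sketch; your original code is Java, and the pair RDD here is only a toy placeholder):

import org.apache.spark.sql.SparkSession

object LastRecordPerPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("last-record-sketch").master("local[2]").getOrCreate()
    val pairRdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)), 2)

    // mapPartitionsWithIndex exposes the partition index directly; checking
    // iter.hasNext right after pulling an element tells you whether that
    // element is the last record of the partition.
    val flagged = pairRdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      iter.map(record => (partitionIndex, record, !iter.hasNext))
    }

    flagged.collect().foreach(println)  // (partitionIndex, (key, value), isLastInPartition)
    spark.stop()
  }
}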
On Thursday, October 13, 2016, Shushant Arora
wrote:
> Hi
>
> I have a transformation on a pair rdd using flatmap function.
>
> 1.Can I detect in flatmap whether the current record is last record of
> part
sqlContext <- sparkRHive.init(sc)
sqlString<-
"SELECT
key_id,
rtl_week_beg_dt rawdate,
gmv_plan_rate_amt value
FROM
metrics_moveing_detection_cube
"
df <- sql(sqlString)
rdd<-SparkR:::toRDD(df)
#hang on case one: take from rdd
#take(rdd,3)
#hang on case two: convert back to dataframe
#df1<-create
Hi
I have a transformation on a pair rdd using flatmap function.
1. Can I detect in flatMap whether the current record is the last record of the
partition being processed, and
2. What is the partition index of this partition?
public Iterable> call(Tuple2 t)
throws Exception {
//whether element is last ele
I have a use case where I want to build a dataset based off of
conditionally available data. I thought I'd do something like this:
case class SomeData( ... ) // parameters are basic encodable types like
strings and BigDecimals
var data = spark.emptyDataset[SomeData]
// loop, determining what dat
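To make the pattern concrete, a minimal sketch of what I mean (SomeData's fields and the "conditionally available" batches below are just placeholders):

import org.apache.spark.sql.SparkSession

case class SomeData(id: String, amount: BigDecimal)

object ConditionalDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("conditional-dataset").master("local[2]").getOrCreate()
    import spark.implicits._

    // Start from an empty Dataset and union in whatever batches turn out to be available.
    var data = spark.emptyDataset[SomeData]
    val maybeBatches = Seq(
      Some(Seq(SomeData("a", BigDecimal(1)))),
      None, // a batch that was not available
      Some(Seq(SomeData("b", BigDecimal(2)))))

    for (maybeBatch <- maybeBatches; batch <- maybeBatch) {
      data = data.union(batch.toDS())
    }

    data.show()
    spark.stop()
  }
}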
It seems like this is a real issue, so I've opened a JIRA:
https://issues.apache.org/jira/browse/SPARK-17928
In Spark 1.5.2 I had a job that reads from a textFile and saves some data
into a Parquet table. One value of type `ArrayList` was being
successfully saved as an "array" column in the Parquet table. When I
upgraded to Spark version 2.0.1, I changed the necessary code (SparkConf to
SparkSession, DataFrame
I have a python script that is used to submit spark jobs using the
spark-submit tool. I want to execute the command and write the output both
to STDOUT and a logfile in real time. I'm using Python 2.7 on an Ubuntu
server.
This is what I have so far in my SubmitJob.py script:
#!/usr/bin/python
# Sub
Actually each element of mapWithState has a timeout component. You can write
a function to "treat" your timeout.
You can match it with your batch size and do fun stuff when the batch ends.
People do session management with the same approach.
When activity is registered the session is refreshed,
StateSpec has a method numPartitions to set the initial number of partitions.
That should do the trick.
...Manas
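A rough sketch putting both suggestions together (the key/value/state types, the 8 partitions and the 300s timeout are just placeholders):

import org.apache.spark.streaming.{Seconds, State, StateSpec}

object SessionStateSketch {
  // Emits (key, runningCount); when a key times out, isTimingOut() is true and
  // you get one last call in which to "treat" the timeout, e.g. close the session.
  def trackSession(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
    if (state.isTimingOut()) {
      Some((key, state.get()))            // no activity within the timeout window
    } else {
      val updated = state.getOption().getOrElse(0) + value.getOrElse(0)
      state.update(updated)               // activity registered, session refreshed
      Some((key, updated))
    }
  }

  val spec = StateSpec
    .function(trackSession _)
    .numPartitions(8)                     // initial number of partitions for the state
    .timeout(Seconds(300))                // per-element timeout mentioned above

  // usage: pairDStream.mapWithState(spec)
}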
Awesome, good points everyone. The ranking of the issues is super useful
and I'd also completely forgotten about the lack of built in UDAF support
which is rather important. There is a PR to make it easier to call/register
JVM UDFs from Python which will hopefully help a bit there too. I'm getting
We see users run both in the dispatcher and marathon. I generally prefer
marathon, because there's a higher likelihood it's going to have some
feature you need that the dispatcher lacks (like in this case).
It doesn't look like we support overhead for the driver.
On Thu, Oct 13, 2016 at 10:42 AM
On Thu, Oct 13, 2016 at 1:42 PM, drewrobb wrote:
> When using spark on mesos and deploying a job in cluster mode using
> dispatcher, there appears to be no memory overhead configuration for the
> launched driver processes ("--driver-memory" is the same as Xmx which is
> the
> same as the memory q
It doesn't look like we are. Can you file a JIRA? A workaround is to set
spark.mesos.executor.memoryOverhead to be at least spark.memory.offHeap.size.
This is how the container is sized:
https://github.com/apache/spark/blob/master/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSched
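For example, something like this when building the conf (e.g. in spark-shell; the sizes are purely illustrative, and note the real keys are camel-cased, with the overhead given in MB):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")
  .set("spark.mesos.executor.memoryOverhead", "1024") // MB; keep this >= the off-heap size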
When using spark on mesos and deploying a job in cluster mode using
dispatcher, there appears to be no memory overhead configuration for the
launched driver processes ("--driver-memory" is the same as Xmx which is the
same as the memory quota). This makes it almost a guarantee that a long
running d
Hi,
On 10/13/2016 04:35 PM, Cody Koeninger wrote:
So I see in the logs that PIDRateEstimator is choosing a new rate, and
the rate it's choosing is 100.
But it's always choosing 100, while all the other variables (processing
time, latestRate, etc.) change.
Also, the records per batch is
Hi all,
I'm using Spark 2.0.0 to train a model with 10 million+ parameters, on about
500GB of data. treeAggregate is used to aggregate the gradient; when I set
depth = 2 or 3, it works, and depth = 3 is faster.
So I set depth = 4 to obtain better performance, but now some executors
will be OOM
So I see in the logs that PIDRateEstimator is choosing a new rate, and
the rate it's choosing is 100.
That happens to be the default minimum of an (apparently undocumented) setting,
spark.streaming.backpressure.pid.minRate
Try setting that to 1 and see if there's different behavior.
BTW, how ma
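For reference, the relevant settings could look like this when building the conf (values purely illustrative, e.g. in spark-shell):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.pid.minRate", "1")    // default is 100
  .set("spark.streaming.kafka.maxRatePerPartition", "10")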
Hi there,
I opened a question on StackOverflow at this link:
http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-dateformat-pattern-in-spark-read-load-for-dates?noredirect=1#comment67297930_40007972
I didn’t get any useful answer, so I’m writing here hoping that someone can
As Sean said, it's unreleased. If you want to try it out, build Spark:
http://spark.apache.org/docs/latest/building-spark.html
The easiest way to include the jar is probably to use mvn install to
put it in your local repository, then link it in your application's
mvn or sbt build file as describe
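For example, after `mvn install`, an sbt build could pick it up roughly like this (the version string depends on what you actually built, so treat it as a placeholder):

resolvers += Resolver.mavenLocal
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.0.2-SNAPSHOT"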
Hi,
I am trying to understand how Mesos allocates memory when off-heap is enabled,
but it seems that the framework only takes the heap + 400 MB overhead
into consideration for resource allocation.
Example: spark.executor.memory=3g spark.memory.offheap.size=1g ==> Mesos
reports 3.4g allocated for t
Finally I found the solution!
I changed Python's working directory settings as below:
import os
import sys
os.chdir("C:\\Python27")
os.curdir
and it works like a charm :)
Hi,
We have a spark cluster and we wanted to add some security for it. I was
looking at the documentation (in
http://spark.apache.org/docs/latest/security.html) and had some questions.
1. Do all executors listen on the same blockManager port? For example, in
yarn there are multiple execu
I think security has nothing to do with which API you use, Spark SQL or the RDD
API.
Assuming you're running on a YARN cluster (that is the only cluster manager
that supports Kerberos currently).
First you need to get a Kerberos TGT in your local spark-submit process;
after being authenticated by Kerberos, S
The problem happens when writing (reading works fine):
rdd.saveAsNewAPIHadoopFile
We use just RDDs and HDFS, nothing else.
Spark version 1.6.1.
`Cluster A` - CDH 5.7.1
`Cluster B` - vanilla Hadoop 2.6.5
`Cluster C` - CDH 5.8.0
Best regards,
Denis
On 13 October 2016 at 13:06, ayan guha wrote:
Apologies, oversight, I had a mix of mllib and ml imports.
On Thu, Oct 13, 2016 at 2:27 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:
> Hello,
>
> How do I create a row matrix from a dense vector. The following code,
> doesn't compile.
>
> val features = df.rdd.map(r => Vectors.den
Hello,
How do I create a row matrix from a dense vector? The following code
doesn't compile.
val features = df.rdd.map(r =>
Vectors.dense(r.getAs[Double]("constant"),
r.getAs[Double]("sqft_living")))
val rowMatrix = new RowMatrix(features, features.count(), 2)
The compiler error
Error:(24, 33)
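Following the mllib-vs-ml mix-up mentioned above, a hedged, self-contained version with consistent mllib imports (the stand-in DataFrame and its values are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object RowMatrixExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-matrix").master("local[2]").getOrCreate()
    import spark.implicits._

    // Stand-in DataFrame with the two columns used in the question.
    val df = Seq((1.0, 1180.0), (1.0, 2570.0), (1.0, 770.0)).toDF("constant", "sqft_living")

    // RowMatrix expects RDD[org.apache.spark.mllib.linalg.Vector], so the
    // Vectors here must come from mllib, not from org.apache.spark.ml.
    val features = df.rdd.map(r =>
      Vectors.dense(r.getAs[Double]("constant"), r.getAs[Double]("sqft_living")))

    val rowMatrix = new RowMatrix(features, features.count(), 2)
    println(s"rows = ${rowMatrix.numRows()}, cols = ${rowMatrix.numCols()}")
    spark.stop()
  }
}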
Hello,
My task is to update a dataframe in a while loop until there is no more data
to update.
The Spark SQL I used is like below:
val hc = sqlContext
hc.sql("use person")
var temp_pair = hc.sql("""
select ROW_NUMBER() OVER (ORDER B
And a little more detail on the Spark version, Hadoop version and distribution
would also help...
On Thu, Oct 13, 2016 at 9:05 PM, ayan guha wrote:
> I think one point you need to mention is your target - HDFS, Hive or Hbase
> (or something else) and which end points are used.
>
> On Thu, Oct 13, 2
I think one point you need to mention is your target - HDFS, Hive or HBase
(or something else) - and which endpoints are used.
On Thu, Oct 13, 2016 at 8:50 PM, dbolshak wrote:
> Hello community,
>
> We've a challenge and no ideas how to solve it.
>
> The problem,
>
> Say we have the following env
Hello community,
We have a challenge and no ideas on how to solve it.
The problem:
Say we have the following environment:
1. `cluster A`: this cluster does not use Kerberos and we use it as a source
of data; the important thing is, we don't manage this cluster.
2. `cluster B`: a small cluster where our sp
All nodes of my YARN cluster are running Java 7, but I submit the job
from a Java 8 client.
I realised I run the job in yarn-cluster mode and that's why setting
'--driver-java-options' is effective. Now the problem is, why does submitting a
job from a Java 8 client to a Java 7 cluster cause a PermG
This partially answers the question: http://stackoverflow.com/a/35449563/604041
On 10/04/2016 03:10 PM, Samy Dindane wrote:
Hi,
I have the following schema:
root
 |-timestamp
 |-date
 |-year
 |-month
 |-day
 |-some_column
 |-some_other_column
I'd like to achieve either of these:
1) Us
Hey Cody,
Thanks for the reply. Really helpful.
Following your suggestion, I set spark.streaming.backpressure.enabled to true
and maxRatePerPartition to 10.
I know I can handle 100k records at the same time, but definitely not in 1
second (the batchDuration), so I expect the backpressure t
You can specify it; it just doesn't do anything but cause a warning in Java
8. It won't work in general to have such a tiny PermGen. If it's working it
means you're on Java 8 because it's ignored. You should set MaxPermSize if
anything, not PermSize. However the error indicates you are not using Ja
Solved the problem by specifying the PermGen size when submitting the job
(even to just a few MB).
It seems Java 8 has removed the Permanent Generation space, so the corresponding
JVM arguments are ignored. But I can still use --driver-java-options
"-XX:PermSize=80M -XX:MaxPermSize=100m" to specify th
The error doesn't say you're out of memory, but says you're out of PermGen.
If you see this, you aren't running Java 8 AFAIK, because 8 has no PermGen.
But if you're running Java 7, and you go investigate what this error means,
you'll find you need to increase PermGen. This is mentioned in the Spar
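If the driver really does end up on Java 7 (e.g. the yarn-cluster case above), the same flag can also go through the standard config rather than --driver-java-options; a small sketch (the 256m value is only an example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=256m") // ignored (with a warning) on Java 8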
add --jars /spark-streaming-kafka_2.10-1.5.1.jar
(may need to download the jar file or any newer version)
to spark-shell.
I also have spark-streaming-kafka-assembly_2.10-1.6.1.jar on the --jars
list.
HTH
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gB
Hi,
I have a problem when running Spark SQL via PySpark on Java 8. Below is the
log.
16/10/13 16:46:40 INFO spark.SparkContext: Starting job: sql at
NativeMethodAccessorImpl.java:-2
Exception in thread "dag-scheduler-event-loop"
java.lang.OutOfMemoryError: PermGen space
at java.lang.Class
I don't believe that's been released yet. It looks like it was merged into
branches about a week ago. You're looking at unreleased docs too - have a
look at http://spark.apache.org/docs/latest/ for the latest released docs.
On Thu, Oct 13, 2016 at 9:24 AM JayKay wrote:
> I want to work with the
I want to work with the Kafka integration for structured streaming. I use
Spark version 2.0.0 and I start the spark-shell with:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
As described here:
https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-
ANDREW! Thank you. The code worked, you're a legend. I was going to
register today and now saved €€€. Owe you a beer.
Gregory
2016-10-12 10:04 GMT+09:00 Andrew James :
> Hey, I just found a promo code for Spark Summit Europe that saves 20%.
> It’s "Summit16" - I love Brussels and jus