OK, yarn.scheduler.maximum-allocation-mb is 16384.
I have run it again; the command to run it is:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>
>
> 15/11/24 16:15:56 INFO yarn.Applicatio
Hi all,
I have to deal with a lot of data, and I have been using Spark for months.
Now I am trying to use Vectors.sparse to generate a large feature vector, but the
feature size may exceed 4 billion, above the maximum of Int, so I want to use BigInt or
Long to handle it.
But I read code and document that V
If YARN has only 50 cores, then it can support at most 49 executors plus 1
driver application master.
Regards
Sab
On 24-Nov-2015 1:58 pm, "谢廷稳" wrote:
> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have run it again; the command to run it is:
> ./bin/spark-submit --class org.apache.spark.
Did you set the configuration "spark.dynamicAllocation.initialExecutors"?
You can set spark.dynamicAllocation.initialExecutors to 50 and try again.
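For reference, a minimal sketch of how the dynamic-allocation settings discussed in this thread could be set programmatically; the value 50 is just the figure mentioned above, and the shuffle-service setting reflects the usual requirement for dynamic allocation on YARN:

import org.apache.spark.SparkConf

// Illustrative only: the settings discussed in this thread
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")            // needed for dynamic allocation on YARN
  .set("spark.dynamicAllocation.initialExecutors", "50")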
I guess you might be hitting this issue since you're running 1.5.0,
https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
explain
The relevant error lines are:
Caused by: parquet.io.ParquetDecodingException: Can't read value in
column [roll_key] BINARY at value 19600 out of 4814, 19600 out of
19600 in currentPage. repetition level: 0, definition level: 1
Caused by: org.apache.spark.SparkException: Job aborted due to stage
fa
@Sab Thank you for your reply, but the cluster has 6 nodes containing
300 cores, and the Spark application did not request resources from YARN.
@SaiSai I have run it successfully with
"spark.dynamicAllocation.initialExecutors" set to 50, but in
http://spark.apache.org/docs/latest/configuration.html#d
The documentation is right. A bug introduced in
https://issues.apache.org/jira/browse/SPARK-9092 makes this
configuration fail to work.
It is fixed in https://issues.apache.org/jira/browse/SPARK-10790; you could
upgrade to a newer version of Spark.
On Tue, Nov 24, 2015 at 5:12 PM, 谢廷稳 wr
Not sure who generally handles that, but I just made the edit.
On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> Sorry to be a nag, I realize folks with edit rights on the Powered by Spark
> page are very busy people, but it's been 10 days since my original request,
> was wondering if maybe it j
I just updated the page to say "email dev" instead of "email user".
On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen wrote:
> Not sure who generally handles that, but I just made the edit.
>
> On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> > Sorry to be a nag, I realize folks with edit rights o
Hello, I have a question about the radix tree (PART) implementation in Spark's
IndexedRDD.
I explored the source code and found that the radix tree used in
IndexedRDD only returns exact matches. However, it seems to have a
restricted use.
For example, I want to find child nodes using a prefix fro
This is what a Radix tree returns
Hi,
When I create a stream with KafkaUtils.createDirectStream I can explicitly define
the position, "largest" or "smallest", from which to read the topic.
What if I have previous checkpoints (in HDFS, for example) with offsets, and I
want to start reading from the last checkpoint?
In source code of Kaf
Thank you very much; after changing to a newer version, it worked well!
2015-11-24 17:15 GMT+08:00 Saisai Shao :
> The documentation is right. A bug introduced in
> https://issues.apache.org/jira/browse/SPARK-9092 makes this
> configuration fail to work.
>
> It is fixed in https://issues
Thanks Christopher, I will try that.
Dan
On 20 November 2015 at 21:41, Bozeman, Christopher
wrote:
> Dan,
>
>
>
> Even though you may be adding more nodes to the cluster, the Spark
> application has to request additional executors in order to use
> the added resources. Or the Spark
Cheng,
That's exactly what I was hoping for: native support for writing DateTime
objects. As it stands, Spark 1.5.2 seems to leave no option but to do manual
conversion (to nanos, Timestamp, etc.) prior to writing records to Hive.
Regards,
Bryan Jeffrey
Sent from Outlook Mail
From: Cheng L
I see, then this is actually irrelevant to Parquet. I guess we can support
Joda DateTime in Spark SQL's reflective schema inference to enable this,
provided that this is a frequent use case, and given that Spark SQL already has Joda
as a direct dependency.
On the other hand, if you are using Scala, you can write
Cheng,
I am using Scala. I have an implicit conversion from Joda DateTime to
Timestamp. My tables are defined with Timestamp. However, explicit conversion
appears to be required. Do you have an example of an implicit conversion for this
case? Do you convert on insert or on RDD-to-DataFrame conversion?
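For readers following along, a minimal sketch (not the poster's actual code) of what such an implicit view from Joda DateTime to java.sql.Timestamp might look like; the Event case class, the rdd value and its fields are hypothetical:

import java.sql.Timestamp
import org.joda.time.DateTime
import scala.language.implicitConversions

object JodaImplicits {
  // Implicit view applied wherever a Timestamp is expected but a DateTime is supplied
  implicit def jodaToSqlTimestamp(dt: DateTime): Timestamp = new Timestamp(dt.getMillis)
}

// Usage sketch: bring the view into scope before building the case classes that
// back the DataFrame, so the conversion happens at RDD-to-DataFrame time.
// import JodaImplicits._
// case class Event(id: String, createdAt: Timestamp)
// val events = rdd.map(r => Event(r.id, r.createdAt))   // r.createdAt is a Joda DateTime here
// val df = sqlContext.createDataFrame(events)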
Hi,
If you wish to read from checkpoints, you need to use
StreamingContext.getOrCreate(checkpointDir, functionToCreateContext) to
create the streaming context that you pass in to
KafkaUtils.createDirectStream(...). You may refer to
http://spark.apache.org/docs/latest/streaming-programming-guide.ht
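To make the recovery path concrete, here is a minimal sketch of the getOrCreate pattern described above; the checkpoint directory, application name and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/streaming/checkpoints"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-stream")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Set up the DStreams here, e.g. KafkaUtils.createDirectStream(...)
  ssc.checkpoint(checkpointDir)
  ssc
}

// Restores the context (including stored offsets) from the checkpoint if one exists,
// otherwise builds a fresh context with createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()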
Hi,
I'm not able to build Spark 1.6 from source. Could you please share the
steps to build Spark 1.6?
Regards,
Rajesh
Great, thank you.
Sorry for being so inattentive; I need to read the docs more carefully.
--
Yandex.Mail - reliable mail
http://mail.yandex.ru/neo2/collect/?exp=1&t=1
24.11.2015, 15:15, "Deng Ching-Mallete" :
> Hi,
>
> If you wish to read from checkpoints, you need to use
> StreamingContext.getOrCreate
You can refer to:
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi,
>
> I'm not able to build Spark 1.6 from source. Could you please
I'm interested in knowing which NoSQL databases you use with Spark and what
your experiences are.
On a general level, I would like to use Spark streaming to process incoming
data, fetch relevant aggregated data from the database, and update the
aggregates in the DB based on the incoming records.
No, the direct stream only communicates with Kafka brokers, not Zookeeper
directly. It asks the leader for each topicpartition what the highest
available offsets are, using the Kafka offset api.
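As an aside for readers, a sketch (not from this thread) of how the direct stream can instead be started from explicit per-partition offsets; the broker address, topic, partitions and offsets are placeholders, and ssc is assumed to be an existing StreamingContext:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(
  TopicAndPartition("events", 0) -> 12345L,   // start offset for partition 0
  TopicAndPartition("events", 1) -> 67890L    // start offset for partition 1
)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
)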
On Mon, Nov 23, 2015 at 11:36 PM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Does Kafka d
I am generating a set of tables in pyspark SQL from a JSON source dataset. I am
writing those tables to disk as CSVs using
df.write.format("com.databricks.spark.csv").save(…). I have a schema like:
root
|-- col_1: string (nullable = true)
|-- col_2: string (nullable = true)
|-- col_3: timestamp
You should consider using HBase as the NoSQL database.
w.r.t. 'The data in the DB should be indexed', you need to design the
schema in HBase carefully so that the retrieval is fast.
Disclaimer: I work on HBase.
On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345
wrote:
> I'm interested in knowing wh
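For illustration only, one possible rowkey design for the streaming-aggregation use case described earlier; this is an assumption on my part rather than something proposed in the thread, and the table name, column family and salt width are placeholders:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Salt the key with a small hash of the entity id so writes spread across regions,
// and append a reversed timestamp so the most recent aggregate sorts first.
def rowKey(entityId: String, timestamp: Long): Array[Byte] = {
  val salt = (math.abs(entityId.hashCode) % 16).toByte
  Bytes.add(Array(salt), Bytes.toBytes(entityId), Bytes.toBytes(Long.MaxValue - timestamp))
}

// Usage sketch:
// val conn = ConnectionFactory.createConnection()
// val table = conn.getTable(TableName.valueOf("aggregates"))
// table.put(new Put(rowKey("user42", System.currentTimeMillis()))
//   .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(1L)))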
Thank you Sean, much appreciated.
And yes, perhaps "email dev" is a better option since the traffic is
(probably) lighter and these sorts of requests are more likely to get
noticed. Although one would need to subscribe to the dev list to do that...
-sujit
On Tue, Nov 24, 2015 at 1:16 AM, Sean Ow
Anything's possible, but that sounds pretty unlikely to me.
Are the partitions it's failing for all on the same leader?
Have there been any leader rebalances?
Do you have enough log retention?
If you log the offset for each message as it's processed, when do you see
the problem?
On Tue, Nov 24, 20
Is it possible that the Kafka offset API is somehow returning the wrong
offsets? Each time, the job fails for different partitions with an
error similar to the one below.
Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times,
most recent failure: Lost task 20.
I see the assertion error when I compare the offset ranges as shown below.
How do I log the offset for each message?
kafkaStream.transform { rdd =>
  // Get the offset ranges in the RDD
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} offsets: ${o.fromOffset} to ${o.untilOffset}")
  }
}
Hi guys,
This may be a stupid question, but I'm facing an issue here.
I found the class BinaryClassificationMetrics and I wanted to compute the
aucROC or aucPR of my model.
The thing is that the predict method of a LogisticRegressionModel only
returns the predicted class, and not the probability
Hi Prem,
Thank you for the details. I'm not able to build it; I'm facing some issues.
Is there any repository link where I can download the (preview) 1.6
versions of the spark-core_2.11 and spark-sql_2.11 jar files?
Regards,
Rajesh
On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure wrote:
> you can refer..:
See:
http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview&subj=+ANNOUNCE+Spark+1+6+0+Release+Preview
On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi Prem,
>
> Thank you for the details. I'm not able to build. I'm facing some issues.
>
> Any r
Your reasoning is correct; you need probabilities (or at least some
score) out of the model and not just a 0/1 label in order for a ROC /
PR curve to have meaning.
But you just need to call clearThreshold() on the model to make it
return a probability.
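A minimal sketch of what Sean describes, assuming an MLlib LogisticRegressionModel and a test set of LabeledPoints:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def aucMetrics(model: LogisticRegressionModel, test: RDD[LabeledPoint]): (Double, Double) = {
  // clearThreshold() makes predict() return the raw probability instead of a 0/1 label
  model.clearThreshold()
  val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (metrics.areaUnderROC(), metrics.areaUnderPR())
}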
On Tue, Nov 24, 2015 at 5:19 PM, jmvllt wro
Hi Ted,
I'm not able to find the "spark-core_2.11 and spark-sql_2.11 jar files" in the above
link.
Regards,
Rajesh
On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu wrote:
> See:
>
> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview&subj=+ANNOUNCE+Spark+1+6+0+Release+Preview
>
> On Tue, Nov 24, 2015 a
There is no code path in the script /root/spark-ec2/spark/init.sh that can
actually get to the version of Spark 1.5.2 pre-built with Hadoop 2.6. I
think the 2.4 version includes Hive as well... but setting the Hadoop major
version to 2 won't actually get you there.
Sigh. The documentation is the source
Hi Madabhattula,
Scala 2.11 requires building from source. Prebuilt binaries are
available only for Scala 2.10.
From the src folder:
dev/change-scala-version.sh 2.11
Then build as you would normally either from mvn or sbt
The above info *is* included in the spark docs but a little hard
See also:
https://repository.apache.org/content/repositories/orgapachespark-1162/org/apache/spark/spark-core_2.11/v1.6.0-preview2/
w.r.t. building locally, please specify -Pscala-2.11
Cheers
On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch wrote:
> HI Madabhattula
> Scala 2.11 requires bui
Thanks for mentioning the build requirement.
But actually it is -*D*scala-2.11 (i.e. -D for a Java property instead of
-P for a profile).
Details:
We can see this in the pom.xml:
<profile>
  <id>scala-2.11</id>
  <activation>
    <property><name>scala-2.11</name></property>
  </activation>
  <properties>
    <scala.version>2.11.7</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>
</profile>
So the scala-2.11 prof
Hi Sabarish
Thanks for the suggestion. I did not know about wholeTextFiles().
By the way, your suggestion about repartitioning was spot on! My run
time for count() went from an elapsed time of 0:56:42.902407 to an elapsed
time of 0:00:03.215143 on a data set of about 34M comprising 4720 records.
Andy
From: S
Hi Don
I went to a presentation given by Professor Ion Stoica. He mentioned that
Python was a little slower in general because of the type system. I do not
remember all of his comments. I think the context had to do with Spark SQL
and data frames.
I wonder if the Python issue is similar to the bo
I think you could have a Python UDF to turn the properties into a JSON string:
import simplejson
import pyspark.sql.functions

def to_json(row):
    return simplejson.dumps(row.asDict(recursive=True))

to_json_udf = pyspark.sql.functions.udf(to_json)
df.select("col_1", "col_2",
    to_json_udf(df.properties)).write.format("com.databricks.spark.csv").save(...)
Hi,
I need to get the batch time of the active batches that appear on the
Spark Streaming tab of the UI.
How can this be achieved in Java?
BR,
Abhi
Ref: https://issues.apache.org/jira/browse/SPARK-11953
In Spark 1.3.1 we have two methods, CreateJDBCTable and InsertIntoJDBC.
They are replaced with write.jdbc() in Spark 1.4.1.
CreateJDBCTable allows performing CREATE TABLE (DDL) on the table,
followed by INSERT (DML).
InsertIntoJDBC
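A brief sketch of the 1.4+ replacement mentioned above; the JDBC URL, table name and credentials are placeholders, and the mapping to the old methods is only approximate:

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeViaJdbc(df: DataFrame): Unit = {
  val props = new Properties()
  props.setProperty("user", "spark")        // placeholder credentials
  props.setProperty("password", "secret")

  // Roughly what CreateJDBCTable did: create the table and insert the data
  df.write.mode(SaveMode.Overwrite).jdbc("jdbc:postgresql://host:5432/db", "my_table", props)

  // Roughly what InsertIntoJDBC did: append into an existing table
  df.write.mode(SaveMode.Append).jdbc("jdbc:postgresql://host:5432/db", "my_table", props)
}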
Hi all,
Just an update on this case.
I tried many different combinations of settings (and I just upgraded to the latest EMR
4.2.0 with Spark 1.5.2).
I just found out that the problem comes from:
spark-submit --deploy-mode client --executor-cores=24 --driver-memory=5G
--executor-memory=45G
What's the current status on adding slaves to a running cluster? I want to
leverage spark-ec2 and autoscaling groups. I want to launch slaves as spot
instances when I need to do some heavy lifting, but I don't want to bring
down my cluster in order to add nodes.
Can this be done by just running
Thanks Cody for the very useful information.
It's much clearer to me now. I had a lot of wrong assumptions.
On Nov 23, 2015 10:19 PM, "Cody Koeninger" wrote:
> Partitioner is an optional field when defining an rdd. KafkaRDD doesn't
> define one, so you can't really assume anything about the way
Hi Abhi,
You should be able to register a
org.apache.spark.streaming.scheduler.StreamingListener.
There is an example here that may help:
https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs
here,
http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SparkListe
Hi Abhi,
Sorry, that was the wrong link; it should have been the StreamingListener:
http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/scheduler/StreamingListener.html
The BatchInfo can be obtained from the event, for example:
public void onBatchSubmitted(StreamingListenerBatchSubmi
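For reference, a minimal Scala sketch of registering such a listener to capture the batch time (the Java API is analogous); the listener class name is made up here, and ssc is assumed to be an existing StreamingContext:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchSubmitted}

class BatchTimeListener extends StreamingListener {
  override def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted): Unit = {
    // batchTime is the same time shown in the "Active Batches" table of the Streaming UI
    val info = batchSubmitted.batchInfo
    println(s"Batch submitted with batch time ${info.batchTime}")
  }
}

// Register the listener before starting the context:
// ssc.addStreamingListener(new BatchTimeListener())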
Hello,
I am using Spark 1.4.1 with Zeppelin. When using the Kryo serializer
(spark.serializer = org.apache.spark.serializer.KryoSerializer)
instead of the default Java serializer, I am getting the following error. Is
this a known issue?
Thanks,
Piero
java.io.IOException: Failed to connect to
Hi,
Does the receiver-based approach lose any data in case of a leader/broker loss
in Spark Streaming? We currently use Kafka Direct for Spark Streaming, and it
seems to be failing when there is a leader loss, and we can't really
guarantee that there won't be any leader loss due to rebalancing.
If w
The direct stream shouldn't silently lose data in the case of a leader
loss. Loss of a leader is handled like any other failure, retrying up to
spark.task.maxFailures times.
But really if you're losing leaders and taking that long to rebalance you
should figure out what's wrong with your kaf
>
> Hi,
As a beginner, I have the below queries on Spork (Pig on Spark).
I have cloned the repo with git clone https://github.com/apache/pig -b spark .
1. On which versions of Pig and Spark is Spork being built?
2. I followed the steps mentioned in
https://issues.apache.org/jira/browse/PIG-4059 and tried to run
>>> Details at logfile: /home/pig/pig_1448425672112.log
You need to check the log file for details
On Wed, Nov 25, 2015 at 1:57 PM, Divya Gehlot
wrote:
> Hi,
>
>
> As a beginner ,I have below queries on Spork(Pig on Spark).
> I have cloned git clone https://github.com/apache/pig -b spark .
Log file contents:
Pig Stack Trace
---
ERROR 2998: Unhandled internal error. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope
I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster
with 16.73 TB of storage, using distcp. The dataset is a collection of tar
files of about 1.7 TB each.
Nothing else was stored in HDFS, but after completing the download, the
namenode page says that 11.59 TB are in use. W
What is your HDFS replication set to?
On Wed, Nov 25, 2015 at 1:31 AM, AlexG wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2
> cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored i
Hi AlexG:
Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 =
11.4 TB.
--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-e
Hi,
I am using Hive 1.1.0 and Spark 1.5.1 and creating the Hive context in
spark-shell.
Now I am seeing worse performance from Spark SQL than from Hive.
By default, Hive returns results in 27 seconds for a plain select * query on a
1 GB dataset containing 3623203 records, while spark-sql returns in
So basically writing them into a temporary directory named with the
batch time and then moving the files to their destination on success? I
wish there were a way to skip moving files around and be able to set
the output filenames.
Thanks Burak :)
-Michael
On Mon, Nov 23, 2015, at 09:19 PM, Bura
I tried increasing spark.shuffle.io.maxRetries to 10, but it didn't help.
This is the exception that I am getting:
[MySparkApplication] WARN : Failed to execute SQL statement select *
from TableS s join TableC c on s.property = c.property from X YZ
org.apache.spark.SparkException: Job aborted
First of all, select * is not a useful SQL query to evaluate. Very rarely would a
user require all 362K records for visual analysis.
Second, collect() forces movement of all data from the executors to the driver.
Instead, write it out to some other table or to HDFS.
Also, Spark is more beneficial when you ha
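As an illustration of writing the result out rather than calling collect() (a sketch only; the query, output path and sqlContext variable are assumptions based on the error message earlier in the thread):

import org.apache.spark.sql.SaveMode

val result = sqlContext.sql(
  "select s.* from TableS s join TableC c on s.property = c.property")

// Persist the join result instead of pulling it all to the driver with collect()
result.write.mode(SaveMode.Overwrite).parquet("hdfs:///tmp/join_output")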