OK, yarn.scheduler.maximum-allocation-mb is 16384.
I have run it again; the command to run it is:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>
>
> 15/11/24 16:15:56 INFO yarn.Applicatio
Hi all,
I have to deal with a lot of data, and I have been using Spark for months.
Now I am trying to use Vectors.sparse to generate a large feature vector, but the
feature size may exceed 4 billion, above the maximum of Int, so I want to use BigInt or
Long to handle it.
But I read code and document that V
If YARN has only 50 cores, then it can support at most 49 executors plus 1
driver application master.
Regards
Sab
On 24-Nov-2015 1:58 pm, "谢廷稳" wrote:
> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have run it again; the command to run it is:
> ./bin/spark-submit --class org.apache.spark.
Did you set the configuration "spark.dynamicAllocation.initialExecutors"?
You can set spark.dynamicAllocation.initialExecutors to 50 and try again.
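For reference, a minimal sketch of how the dynamic-allocation settings discussed in this thread could be set programmatically; the value 50 is just the figure mentioned above, and the shuffle-service setting reflects the usual requirement for dynamic allocation on YARN:

import org.apache.spark.SparkConf

// Illustrative only: the settings discussed in this thread
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")            // needed for dynamic allocation on YARN
  .set("spark.dynamicAllocation.initialExecutors", "50")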
I guess you might be hitting this issue since you're running 1.5.0,
https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
explain
The relevant error lines are:
Caused by: parquet.io.ParquetDecodingException: Can't read value in
column [roll_key] BINARY at value 19600 out of 4814, 19600 out of
19600 in currentPage. repetition level: 0, definition level: 1
Caused by: org.apache.spark.SparkException: Job aborted due to stage
fa
@Sab Thank you for your reply, but the cluster has 6 nodes containing
300 cores, and the Spark application did not request resources from YARN.
@SaiSai I have run it successfully with
"spark.dynamicAllocation.initialExecutors" set to 50, but in
http://spark.apache.org/docs/latest/configuration.html#d
The documentation is right. A bug introduced in
https://issues.apache.org/jira/browse/SPARK-9092 makes this
configuration fail to work.
It is fixed in https://issues.apache.org/jira/browse/SPARK-10790; you could
upgrade to a newer version of Spark.
On Tue, Nov 24, 2015 at 5:12 PM, 谢廷稳 wr
Not sure who generally handles that, but I just made the edit.
On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> Sorry to be a nag, I realize folks with edit rights on the Powered by Spark
> page are very busy people, but it's been 10 days since my original request,
> was wondering if maybe it j
I just updated the page to say "email dev" instead of "email user".
On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen wrote:
> Not sure who generally handles that, but I just made the edit.
>
> On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> > Sorry to be a nag, I realize folks with edit rights o
Hello, I have a question about the radix tree (PART) implementation in Spark's
IndexedRDD.
I explored the source code and found that the radix tree used in
IndexedRDD only returns exact matches. However, it seems to have a
restricted use.
For example, I want to find child nodes using a prefix fro
This is what a Radix tree returns
Hi,
When I create a stream with KafkaUtils.createDirectStream I can explicitly define
the position, "largest" or "smallest", from which to read the topic.
What if I have previous checkpoints (in HDFS, for example) with offsets, and I
want to start reading from the last checkpoint?
In source code of Kaf
Thank you very much; after changing to a newer version, it worked well!
2015-11-24 17:15 GMT+08:00 Saisai Shao :
> The documentation is right. A bug introduced in
> https://issues.apache.org/jira/browse/SPARK-9092 makes this
> configuration fail to work.
>
> It is fixed in https://issues
Thanks Christopher, I will try that.
Dan
On 20 November 2015 at 21:41, Bozeman, Christopher
wrote:
> Dan,
>
>
>
> Even though you may be adding more nodes to the cluster, the Spark
> application has to request additional executors in order to use
> the added resources. Or the Spark
Cheng,
That's exactly what I was hoping for: native support for writing DateTime
objects. As it stands, Spark 1.5.2 seems to leave no option but to do manual
conversion (to nanos, Timestamp, etc.) prior to writing records to Hive.
Regards,
Bryan Jeffrey
Sent from Outlook Mail
From: Cheng L
I see, then this is actually irrelevant to Parquet. I guess we can support
Joda DateTime in Spark SQL's reflective schema inference to enable this,
provided that this is a frequent use case, and given that Spark SQL already has Joda
as a direct dependency.
On the other hand, if you are using Scala, you can write
Cheng,
I am using Scala. I have an implicit conversion from Joda DateTime to
Timestamp. My tables are defined with Timestamp. However, explicit conversion
appears to be required. Do you have an example of an implicit conversion for this
case? Do you convert on insert or on RDD-to-DataFrame conversion?
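For readers following along, a minimal sketch (not the poster's actual code) of what such an implicit view from Joda DateTime to java.sql.Timestamp might look like; the Event case class, the rdd value and its fields are hypothetical:

import java.sql.Timestamp
import org.joda.time.DateTime
import scala.language.implicitConversions

object JodaImplicits {
  // Implicit view applied wherever a Timestamp is expected but a DateTime is supplied
  implicit def jodaToSqlTimestamp(dt: DateTime): Timestamp = new Timestamp(dt.getMillis)
}

// Usage sketch: bring the view into scope before building the case classes that
// back the DataFrame, so the conversion happens at RDD-to-DataFrame time.
// import JodaImplicits._
// case class Event(id: String, createdAt: Timestamp)
// val events = rdd.map(r => Event(r.id, r.createdAt))   // r.createdAt is a Joda DateTime here
// val df = sqlContext.createDataFrame(events)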
Hi,
If you wish to read from checkpoints, you need to use
StreamingContext.getOrCreate(checkpointDir, functionToCreateContext) to
create the streaming context that you pass in to
KafkaUtils.createDirectStream(...). You may refer to
http://spark.apache.org/docs/latest/streaming-programming-guide.ht
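To make the recovery path concrete, here is a minimal sketch of the getOrCreate pattern described above; the checkpoint directory, application name and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/streaming/checkpoints"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-stream")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Set up the DStreams here, e.g. KafkaUtils.createDirectStream(...)
  ssc.checkpoint(checkpointDir)
  ssc
}

// Restores the context (including stored offsets) from the checkpoint if one exists,
// otherwise builds a fresh context with createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()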
Hi,
I'm not able to build Spark 1.6 from source. Could you please share the
steps to build Spark 1.6?
Regards,
Rajesh
Great, thank you.
Sorry for being so inattentive; I need to read the docs more carefully.
--
Yandex.Mail - reliable mail
http://mail.yandex.ru/neo2/collect/?exp=1&t=1
24.11.2015, 15:15, "Deng Ching-Mallete" :
> Hi,
>
> If you wish to read from checkpoints, you need to use
> StreamingContext.getOrCreate
You can refer to:
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi,
>
> I'm not able to build Spark 1.6 from source. Could you please
I'm interested in knowing which NoSQL databases you use with Spark and what
your experiences are.
On a general level, I would like to use Spark streaming to process incoming
data, fetch relevant aggregated data from the database, and update the
aggregates in the DB based on the incoming records.
No, the direct stream only communicates with Kafka brokers, not Zookeeper
directly. It asks the leader for each topicpartition what the highest
available offsets are, using the Kafka offset api.
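As an aside for readers, a sketch (not from this thread) of how the direct stream can instead be started from explicit per-partition offsets; the broker address, topic, partitions and offsets are placeholders, and ssc is assumed to be an existing StreamingContext:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(
  TopicAndPartition("events", 0) -> 12345L,   // start offset for partition 0
  TopicAndPartition("events", 1) -> 67890L    // start offset for partition 1
)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
)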
On Mon, Nov 23, 2015 at 11:36 PM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Does Kafka d
I am generating a set of tables in pyspark SQL from a JSON source dataset. I am
writing those tables to disk as CSVs using
df.write.format("com.databricks.spark.csv").save(…). I have a schema like:
root
|-- col_1: string (nullable = true)
|-- col_2: string (nullable = true)
|-- col_3: timestamp
You should consider using HBase as the NoSQL database.
w.r.t. 'The data in the DB should be indexed', you need to design the
schema in HBase carefully so that the retrieval is fast.
Disclaimer: I work on HBase.
On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345
wrote:
> I'm interested in knowing wh
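For illustration only, one possible rowkey design for the streaming-aggregation use case described earlier; this is an assumption on my part rather than something proposed in the thread, and the table name, column family and salt width are placeholders:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Salt the key with a small hash of the entity id so writes spread across regions,
// and append a reversed timestamp so the most recent aggregate sorts first.
def rowKey(entityId: String, timestamp: Long): Array[Byte] = {
  val salt = (math.abs(entityId.hashCode) % 16).toByte
  Bytes.add(Array(salt), Bytes.toBytes(entityId), Bytes.toBytes(Long.MaxValue - timestamp))
}

// Usage sketch:
// val conn = ConnectionFactory.createConnection()
// val table = conn.getTable(TableName.valueOf("aggregates"))
// table.put(new Put(rowKey("user42", System.currentTimeMillis()))
//   .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(1L)))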
Thank you Sean, much appreciated.
And yes, perhaps "email dev" is a better option since the traffic is
(probably) lighter and these sorts of requests are more likely to get
noticed. Although one would need to subscribe to the dev list to do that...
-sujit
On Tue, Nov 24, 2015 at 1:16 AM, Sean Ow
Anything's possible, but that sounds pretty unlikely to me.
Are the partitions it's failing for all on the same leader?
Have there been any leader rebalances?
Do you have enough log retention?
If you log the offset for each message as it's processed, when do you see
the problem?
On Tue, Nov 24, 20
Is it possible that the Kafka offset API is somehow returning the wrong
offsets? Each time, the job fails for different partitions with an
error similar to the one below.
Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times,
most recent failure: Lost task 20.
I see the assertion error when I compare the offset ranges as shown below.
How do I log the offset for each message?
kafkaStream.transform { rdd =>
  // Get the offset ranges in the RDD
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} offsets: ${o.fromOffset} to ${o.untilOffset}")
  }
}
Hi guys,
This may be a stupid question, but I'm facing an issue here.
I found the class BinaryClassificationMetrics and I wanted to compute the
aucROC or aucPR of my model.
The thing is that the predict method of a LogisticRegressionModel only
returns the predicted class, and not the probability
Hi Prem,
Thank you for the details. I'm not able to build it; I'm facing some issues.
Is there any repository link where I can download the (preview) 1.6
versions of the spark-core_2.11 and spark-sql_2.11 jar files?
Regards,
Rajesh
On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure wrote:
> you can refer..:
See:
http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview&subj=+ANNOUNCE+Spark+1+6+0+Release+Preview
On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi Prem,
>
> Thank you for the details. I'm not able to build. I'm facing some issues.
>
> Any r
Your reasoning is correct; you need probabilities (or at least some
score) out of the model and not just a 0/1 label in order for a ROC /
PR curve to have meaning.
But you just need to call clearThreshold() on the model to make it
return a probability.
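A minimal sketch of what Sean describes, assuming an MLlib LogisticRegressionModel and a test set of LabeledPoints:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def aucMetrics(model: LogisticRegressionModel, test: RDD[LabeledPoint]): (Double, Double) = {
  // clearThreshold() makes predict() return the raw probability instead of a 0/1 label
  model.clearThreshold()
  val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (metrics.areaUnderROC(), metrics.areaUnderPR())
}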
On Tue, Nov 24, 2015 at 5:19 PM, jmvllt wro
Hi Ted,
I'm not able to find the "spark-core_2.11 and spark-sql_2.11 jar files" in the above
link.
Regards,
Rajesh
On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu wrote:
> See:
>
> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview&subj=+ANNOUNCE+Spark+1+6+0+Release+Preview
>
> On Tue, Nov 24, 2015 a
There is no code path in the script /root/spark-ec2/spark/init.sh that can
actually get to the version of Spark 1.5.2 pre-built with Hadoop 2.6. I
think the 2.4 version includes Hive as well... but setting the Hadoop major
version to 2 won't actually get you there.
Sigh. The documentation is the source
Hi Madabhattula,
Scala 2.11 requires building from source. Prebuilt binaries are
available only for Scala 2.10.
From the src folder:
dev/change-scala-version.sh 2.11
Then build as you would normally either from mvn or sbt
The above info *is* included in the spark docs but a little hard
See also:
https://repository.apache.org/content/repositories/orgapachespark-1162/org/apache/spark/spark-core_2.11/v1.6.0-preview2/
w.r.t. building locally, please specify -Pscala-2.11
Cheers
On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch wrote:
> HI Madabhattula
> Scala 2.11 requires bui
Thanks for mentioning the build requirement.
But actually it is -*D*scala-2.11 (i.e. -D for a Java property instead of
-P for a profile).
Details:
We can see this in the pom.xml:
<profile>
  <id>scala-2.11</id>
  <activation>
    <property><name>scala-2.11</name></property>
  </activation>
  <properties>
    <scala.version>2.11.7</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>
</profile>
So the scala-2.11 prof
Hi Sabarish
Thanks for the suggestion. I did not know about wholeTextFiles().
By the way, your suggestion about repartitioning was spot on! My run
time for count() went from an elapsed time of 0:56:42.902407 to an elapsed
time of 0:00:03.215143 on a data set of about 34M comprising 4720 records.
Andy
From: S
Hi Don
I went to a presentation given by Professor Ion Stoica. He mentioned that
Python was a little slower in general because of the type system. I do not
remember all of his comments. I think the context had to do with Spark SQL
and data frames.
I wonder if the Python issue is similar to the bo
I think you could have a Python UDF to turn the properties into a JSON string:
import simplejson
import pyspark.sql.functions

def to_json(row):
    return simplejson.dumps(row.asDict(recursive=True))

to_json_udf = pyspark.sql.functions.udf(to_json)
df.select("col_1", "col_2",
    to_json_udf(df.properties)).write.format("com.databricks.spark.csv").save(...)
Hi,
I need to get the batch time of the active batches that appear on the
Spark Streaming tab of the UI.
How can this be achieved in Java?
BR,
Abhi
Ref: https://issues.apache.org/jira/browse/SPARK-11953
In Spark 1.3.1 we have two methods, CreateJDBCTable and InsertIntoJDBC.
They are replaced with write.jdbc() in Spark 1.4.1.
CreateJDBCTable allows performing CREATE TABLE (DDL) on the table,
followed by INSERT (DML).
InsertIntoJDBC
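A brief sketch of the 1.4+ replacement mentioned above; the JDBC URL, table name and credentials are placeholders, and the mapping to the old methods is only approximate:

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeViaJdbc(df: DataFrame): Unit = {
  val props = new Properties()
  props.setProperty("user", "spark")        // placeholder credentials
  props.setProperty("password", "secret")

  // Roughly what CreateJDBCTable did: create the table and insert the data
  df.write.mode(SaveMode.Overwrite).jdbc("jdbc:postgresql://host:5432/db", "my_table", props)

  // Roughly what InsertIntoJDBC did: append into an existing table
  df.write.mode(SaveMode.Append).jdbc("jdbc:postgresql://host:5432/db", "my_table", props)
}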
Hi all,
Just an update on this case.
I tried many different combinations of settings (and I just upgraded to the latest EMR
4.2.0 with Spark 1.5.2).
I just found out that the problem comes from:
spark-submit --deploy-mode client --executor-cores=24 --driver-memory=5G
--executor-memory=45G
What's the current status on adding slaves to a running cluster? I want to
leverage spark-ec2 and autoscaling groups. I want to launch slaves as spot
instances when I need to do some heavy lifting, but I don't want to bring
down my cluster in order to add nodes.
Can this be done by just running
Thanks Cody for the very useful information.
It's much clearer to me now. I had a lot of wrong assumptions.
On Nov 23, 2015 10:19 PM, "Cody Koeninger" wrote:
> Partitioner is an optional field when defining an rdd. KafkaRDD doesn't
> define one, so you can't really assume anything about the way
Hi Abhi,
You should be able to register a
org.apache.spark.streaming.scheduler.StreamingListener.
There is an example here that may help:
https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs
here,
http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SparkListe
Hi Abhi,
Sorry, that was the wrong link; it should have been the StreamingListener:
http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/scheduler/StreamingListener.html
The BatchInfo can be obtained from the event, for example:
public void onBatchSubmitted(StreamingListenerBatchSubmi
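For reference, a minimal Scala sketch of registering such a listener to capture the batch time (the Java API is analogous); the listener class name is made up here, and ssc is assumed to be an existing StreamingContext:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchSubmitted}

class BatchTimeListener extends StreamingListener {
  override def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted): Unit = {
    // batchTime is the same time shown in the "Active Batches" table of the Streaming UI
    val info = batchSubmitted.batchInfo
    println(s"Batch submitted with batch time ${info.batchTime}")
  }
}

// Register the listener before starting the context:
// ssc.addStreamingListener(new BatchTimeListener())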
Hello,
I am using Spark 1.4.1 with Zeppelin. When using the Kryo serializer
(spark.serializer = org.apache.spark.serializer.KryoSerializer)
instead of the default Java serializer, I am getting the following error. Is
this a known issue?
Thanks,
Piero
java.io.IOException: Failed to connect to
Hi,
Does the receiver-based approach lose any data in case of a leader/broker loss
in Spark Streaming? We currently use Kafka Direct for Spark Streaming, and it
seems to be failing when there is a leader loss, and we can't really
guarantee that there won't be any leader loss due to rebalancing.
If w
The direct stream shouldn't silently lose data in the case of a leader
loss. Loss of a leader is handled like any other failure, retrying up to
spark.task.maxFailures times.
But really if you're losing leaders and taking that long to rebalance you
should figure out what's wrong with your kaf
>
> Hi,
As a beginner, I have the below queries on Spork (Pig on Spark).
I have cloned the repo with git clone https://github.com/apache/pig -b spark .
1. On which versions of Pig and Spark is Spork being built?
2. I followed the steps mentioned in
https://issues.apache.org/jira/browse/PIG-4059 and tried to run
>>> Details at logfile: /home/pig/pig_1448425672112.log
You need to check the log file for details
On Wed, Nov 25, 2015 at 1:57 PM, Divya Gehlot
wrote:
> Hi,
>
>
> As a beginner ,I have below queries on Spork(Pig on Spark).
> I have cloned git clone https://github.com/apache/pig -b spark .
Log file contents:
Pig Stack Trace
---
ERROR 2998: Unhandled internal error. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope
I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster
with 16.73 TB of storage, using distcp. The dataset is a collection of tar
files of about 1.7 TB each.
Nothing else was stored in HDFS, but after completing the download, the
namenode page says that 11.59 TB are in use. W
What is your HDFS replication set to?
On Wed, Nov 25, 2015 at 1:31 AM, AlexG wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2
> cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored i
Hi AlexG:
Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 =
11.4 TB.
--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-e
Hi,
I am using Hive 1.1.0 and Spark 1.5.1 and creating the Hive context in
spark-shell.
Now I am seeing worse performance from Spark SQL than from Hive.
By default, Hive returns results in 27 seconds for a plain select * query on a
1 GB dataset containing 3623203 records, while spark-sql returns in
So basically writing them into a temporary directory named with the
batch time and then moving the files to their destination on success? I
wish there were a way to skip moving files around and be able to set
the output filenames.
Thanks Burak :)
-Michael
On Mon, Nov 23, 2015, at 09:19 PM, Bura
I tried increasing spark.shuffle.io.maxRetries to 10, but it didn't help.
This is the exception that I am getting:
[MySparkApplication] WARN : Failed to execute SQL statement select *
from TableS s join TableC c on s.property = c.property from X YZ
org.apache.spark.SparkException: Job aborted
First of all, select * is not a useful SQL query to evaluate. Very rarely would a
user require all 362K records for visual analysis.
Second, collect() forces movement of all data from the executors to the driver.
Instead, write it out to some other table or to HDFS.
Also, Spark is more beneficial when you ha
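As an illustration of writing the result out rather than calling collect() (a sketch only; the query, output path and sqlContext variable are assumptions based on the error message earlier in the thread):

import org.apache.spark.sql.SaveMode

val result = sqlContext.sql(
  "select s.* from TableS s join TableC c on s.property = c.property")

// Persist the join result instead of pulling it all to the driver with collect()
result.write.mode(SaveMode.Overwrite).parquet("hdfs:///tmp/join_output")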