Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread shahab
Thank you all for the comments, but my problem still exists. @Dean, @Ewan: yes, I do have the Hadoop file system installed and working. @Sujit: the latest version of EMR (version 4) does not need manual copying of the jar file to the server; the blog you pointed out refers to an older version (3.x) of EMR.

Re: Avoiding SQL Injection in Spark SQL

2015-09-10 Thread Sean Owen
I don't think this is Spark-specific. Mostly you need to escape / quote user-supplied values as with any SQL engine. On Thu, Sep 10, 2015 at 7:32 AM, V Dineshkumar wrote: > Hi, > > What is the preferred way of avoiding SQL Injection while using Spark SQL? > In our use case we have to take the par
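
A minimal sketch of the escaping Sean describes, in Scala against a SQLContext (the quoteLiteral helper and the users table are hypothetical; the DataFrame API mentioned later in the thread is usually the safer route):

    // Double any single quotes inside the user-supplied string before splicing it into SQL.
    def quoteLiteral(s: String): String = "'" + s.replace("'", "''") + "'"

    val userName = "O'Brien"  // untrusted input
    val query = s"SELECT * FROM users WHERE name = ${quoteLiteral(userName)}"
    val result = sqlContext.sql(query)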

Cassandra row count grouped by multiple columns

2015-09-10 Thread Chirag Dewan
Hi, I am using Spark 1.2.0 with Cassandra 2.0.14. I have a problem where I need a count of rows grouped by multiple columns. I have a column family with 3 columns, i.e. a, b, c, and for each distinct combination a1,b1,c1 I want the row count. For example: A1,B1,C1 A2,B2,C2 A3,B3,C2 A1,B1,C1 The output

Custom UDAF Evaluated Over Window

2015-09-10 Thread xander92
While testing out the new UserDefinedAggregateFunction in Spark 1.5.0, I successfully implemented a simple function to compute an average. I then tried to test this function by applying it over a simple window and I got an error saying that my function is not supported over window operation. So, i

Re: Perf impact of BlockManager byte[] copies

2015-09-10 Thread Reynold Xin
This is one problem I'd like to address soon - providing a binary block management interface for shuffle (and maybe other things) that avoids serialization/copying. On Fri, Feb 27, 2015 at 3:39 PM, Paul Wais wrote: > Dear List, > > I'm investigating some problems related to native code integrat

Re: Can not allocate executor when running spark on mesos

2015-09-10 Thread Iulian Dragoș
On Thu, Sep 10, 2015 at 3:35 AM, canan chen wrote: > Finally got the answer. Actually it works fine. The allocation behavior > on mesos is a little different from yarn/standalone. Seems the executor in > mesos is lazily allocated (only when job is executed) while executor in > yarn/standalone is

Re: spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-10 Thread Marius Soutier
Found an issue for this: https://issues.apache.org/jira/browse/SPARK-10251 > On 09.09.2015, at 18:00, Marius Soutier wrote: > > Hi all, > > as indicated in the title, I’m using Kryo with a custom Kryo serializer, but > as soon as I enable `sp
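
A rough sketch of the usual workaround when spark.kryo.registrationRequired is on: register the classes named in the error up front (the class list here is illustrative; SPARK-10251 tracks why Spark's own internals trip this check):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      // Register the classes reported as unregistered, e.g. Tuple2 and common array types.
      .registerKryoClasses(Array(classOf[Tuple2[Any, Any]], classOf[Array[String]]))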

spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
Hi, I am using data generated with spark-sql-perf (https://github.com/databricks/spark-sql-perf) to test Spark SQL performance (Spark on YARN, with 10 nodes) with the following code (the table store_sales is about 90 million records, 6G in size): val outputDir="hdfs://tmp/spark_perf/scaleFact

Terasort on spark -- Jmeter

2015-09-10 Thread Shreeharsha G Neelakantachar
Hi, I am trying to simulate multiple users/threads for Terasort on Spark 1.4.1 as part of studying its performance patterns. Please let me know of any JMeter plugins available for this, or which protocol I should use to record the Terasort execution flow. Any steps on using JMeter to si

Spark-shell throws Hive error when SQLContext.parquetFile, v1.3

2015-09-10 Thread Petr Novak
Hello, sqlContext.parquetFile(dir) throws exception " Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient" The strange thing is that on the second attempt to open the file it is successful: try { sqlContext.parquetFile(dir) } catch { case e: Exception => sqlCont

Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
Hi there, I have a problem fulfilling all my needs when using Spark Streaming on Kafka. Let me enumerate my requirements: 1. I want to have at-least-once/exactly-once processing. 2. I want my application to be tolerant of faults and of simple stops. The Kafka offsets need to be tracked between restart

Spark Streaming stop gracefully doesn't return to command line after upgrade to 1.4.0 and beyond

2015-09-10 Thread Petr Novak
Hello, my Spark Streaming v1.3.0 code uses sys.ShutdownHookThread { ssc.stop(stopSparkContext = true, stopGracefully = true) } so that Ctrl+C on the command line stops it. It returned to the command line after it finished the batch, but it doesn't with v1.4.0-v1.5.0. Was the behaviour or required cod

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Akhil Das
This consumer pretty much covers all those scenarios you listed: github.com/dibbhatt/kafka-spark-consumer. Give it a try. Thanks Best Regards On Thu, Sep 10, 2015 at 3:32 PM, Krzysztof Zarzycki wrote: > Hi there, > I have a problem with fulfilling all my needs when using Spark Streaming > on Kafk

Re: Spark-shell throws Hive error when SQLContext.parquetFile, v1.3

2015-09-10 Thread Cheng Lian
If you don't need to interact with Hive, you may compile Spark without using the -Phive flag to eliminate Hive dependencies. In this way, the sqlContext instance in Spark shell will be of type SQLContext instead of HiveContext. The reason behind the Hive metastore error is probably due to Hive

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
Thanks Akhil, seems like an interesting option to consider. Do you know if the package is production-ready? Do you use it in production? And do you know if it works for Spark 1.3.1 as well? README mentions that package in spark-packages.org is built with Spark 1.4.1. Anyway, it seems that core

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Dmitry Goldenberg
>> The whole point of checkpointing is to recover the *exact* computation where it left off. That makes sense. We were looking at the metadata checkpointing and the data checkpointing, and with data checkpointing, you can specify a checkpoint duration value. With the metadata checkpointing, there

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dibyendu Bhattacharya
Hi, this has been running in production in many organizations that have adopted this consumer as an alternative option. The consumer will run with Spark 1.3.1. It has been running in production at Pearson for some time. It is part of spark-packages, and you can see how to include it in your mvn

Re: Best way to import data from Oracle to Spark?

2015-09-10 Thread Sathish Kumaran Vairavelu
I guess a Data Pump export from Oracle could be a fast option. Hive now has an Oracle Data Pump SerDe: https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm On Wed, Sep 9, 2015 at 4:41 AM Reynold Xin wrote: > Using the JDBC data source is probably the best way. > http://spark.apache.org/do
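
For reference, a rough sketch of the JDBC data source route Reynold mentions, as it looks in Spark 1.4+ (the connection URL, table, credentials and driver settings below are made up, and the Oracle JDBC jar must be on the classpath):

    val oracleDF = sqlContext.read
      .format("jdbc")
      .options(Map(
        "url"      -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // hypothetical host/service
        "dbtable"  -> "SCOTT.SALES",                           // hypothetical schema.table
        "user"     -> "scott",
        "password" -> "tiger",
        "driver"   -> "oracle.jdbc.OracleDriver"))
      .load()

    oracleDF.registerTempTable("sales")  // then query it with Spark SQL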

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dmitry Goldenberg
>> checkpoints can't be used between controlled restarts Is that true? If so, why? From my testing, checkpoints appear to be working fine, we get the data we've missed between the time the consumer went down and the time we brought it back up. >> If I cannot make checkpoints between code upgrades

Random Forest MLlib

2015-09-10 Thread Yasemin Kaya
Hi, I am using the Random Forest algorithm for a recommendation system. I get users and users' responses, yes or no (1/0), but I want to learn the probability from the trees: the program says user x is a yes, but with what probability? I want to get these probabilities. Best, yasemin -- hiç ender hiç

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Cody Koeninger
Post the actual stacktrace you're getting On Thu, Sep 10, 2015 at 12:21 AM, Shushant Arora wrote: > Executors in spark streaming 1.3 fetch messages from kafka in batches and > what happens when executor takes longer time to complete a fetch batch > > say in > > > directKafkaStream.foreachRDD(new

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Cody Koeninger
The kafka direct stream meets those requirements. You don't need checkpointing for exactly-once. Indeed, unless your output operations are idempotent, you can't get exactly-once if you're relying on checkpointing. Instead, you need to store the offsets atomically in the same transaction as your r
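
A bare-bones sketch of that pattern, assuming a (String, String) direct stream as elsewhere in this thread; the "one transaction" part is only described in comments because it depends entirely on your datastore:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directKafkaStream.foreachRDD { rdd =>
      // Kafka offsets covered by this batch; only valid on the RDD taken directly from the stream.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Stand-in for real work: count values in this batch.
      val counts = rdd.map(_._2).countByValue()

      // In a real job, write `counts` and `offsetRanges` to the same store inside one transaction,
      // so that restarting from the stored offsets never double-applies a batch.
      offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}"))
    }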

Re: Random Forest MLlib

2015-09-10 Thread Maximo Gurmendez
Hi Yasemin, We had the same question and found this: https://issues.apache.org/jira/browse/SPARK-6884 Thanks, Maximo On Sep 10, 2015, at 9:09 AM, Yasemin Kaya wrote: Hi , I am using Random Forest Alg. for recommendation system. I get users and users' response

pyspark driver in cluster rather than gateway/client

2015-09-10 Thread roy
Hi, is there any way to make the Spark driver run inside YARN containers rather than on the gateway/client machine? At present, even with the config parameters --master yarn & --deploy-mode cluster, the driver runs on the gateway/client machine. We are on CDH 5.4.1 with YARN and Spark 1.3. Any help on this? Th

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
Thanks guys for your answers. I put my answers in text, below. Cheers, Krzysztof Zarzycki 2015-09-10 15:39 GMT+02:00 Cody Koeninger : > The kafka direct stream meets those requirements. You don't need > checkpointing for exactly-once. Indeed, unless your output operations are > idempotent, you

Re: Tungsten and Spark Streaming

2015-09-10 Thread Gurvinder Singh
On 09/10/2015 07:42 AM, Tathagata Das wrote: > Rewriting is necessary. You will have to convert RDD/DStream operations > to DataFrame operations. So get the RDDs in DStream, using > transform/foreachRDD, convert to DataFrames and then do DataFrame > operations. Are there any plans for 1.6 or later

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Cody Koeninger
You have to store offsets somewhere. If you're going to store them in checkpoints, then you have to deal with the fact that checkpoints aren't recoverable on code change. Starting up the new version helps because you don't start it from the same checkpoint directory as the running one... it has y

How to enable Tungsten in Spark 1.5 for Spark SQL?

2015-09-10 Thread unk1102
Hi, Spark 1.5 looks promising. How do we enable Project Tungsten for Spark SQL, or is it enabled by default? Please guide. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-Tungsten-in-Spark-1-5-for-Spark-SQL-tp24642.html Sent from the Apache Spark

Re: Tungsten and Spark Streaming

2015-09-10 Thread Todd Nist
https://issues.apache.org/jira/browse/SPARK-8360?jql=project%20%3D%20SPARK%20AND%20text%20~%20Streaming -Todd On Thu, Sep 10, 2015 at 10:22 AM, Gurvinder Singh < gurvinder.si...@uninett.no> wrote: > On 09/10/2015 07:42 AM, Tathagata Das wrote: > > Rewriting is necessary. You will have to convert

Re: How to enable Tungsten in Spark 1.5 for Spark SQL?

2015-09-10 Thread Ted Yu
Please see the following in sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala : val TUNGSTEN_ENABLED = booleanConf("spark.sql.tungsten.enabled", defaultValue = Some(true), doc = "When true, use the optimized Tungsten physical execution backend which explicitly " + "man
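
So it is on by default in 1.5; to flip it explicitly on an existing SQLContext (for example to compare a query with and without the Tungsten backend), something like this should do:

    sqlContext.setConf("spark.sql.tungsten.enabled", "true")   // the 1.5 default
    // ...run the query, then repeat with:
    sqlContext.setConf("spark.sql.tungsten.enabled", "false")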

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Shushant Arora
Stack trace is 15/09/09 22:49:52 ERROR kafka.KafkaRDD: Lost leader for topic topicname partition 99, sleeping for 200ms kafka.common.NotLeaderForPartitionException at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessor

Re: Random Forest MLlib

2015-09-10 Thread Yasemin Kaya
Hi Maximo, Thanks alot.. Hi Yasemin, We had the same question and found this: https://issues.apache.org/jira/browse/SPARK-6884 Thanks, Maximo On Sep 10, 2015, at 9:09 AM, Yasemin Kaya wrote: Hi , I am using Random Forest Alg. for recommendation system. I get users and users' response ye

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Cody Koeninger
NotLeaderForPartitionException means you lost a Kafka broker or had a rebalance... why did you say "I am getting Connection timeout in my code."? You've asked questions about this exact same situation before; the answer remains the same. On Thu, Sep 10, 2015 at 9:44 AM, Shushant Arora wrote: > St

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dibyendu Bhattacharya
Hi, Just to clarify one point which may not be clear to many. If someone decides to use Receiver based approach , the best options at this point is to use https://github.com/dibbhatt/kafka-spark-consumer. This will also work with WAL like any other receiver based consumer. The major issue with K

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Shushant Arora
My bad, I got that exception in the driver code of the same job, not in the executor. But it only reports a socket-close exception: org.apache.spark.SparkException: ArrayBuffer(java.io.EOFException: Received -1 when reading from channel, socket has likely been closed., org.apache.spark.SparkException: Couldn't fi

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
Thanks Cody for your answers. This discussion helped me a lot. Though, I still feel that offset management could be better handled by Spark, if it really wants to be an easy streaming framework. If it won't help users do so, I'm afraid it will be superseded for many by other frameworks that might

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Cody Koeninger
Again, that looks like you lost a kafka broker. Executors will retry failed tasks automatically up to the max failures. spark.streaming.kafka.maxRetries controls the number of times the driver will retry when attempting to get offsets. If your broker isn't up / rebalance hasn't finished after N

Re: How to enable Tungsten in Spark 1.5 for Spark SQL?

2015-09-10 Thread Umesh Kacha
Nice, Ted, thanks much. Highest performance without any configuration changes; amazed! Looking forward to running Spark 1.5 on my 2 TB of skewed data, which involves group by, union, etc. Any other tips for Spark 1.5 if you know of any? On Sep 10, 2015 8:12 PM, "Ted Yu" wrote: > Please see the following > in sql/c

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Cody Koeninger
There is no free lunch. (TD, when do we get shirts made that say that?) If you want exactly-once delivery semantics for arbitrary workflows into arbitrary datastores, you're going to have to do some of your own work. If someone is telling you otherwise, they're probably lying to you. I think wr

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
great. so, provided that *model.theta* represents the log-probabilities (and hence the result of *brzPi + brzTheta * testData.toBreeze* is a big number too), how can I get back the *non*-log-probabilities, which - apparently - are bounded between *0.0 and 1.0*? *// Adamantios* On Tue, Sep 1, 2

Re: ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-10 Thread Ashish Shenoy
I am using spark-1.4.1 Here's the skeleton code: JavaPairRDD rddPair = rdd.repartitionAndSortWithinPartitions( new CustomPartitioner(), new ExportObjectComparator()) .persist(StorageLevel.MEMORY_AND_DISK_SER()); ... @SuppressWarnings("serial") private static class CustomPartitioner exte

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread Work
Ewan, What issue are you having with HDFS when only Spark is installed? I'm not aware of any issue like this. Thanks,  Jonathan — Sent from Mailbox On Wed, Sep 9, 2015 at 11:48 PM, Ewan Leith wrote: > The last time I checked, if you launch EMR 4 with only Spark selected as an > app

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Sean Owen
The log probabilities are unlikely to be very large, though the probabilities may be very small. The direct answer is to exponentiate brzPi + brzTheta * testData.toBreeze -- apply exp(x). I have forgotten whether the probabilities are normalized already though. If not you'll have to normalize to g

Re: Spark-shell throws Hive error when SQLContext.parquetFile, v1.3

2015-09-10 Thread Mohammad Islam
In addition to Cheng's comment -- I found a similar problem when hive-site.xml is not on the classpath. A proper stack trace can pinpoint the problem. In the meantime, you can add it to your environment through HADOOP_CLASSPATH. (export HADOOP_CONF_DIR=/etc/hive/conf/) See more at http:

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
Thanks Sean. As far as I can see, the probabilities are NOT normalized; the denominator isn't implemented in either v1.1.0 or v1.5.0 (by denominator, I refer to the probability of feature X). So, for a given lambda, how do I compute the denominator? FYI: https://github.com/apache/spark/blob/v1.5.0/mllib/src/mai

Spark task hangs infinitely when accessing S3

2015-09-10 Thread Mario Pastorelli
Dear community, I am facing a problem accessing data on S3 via Spark. My current configuration is the following: - Spark 1.4.1 - Hadoop 2.7.1 - hadoop-aws-2.7.1 - mesos 0.22.1 I am accessing the data using the s3a protocol but it just hangs. The job runs through the whole data set but systematica

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Sean Owen
Yes, https://github.com/apache/spark/blob/v1.5.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala#L158 is the method you are interested in. It does normalize the probabilities and return them to non-log-space. So you can use predictProbabilities to get the actual posteri
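
A small sketch of the predictProbabilities call Sean points to, on Spark 1.5 MLlib (the toy training data is made up):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))
    val model = NaiveBayes.train(training, lambda = 1.0)

    // Per-class posteriors, already normalized and back out of log space.
    println(model.predictProbabilities(Vectors.dense(0.9, 0.1)))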

Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-10 Thread Tom Waterhouse (tomwater)
After spending most of yesterday scouring the Internet for documentation on submitting Spark jobs in cluster mode to a Spark cluster managed by Mesos, I was able to do just that, but I am not convinced that how I have things set up is correct. I used the Mesos published

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-10 Thread Tim Chen
Hi Tom, sorry the documentation isn't really rich; it's probably assuming users understand how Mesos and frameworks work. First I need to explain the rationale for creating the dispatcher. If you're not familiar with Mesos yet, each node in your datacenter has a Mesos slave installed, where it

Re: ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-10 Thread Ted Yu
Here is snippet of ExternalSorter.scala where ArrayIndexOutOfBoundsException was thrown: while (iterator.hasNext) { val partitionId = iterator.nextPartition() iterator.writeNext(partitionWriters(partitionId)) } Meaning, partitionId was negative. Execute the following and examin

connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-10 Thread roni
I have Spark installed on an EC2 cluster. Can I connect to that from my local sparkR in RStudio? If yes, how? Can I read files which I have saved as parquet files on HDFS or S3 in sparkR? If yes, how? Thanks -Roni

Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Michael Armbrust
I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so this is surprising. In my experiments Spark 1.5 is either the same or faster than 1.4 with only small exceptions. A few thoughts, - 600 partitions is probably way too many for 6G of data. - Providing the output of ex

Re: Creating Parquet external table using HiveContext API

2015-09-10 Thread Michael Armbrust
Easiest is to just use SQL: hiveContext.sql("CREATE TABLE USING parquet OPTIONS (path '')") When you specify the path its automatically created as an external table. The schema will be discovered. On Wed, Sep 9, 2015 at 9:33 PM, Mohammad Islam wrote: > Hi, > I want to create an external hive
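
Filled in with hypothetical names, that looks something like the following; the schema is discovered from the Parquet files themselves:

    hiveContext.sql(
      "CREATE TABLE events USING parquet OPTIONS (path 'hdfs:///data/warehouse/events')")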

Re: Avoiding SQL Injection in Spark SQL

2015-09-10 Thread Michael Armbrust
Either that or use the DataFrame API, which directly constructs query plans and thus doesn't suffer from injection attacks (and runs on the same execution engine). On Thu, Sep 10, 2015 at 12:10 AM, Sean Owen wrote: > I don't think this is Spark-specific. Mostly you need to escape / > quote user-
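
A minimal example of the DataFrame route (df and the column name are hypothetical); the user-supplied value becomes a literal in the query plan rather than a fragment of SQL text:

    val userInput = "Robert'); DROP TABLE students;--"   // hostile input is treated purely as data
    val matched = df.filter(df("name") === userInput)
    matched.show()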

Re: Custom UDAF Evaluated Over Window

2015-09-10 Thread Michael Armbrust
The only way to do this today is to write it as a Hive UDAF. We hope to improve the window functions to use our native aggregation in a future release. On Thu, Sep 10, 2015 at 12:26 AM, xander92 wrote: > While testing out the new UserDefinedAggregateFunction in Spark 1.5.0, I > successfully imp

Re: Creating Parquet external table using HiveContext API

2015-09-10 Thread Mohammad Islam
Thanks a lot Michael for giving a solution. If I want to provide my own schema, can I do that? On Thursday, September 10, 2015 11:05 AM, Michael Armbrust wrote: Easiest is to just use SQL: hiveContext.sql("CREATE TABLE USING parquet OPTIONS (path '')") When you specify the path i

reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread roni
I am trying this - ddf <- parquetFile(sqlContext, "hdfs:// ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet") and I get path[1]="hdfs:// ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet": No such file or directory. When I read a file on S3, I get -

java.lang.NullPointerException with Twitter API

2015-09-10 Thread Jo Sunad
Hello! I am trying to customize the Twitter Example TD did by only printing messages that have a GeoLocation. I am getting a NullPointerException: java.lang.NullPointerException at Twitter$$anonfun$1.apply(Twitter.scala:64) at Twitter$$anonfun$1.apply(Twitter.scala:64) at

Sprk RDD : want to combine elements that have approx same keys

2015-09-10 Thread prateek arora
Hi, in my scenario I have an RDD with key/value pairs. I want to combine elements that have approximately the same keys, like (144,value)(143,value)(142,value)...(214,value)(213,value)(212,value)(313,value)(314,value)... I want to combine elements that have keys 144, 143, 142..., meaning keys have +-2 r

Re: Spark Streaming stop gracefully doesn't return to command line after upgrade to 1.4.0 and beyond

2015-09-10 Thread Tathagata Das
Spark 1.4.0 introduced built-in shutdown hooks that shut down the StreamingContext and SparkContext (similar to yours). If you are also introducing your own shutdown hook, I wonder what the behavior is going to be. Try doing a jstack to see where the system is stuck. Alternatively, remove your shutdown
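
If the custom hook is removed, the built-in hook can be asked to stop gracefully via configuration; the flag below should exist from 1.4 on, but treat it as an assumption and check it against your version's docs:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-app")                            // hypothetical app name
      .set("spark.streaming.stopGracefullyOnShutdown", "true")   // assumed flag: graceful stop in Spark's own hook
    val ssc = new StreamingContext(conf, Seconds(10))
    // No sys.ShutdownHookThread needed; Ctrl+C goes through Spark's hook.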

Re: Re: Failed when starting Spark 1.5.0 standalone cluster

2015-09-10 Thread Adam Hunt
You can add it to conf/spark-env.sh. $ cat conf/spark-env.sh #!/usr/bin/env bash JAVA_HOME=/app/tools/jdk1.7 PATH=$JAVA_HOME/bin:$PATH MESOS_NATIVE_JAVA_LIBRARY="/usr/lib/libmesos.so" SPARK_CLASSPATH="/opt/mapr/hadoop/hadoop-0.20.2/lib/amazon-s3.jar" On Wed, Sep 9, 2015 at 10:25 PM, Netwaver

Re: ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-10 Thread Ashish Shenoy
Yup thanks Ted. My getPartition() method had a bug where a signed int was being moduloed with the number of partitions. Fixed that. Thanks, Ashish On Thu, Sep 10, 2015 at 10:44 AM, Ted Yu wrote: > Here is snippet of ExternalSorter.scala where ArrayIndexOutOfBoundsException > was thrown: > >
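
For anyone hitting the same thing, a minimal sketch of a getPartition() that stays non-negative even for negative hash codes (the class name is illustrative, not the one from the original code):

    import org.apache.spark.Partitioner

    class NonNegativePartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val h = key.hashCode % numPartitions
        if (h < 0) h + numPartitions else h   // Java's % can yield a negative remainder
      }
    }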

Re: ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-10 Thread Ted Yu
Created https://github.com/apache/spark/pull/8703 to make exception message more helpful. On Thu, Sep 10, 2015 at 1:24 PM, Ashish Shenoy wrote: > Yup thanks Ted. My getPartition() method had a bug where a signed int was > being moduloed with the number of partitions. Fixed that. > > Thanks, > As

How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
I have invoked mvn test with the -DwildcardSuites option to specify a single BinarizerSuite scalatest suite. The command line is mvn -pl mllib -Pyarn -Phadoop-2.6 -Dhadoop2.7.1 -Dscala-2.11 -Dmaven.javadoc.skip=true -DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite test The scala

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Tathagata Das
The metadata checkpointing interval does not really affect performance, so I didn't expose any way to control that interval. The data checkpointing interval actually affects performance, hence that interval is configurable. On Thu, Sep 10, 2015 at 5:45 AM, Dmitry Goldenberg wrote: > >> The w
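
For concreteness, the two knobs look roughly like this (the directory, the interval and the statefulStream DStream are illustrative):

    import org.apache.spark.streaming.Seconds

    // Metadata checkpointing: only the directory is configurable.
    ssc.checkpoint("hdfs:///checkpoints/my-app")

    // Data checkpointing: the interval is set per DStream, commonly 5-10x the batch interval.
    statefulStream.checkpoint(Seconds(50))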

Re: How to restrict java unit tests from the maven command line

2015-09-10 Thread Sean Owen
-Dtest=none ? https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests On Thu, Sep 10, 2015 at 10:39 PM, Stephen Boesch wrote: > > I have invoked mvn test with the -DwildcardSuites option to specify a single > BinarizerSuite scalatest s

Re: pyspark driver in cluster rather than gateway/client

2015-09-10 Thread Davies Liu
The YARN cluster mode for PySpark is supported since Spark 1.4: https://issues.apache.org/jira/browse/SPARK-5162?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22python%20cluster%22 On Thu, Sep 10, 2015 at 6:54 AM, roy wrote: > Hi, > > Is there any way to make spark driver to run in side YARN co

Re: How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
Yes, adding that flag does the trick. thanks. 2015-09-10 13:47 GMT-07:00 Sean Owen : > -Dtest=none ? > > > https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests > > On Thu, Sep 10, 2015 at 10:39 PM, Stephen Boesch > wrote: > > > > I

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2015-09-10 Thread swetha
Hi, how is the ContextCleaner different from spark.cleaner.ttl? Is spark.cleaner.ttl still needed when there is a ContextCleaner in the Streaming job? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-variable-get-cleaned-by-ContextCleaner-unexpect

Re: Sprk RDD : want to combine elements that have approx same keys

2015-09-10 Thread sethah
If you want each key to be combined only once, you can just create a mapping of keys to a reduced key space. Something like this val data = sc.parallelize(Array((0,0.030513227), (1,0.11088216), (2,0.69165534), (3,0.78524816), (4,0.8516909), (5,0.37751913), (6,0.05674714), (7,0.27523404), (8,0.4082
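
Continuing that idea, a rough Scala sketch using the sample data above: collapse each key onto a bucket of width 5 so nearby keys combine (keys close to a bucket boundary will land in the neighbouring bucket, so this is only an approximation of "+-2"):

    val data = sc.parallelize(Array((0, 0.030513227), (1, 0.11088216), (2, 0.69165534),
      (3, 0.78524816), (4, 0.8516909), (5, 0.37751913), (6, 0.05674714), (7, 0.27523404)))

    val width = 5
    val combined = data
      .map { case (k, v) => (k / width, v) }   // reduced key space: 0..4 -> 0, 5..9 -> 1, ...
      .reduceByKey(_ + _)                      // or whatever combine function you need

    combined.collect().foreach(println)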

Spark UI keeps redirecting to /null and returns 500

2015-09-10 Thread rajeevpra
Hi All, I am having a problem accessing the Spark UI while running in spark-client mode. It works fine in local mode. It keeps redirecting back to itself by adding /null at the end, ultimately exceeds the URL size limit, and returns 500. Look at the response below. I have a feeling that I might b

Re: Spark on Yarn vs Standalone

2015-09-10 Thread Sandy Ryza
YARN will never kill processes for being unresponsive. It may kill processes for occupying more memory than it allows. To get around this, you can either bump spark.yarn.executor.memoryOverhead or turn off the memory checks entirely with yarn.nodemanager.pmem-check-enabled. -Sandy On Tue, Sep 8
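
For example, bumping the overhead on the Spark side (the value is in megabytes and is only a starting point); the pmem check itself is a YARN setting and lives in yarn-site.xml, not in Spark conf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")   // MB reserved on top of the executor heap
    // yarn.nodemanager.pmem-check-enabled=false would instead go into yarn-site.xml.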

Spark UI keeps redirecting to /null and returns 500

2015-09-10 Thread Rajeev Prasad
I am having a problem accessing the Spark UI while running in spark-client mode. It works fine in local mode. It keeps redirecting back to itself by adding /null at the end, ultimately exceeds the URL size limit, and returns 500. Look at the response below. I have a feeling that I might be missing

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-10 Thread Ricardo Luis Silva Paiva
Hi guys, I tried to use the configuration file, but it didn't work as I expected. As part of the Spark Streaming flow, my methods run only when the application is started the first time. Once I restart the app, it reads from the checkpoint and all the dstream operations come from the cache. No par

Spark based Kafka Producer

2015-09-10 Thread Atul Kulkarni
Hi Folks, below is the code I have for a Spark-based Kafka producer that takes advantage of multiple executors reading files in parallel on my cluster, but I am stuck: the program is not making any progress. Below is my scrubbed code: val sparkConf = new SparkConf().setAppName(applicationName) val ssc =

Re: Avoiding SQL Injection in Spark SQL

2015-09-10 Thread Ruslan Dautkhanov
Using dataframe API is a good workaround. Another way would be to use bind variables. I don't think Spark SQL supports them. That's what Dinesh probably meant by "was not able to find any API for preparing the SQL statement safely avoiding injection". E.g. val sql_handler = sqlContext.sql("SELEC

Re: Best way to import data from Oracle to Spark?

2015-09-10 Thread Ruslan Dautkhanov
Sathish, Thanks for pointing to that. https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm That must be only part of Oracle's BDA codebase, not open-source Hive, right? -- Ruslan Dautkhanov On Thu, Sep 10, 2015 at 6:59 AM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wr

Re: about mr-style merge sort

2015-09-10 Thread 周千昊
Hi all, can anyone give some tips about this issue? 周千昊 wrote on Tue, Sep 8, 2015 at 4:46 PM: > Hi, community > I have an application which I am trying to migrate from MR to Spark. > It does some calculations on Hive data and outputs HFiles which will > be bulk loaded into an HBase table, details as follows:

Re: about mr-style merge sort

2015-09-10 Thread Raghavendra Pandey
In MR jobs, the output is sorted only within each reducer. That can be better emulated by sorting each partition of the RDD rather than doing a total sort of the RDD. In RDD.mapPartitions you can sort the data within one partition and try... On Sep 11, 2015 7:36 AM, "周千昊" wrote: > Hi, all > Can anyone give some

Re: Spark based Kafka Producer

2015-09-10 Thread Raghavendra Pandey
What is the value of the spark master conf? By default it is local, which means only one thread can run, and that is why your job is stuck. Specify it as local[*] to make the thread pool equal to the number of cores... Raghav On Sep 11, 2015 6:06 AM, "Atul Kulkarni" wrote: > Hi Folks, > > Below is the code ha
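
In code that would look roughly like this (applicationName is taken from the original snippet); with a receiver-based source, a bare "local" master leaves no thread free for processing:

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .setAppName(applicationName)
      .setMaster("local[*]")   // one thread per local core instead of a single thread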

Re: How to keep history of streaming statistics

2015-09-10 Thread b.bhavesh
Hi Himanshu Mehra, thanks for the reply. I am running a Spark standalone cluster. I have already set the property for logging events to the history server, as you mentioned, and I have also started the history server. I am running my code with awaitTermination(), so it never goes to completed jobs. Howe

RE: Maintaining Kafka Direct API Offsets

2015-09-10 Thread Samya
Thanks Ameya. From: ameya [via Apache Spark User List] [mailto:ml-node+s1001560n24650...@n3.nabble.com] Sent: Friday, September 11, 2015 4:12 AM To: Samya MAITI Subject: Re: Maintaining Kafka Direct API Offsets So I added something like this: Runtime.getRuntime().addShutdownHook(new Thread()

RE: reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread Sun, Rui
Hi, Roni, For parquetFile(), it is just a warning, you can get the DataFrame successfully, right? It is a bug has been fixed in the latest repo: https://issues.apache.org/jira/browse/SPARK-8952 For S3, it is not related to SparkR. I guess it is related to http://stackoverflow.com/questions/280

Re:Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
Thanks Michael for the reply. Below is the sql plan for 1.5 and 1.4. 1.5 is using SortMergeJoin, while 1.4.1 is using shuffled hash join. In this case, it seems hash join performs better than sort join.

Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Jesse F Chen
Could this be a build issue (i.e., sbt package)? If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression in queries too (all other things identical)... I am curious: to build against 1.5 (when it isn't released yet), what do I need to do with the build.sbt file? Any special

Re: Spark based Kafka Producer

2015-09-10 Thread Atul Kulkarni
I am submitting the job with yarn-cluster mode. spark-submit --master yarn-cluster ... On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > What is the value of spark master conf.. By default it is local, that > means only one thread can run and that is wh

Re: about mr-style merge sort

2015-09-10 Thread Saisai Shao
Hi Qianhao, I think you could sort the data yourself if you want to achieve the same result as MR, like rdd.reduceByKey(...).mapPartitions(// sort within each partition). Do not call sortByKey again since it will introduce another shuffle (that's the reason why it is slower than MR). The problem
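
Spelled out, that suggestion is roughly the following, assuming key/value pairs whose keys have an Ordering (e.g. String keys and numeric values):

    val output = rdd
      .reduceByKey(_ + _)
      .mapPartitions(iter => iter.toList.sortBy(_._1).iterator,   // sort within each partition only
        preservesPartitioning = true)                             // no extra shuffle, unlike sortByKey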

How to create combine DAG visualization?

2015-09-10 Thread b.bhavesh
Hi, how can I create a combined DAG visualization of pyspark code instead of separate DAGs for jobs and stages? Thanks b.bhavesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-combine-DAG-visualization-tp24653.html Sent from the Apache Spark Us

Re: about mr-style merge sort

2015-09-10 Thread 周千昊
Hi Shao & Pandey, thanks for the tips. I will try to work around this. Saisai Shao wrote on Fri, Sep 11, 2015 at 1:23 PM: > Hi Qianhao, > > I think you could sort the data by yourself if you want achieve the same > result as MR, like rdd.reduceByKey(...).mapPartitions(// sort within each > partition). Do not

RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Cheng, Hao
This is not a big surprise the SMJ is slower than the HashJoin, as we do not fully utilize the sorting yet, more details can be found at https://issues.apache.org/jira/browse/SPARK-2926 . Anyway, can you disable the sort merge join by “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and r

Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
Thanks Hao for the reply. I turn the merge sort join off, the physical plan is below, but the performance is roughly the same as it on... == Physical Plan == TungstenProject [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0] ShuffledHas

Data lost in spark streaming

2015-09-10 Thread Bin Wang
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data it has received. But today the historical data in the DStream seems to have been lost suddenly, and the application UI also lost the streaming processing time and all the related data. Could anyone give some hints on debugging this? Thanks.

RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Cheng, Hao
You mean the performance is still as slow as with the SMJ in Spark 1.5? Can you set spark.shuffle.reduceLocality.enabled=false when you start the spark-shell/spark-sql? It's a new feature in Spark 1.5, and it's true by default, but we found it probably causes performance to drop dramatically. F
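
For a self-contained program rather than the shells, the two settings would be applied roughly like this; note the shuffle option is a core setting and has to be in place before the SparkContext starts:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("tpcds-compare")                            // hypothetical app name
      .set("spark.shuffle.reduceLocality.enabled", "false")   // new in 1.5, default true
    val sc = new SparkContext(conf)

    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")   // fall back to shuffled hash join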

Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
Yes... At 2015-09-11 14:34:46, "Cheng, Hao" wrote: You mean the performance is still slow as the SMJ in Spark 1.5? Can you set the spark.shuffle.reduceLocality.enabled=false when you start the spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by default, but we found

Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
Thanks Hao. Yes, it is still as slow as with SMJ. Let me try the option you suggested. At 2015-09-11 14:34:46, "Cheng, Hao" wrote: You mean the performance is still slow as the SMJ in Spark 1.5? Can you set the spark.shuffle.reduceLocality.enabled=false when you start the spark-shell/spark-sql?