Re: value reduceByKeyAndWindow is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2015-04-13 Thread Akhil Das
Just make sure you import the following: import org.apache.spark.SparkContext._ import org.apache.spark.streaming.StreamingContext._ Thanks Best Regards On Wed, Apr 8, 2015 at 6:38 AM, Su She wrote: > Hello Everyone, > > I am trying to implement this example (Spark Streaming with Twitter). > > > http
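A minimal sketch of the fix above (Spark 1.x Scala; the stream name `wordCounts` and the window/slide durations are placeholders, not from the original thread):

    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.Seconds

    // With the implicits in scope, pair-DStream methods such as reduceByKeyAndWindow resolve
    // on a DStream[(String, Int)].
    val windowed = wordCounts.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))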

Re: set spark.storage.memoryFraction to 0 when no cached RDD and memory area for broadcast value?

2015-04-13 Thread Akhil Das
You could try leaving all the configuration values at their defaults and running your application to see if you are still hitting the heap issue. If so, try adding swap space to the machines, which will definitely help. Another way would be to set the heap space manually (export _JAVA_OPTIONS="-Xmx5g")

Re: Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-13 Thread Akhil Das
When you say "done fetching documents", do you mean that you are stopping the streamingContext (ssc.stop), or that you have completed fetching documents for a batch? If possible, you could paste your custom receiver code so that we can have a look at it. Thanks Best Regards On Tue, Apr 7, 2015 at 8:

Re: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Justin Yip
That explains it. Thanks Reynold. Justin On Mon, Apr 13, 2015 at 11:26 PM, Reynold Xin wrote: > I think what happened was applying the narrowest possible type. Type > widening is required, and as a result, the narrowest type is string between > a string and an int. > > > https://github.com/apac

Re: SparkSQL + Parquet performance

2015-04-13 Thread Akhil Das
That totally depends on your disk IO and the number of CPUs that you have in the cluster. For example, if you have disk IO of 100 MB/s and a handful of CPUs (say 40 cores on 10 machines), then it could take you to ~1 GB/s, I believe. Thanks Best Regards On Tue, Apr 7, 2015 at 2:48 AM, P

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-04-13 Thread Akhil Das
One hack you can put in would be to bring the Result class locally, make it serializable (implements Serializable) and use it. Tha

Re: [Spark1.3] UDF registration issue

2015-04-13 Thread Reynold Xin
You can do this: strLen = udf((s: String) => s.length()) cleanProcessDF.withColumn("dii",strLen(col("di"))) (You might need to play with the type signature a little bit to get it to compile) On Fri, Apr 10, 2015 at 11:30 AM, Yana Kadiyska wrote: > Hi, I'm running into some trouble trying to r
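A hedged sketch of the suggestion above (Spark 1.3+ Scala; the DataFrame and column names are the ones from the thread, everything else is illustrative):

    import org.apache.spark.sql.functions.{col, udf}

    // Define the UDF as a plain Scala function and apply it with withColumn.
    val strLen = udf((s: String) => s.length)
    val withLen = cleanProcessDF.withColumn("dii", strLen(col("di")))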

Re: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Reynold Xin
I think what happened was applying the narrowest possible type. Type widening is required, and as a result, the narrowest type is string between a string and an int. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scal

Re: RDD generated on every query

2015-04-13 Thread twinkle sachdeva
Hi, If you have the same Spark context, then you can cache the query result by caching the table (sqlContext.cacheTable("tableName")). Maybe you can have a look at the Ooyala job server also. On Tue, Apr 14, 2015 at 11:36 AM, Akhil Das wrote: > You can use a tachyon based storage for that and eve

Re: RDD generated on every query

2015-04-13 Thread Akhil Das
You can use Tachyon-based storage for that, and every time the client queries, you just get it from there. Thanks Best Regards On Mon, Apr 6, 2015 at 6:01 PM, Siddharth Ubale wrote: > Hi , > > In Spark Web Application the RDD is generating every time client is > sending a query request. Is

Fwd: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with:
1. case class A(a: Int, b: String)
   scala> adf.printSchema()
   root
    |-- a: integer (nullable = false)
    |-- b: string (nullable = true)
2. case class B(a: String, c: Int)
   scala> bdf.printSchema()
   root
    |-- a: string
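A small sketch of the behaviour discussed in this thread (Spark 1.3 Scala); the expected output schema follows Reynold's explanation earlier in this digest that the types are widened to string:

    import sqlContext.implicits._

    case class A(a: Int, b: String)
    case class B(a: String, c: Int)

    val adf = Seq(A(1, "x")).toDF()
    val bdf = Seq(B("y", 2)).toDF()

    // Union the column whose type differs; the analyzer widens Int/String to String.
    adf.select("a").unionAll(bdf.select("a")).printSchema()
    // expected:
    // root
    //  |-- a: string (nullable = true)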

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-04-13 Thread sachin Singh
Hi Linlin, have you got the solution for this issue? If yes, what needed to be corrected? I am also getting the same error when submitting a spark job in cluster mode - 2015-04-14 18:16:43 DEBUG Transaction - Transaction rolled back in 0 ms 2015-04-14 18:16:4

Re: [Spark1.3] UDF registration issue

2015-04-13 Thread Takeshi Yamamuro
Hi, It's a syntax error in Spark 1.3. The next release of Spark supports this kind of UDF call on DataFrames. See the link below. https://issues.apache.org/jira/browse/SPARK-6379 On Sat, Apr 11, 2015 at 3:30 AM, Yana Kadiyska wrote: > Hi, I'm running into some trouble trying to register a UDF: >

How can I add my custom Rule to spark sql?

2015-04-13 Thread Andy Zhao
Hi guys, I want to add my custom Rules (whatever the rule is) when the SQL logical plan is being analysed. Is there a way to do that in the Spark application code? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-add-my-custom-Rule-to-spark

Re: [GraphX] aggregateMessages with active set

2015-04-13 Thread James
Hello, Many thanks for your reply. From the code I found that the reason my program scans all the edges is because the EdgeDirection I passed in is EdgeDirection.Either. However I still see the problem that "the time consumed by each iteration does not decrease over time". Thus I have two

Re: Understanding Spark Memory distribution

2015-04-13 Thread Imran Rashid
Broadcast variables count towards "spark.storage.memoryFraction", so they use the same "pool" of memory as cached RDDs. That being said, I'm really not sure why you are running into problems; it seems like you have plenty of memory available. Most likely it's got nothing to do with broadcast varia

Re: Registering classes with KryoSerializer

2015-04-13 Thread Imran Rashid
Those funny class names come from Scala's specialization -- it's compiling a different version of OpenHashMap for each primitive you stick in the type parameter. Here's a super simple example: ➜ ~ more Foo.scala class Foo[@specialized X] ➜ ~ scalac Foo.scala ➜ ~ ls Foo*.cl

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Vadim Bichutskiy
Thanks Mike. I was having trouble on EC2. > On Apr 13, 2015, at 10:25 PM, Mike Trienis wrote: > > Thanks Vadim, I can certainly consume data from a Kinesis stream when running > locally. I'm currently in the processes of extending my work to a proper > cluster (i.e. using a spark-submit job vi

Re: How to get rdd count() without double evaluation of the RDD?

2015-04-13 Thread Imran Rashid
yes, it sounds like a good use of an accumulator to me val counts = sc.accumulator(0L); rdd.map { x => counts += 1; x }.saveAsObjectFile(file2) On Mon, Mar 30, 2015 at 12:08 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > Sean > > > > Yes I know that I can use persist() to
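Spelled out as a sketch (the names `rdd` and `file2` are from the thread; note the accumulator value is only meaningful after the action finishes, and task retries can inflate it):

    val counts = sc.accumulator(0L)
    rdd.map { x =>
      counts += 1          // count each record as it streams through
      x
    }.saveAsObjectFile(file2)
    println(s"records written: ${counts.value}")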

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
Thanks Vadim, I can certainly consume data from a Kinesis stream when running locally. I'm currently in the processes of extending my work to a proper cluster (i.e. using a spark-submit job via uber jar). Feel free to add me to gmail chat and maybe we can help each other. On Mon, Apr 13, 2015 at 6

Re: Task result in Spark Worker Node

2015-04-13 Thread Imran Rashid
On the worker side, it all happens in Executor. The task result is computed here: https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210 then it's serialized along with some other goodies, and finally sent ba

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Vadim Bichutskiy
I don't believe the Kinesis asl should be provided. I used mergeStrategy successfully to produce an "uber jar." Fyi, I've been having trouble consuming data out of Kinesis with Spark with no success :( Would be curious to know if you got it working. Vadim > On Apr 13, 2015, at 9:36 PM, Mike T

sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn] Merging 'META-INF/NOTICE' with strategy 'rename' [warn] Merging 'META-INF/LICENSE.txt' with strategy 'rename' [warn] Merging 'META-INF/LICENSE' with strat
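For reference, a minimal merge-strategy sketch for build.sbt (the setting key syntax varies with the sbt-assembly version; this follows the assemblyMergeStrategy form and the patterns are only examples):

    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop duplicate manifests/licenses
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }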

How to access postgresql on Spark SQL

2015-04-13 Thread doovsaid
Hi all, Does anyone know how to access PostgreSQL from Spark SQL? Do I need to add the postgresql dependency in build.sbt and set the class path for it? Thanks. Regards, Yi
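A hedged sketch of one way to do this with the Spark 1.3 JDBC data source (URL, table and credentials are placeholders; the PostgreSQL driver jar must be on the driver and executor classpath, e.g. via --jars or a build dependency):

    val df = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
      "dbtable" -> "public.my_table",
      "driver" -> "org.postgresql.Driver"))
    df.printSchema()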

Spark Worker IP Error

2015-04-13 Thread DStrip
I tried to start the Spark Worker using the registered IP but this error occurred: 15/04/13 21:35:59 INFO Worker: Registered signal handlers for [TERM, HUP, INT] Exception in thread "main" java.net.UnknownHostException: 10.240.92.75/: Name or service not known at java.net.Inet6AddressImpl.

Registering classes with KryoSerializer

2015-04-13 Thread Arun Lists
Hi, I am trying to register classes with KryoSerializer. This has worked with other programs. Usually the error messages are helpful in indicating which classes need to be registered. But with my current program, I get the following cryptic error message: *Caused by: java.lang.IllegalArgumentExce
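A minimal registration sketch (the class names are placeholders for whatever the cryptic message reports as unregistered):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))  // hypothetical app classes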

Re: org.apache.spark.ml.recommendation.ALS

2015-04-13 Thread Jay Katukuri
Hi Xiangrui, Here is the class: object ALSNew { def main (args: Array[String]) { val conf = new SparkConf() .setAppName("TrainingDataPurchase") .set("spark.executor.memory", "4g") conf.set("spark.shuffle.memoryFraction","0.65") //default is 0.2 conf.set("sp

Re: Rack locality

2015-04-13 Thread Sandy Ryza
Hi Riya, As far as I know, that is correct, unless Mesos fine-grained mode handles this in some mysterious way. -Sandy On Mon, Apr 13, 2015 at 2:09 PM, rcharaya wrote: > I want to use Rack locality feature of Apache Spark in my application. > > Is YARN the only resource manager which supports

Rack locality

2015-04-13 Thread rcharaya
I want to use Rack locality feature of Apache Spark in my application. Is YARN the only resource manager which supports it as of now? After going through the source code, I found that default implementation of getRackForHost() method returns NONE in TaskSchedulerImpl which (I suppose) would be us

Re: Need some guidance

2015-04-13 Thread Dean Wampler
That appears to work, with a few changes to get the types correct: input.distinct().combineByKey((s: String) => 1, (agg: Int, s: String) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: How to load avro file into spark not on Hadoop in pyspark?

2015-04-13 Thread sa
Note that I am running pyspark in local mode (I do not have a hadoop cluster connected) as I want to be able to work with the avro file outside of hadoop. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-avro-file-into-spark-not-on-Hadoop-in-pyspa

Re: Spark support for Hadoop Formats (Avro)

2015-04-13 Thread Michael Armbrust
The problem is likely that the underlying avro library is reusing objects for speed. You probably need to explicitly copy the values out of the reused record before the collect. On Sat, Apr 11, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > The read seem to be successfully as the values for each field
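A sketch of the copy-before-collect idea (assuming records arrive as (AvroKey[GenericRecord], NullWritable) pairs from newAPIHadoopFile; the path and field names are assumptions):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    val avroRdd = sc.newAPIHadoopFile(
      "/path/to/data.avro",                      // placeholder path
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])

    // The input format reuses the record object, so copy plain values out before collect().
    val rows = avroRdd.map { case (key, _) =>
      val rec = key.datum()
      (rec.get("id").toString, rec.get("name").toString)  // hypothetical field names
    }.collect()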

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-13 Thread Michael Armbrust
> > Here is the stack trace. The first part shows the log when the session is > started in Tableau. It is using the init sql option on the data > connection to create the TEMPORARY table myNodeTable. > Ah, I see. Thanks for providing the error. The problem here is that temporary tables do not exis

Re: Need some guidance

2015-04-13 Thread Victor Tso-Guillen
How about this? input.distinct().combineByKey((v: V) => 1, (agg: Int, x: Int) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2).collect() On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler wrote: > The problem with using collect is that it will fail for large data sets, > as you'll attempt to copy

How to load avro file into spark not on Hadoop in pyspark?

2015-04-13 Thread sa
I have 2 questions related to pyspark: 1. How do I load an avro file that is on the local filesystem as opposed to hadoop? I tried the following and I just get NullPointerExceptions: avro_rdd = sc.newAPIHadoopFile( "file:///c:/my-file.avro", "org.apache.avro.mapreduce.AvroKeyInputFormat",

Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Jonathan Coveney
Nothing so complicated... we are seeing mesos kill off our executors immediately. When I reroute logging to an NFS directory we have available, the executors survive fine. As such I am wondering if the spark workers are getting killed by mesos for exceeding their disk quota (which atm is 0). This c

Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Patrick Wendell
Hey Jonathan, Are you referring to disk space used for storing persisted RDD's? For that, Spark does not bound the amount of data persisted to disk. It's a similar story to how Spark's shuffle disk output works (and also Hadoop and other frameworks make this assumption as well for their shuffle da

Re: Spark TeraSort source request

2015-04-13 Thread Ewan Higgs
Tom, According to Github's public activity log, Reynold Xin (in CC) deleted his sort-benchmark branch yesterday. I didn't have a local copy aside from the Daytona Partitioner (attached). Reynold, is it possible to reinstate your branch? -Ewan On 13/04/15 16:41, Tom Hubregtsen wrote: Thank yo

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Guillaume Pitel
That's why I think it's the OOM killer. There are several cases of memory overuse / errors : 1 - The application tries to allocate more than the Heap limit and GC cannot free more memory => OutOfMemory : Java Heap Space exception from JVM 2 - The jvm is configured with a max heap size larger th

Re: feature scaling in GeneralizedLinearAlgorithm.scala

2015-04-13 Thread Xiangrui Meng
Correct. Prediction doesn't touch that code path. -Xiangrui On Mon, Apr 13, 2015 at 9:58 AM, Jianguo Li wrote: > Hi, > > In the GeneralizedLinearAlgorithm, which Logistic Regression relied on, it > says "if userFeatureScaling is enabled, we will standardize the training > features , and trai

Re: Opening many Parquet files = slow

2015-04-13 Thread Eric Eijkelenboom
Hi guys Does anyone know how to stop Spark from opening all Parquet files before starting a job? This is quite a show stopper for me, since I have 5000 Parquet files on S3. Recap of what I tried: 1. Disable schema merging with: sqlContext.load("parquet", Map("mergeSchema" -> "false", "path"

Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zhan Zhang
Hi Zork, From the exception, it is still caused by hdp.version not being propagated correctly. Can you check whether there is any typo? [root@c6402 conf]# more java-opts -Dhdp.version=2.2.0.0–2041 [root@c6402 conf]# more spark-defaults.conf spark.driver.extraJavaOptions -Dhdp.version=2.2.0.

Re: counters in spark

2015-04-13 Thread Grandl Robert
Guys, Do you have any thoughts on this? Thanks, Robert On Sunday, April 12, 2015 5:35 PM, Grandl Robert wrote: Hi guys, I was trying to figure out some counters in Spark, related to the amount of CPU or memory (in some metric) used by a task/stage/job, but I could not find

Re: Multipart upload to S3 fails with Bad Digest Exceptions

2015-04-13 Thread Eugen Cepoi
I think I found where the problem comes from. I am writing lzo compressed thrift records using elephant-bird, my guess is that perhaps one side is computing the checksum based on the uncompressed data and the other on the compressed data, thus getting a mismatch. When writing the data as strings

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
I'm not 100% sure of spark's implementation, but in the MR frameworks it would have a much larger shuffle write size because that node is dealing with a lot more data and as a result has a lot more to shuffle 2015-04-13 13:20 GMT-04:00 java8964 : > If it is really due to data skew, will the task

Re: Multiple Kafka Recievers

2015-04-13 Thread Cody Koeninger
As far as I know, createStream doesn't let you specify where receivers are run. createDirectStream in 1.3 doesn't use long-running receivers, so it is likely to give you more even distribution of consumers across your workers. On Mon, Apr 13, 2015 at 11:31 AM, Laeeq Ahmed wrote: > Hi, > > I hav
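A sketch of the direct approach mentioned above (Spark 1.3; the broker list and topic are placeholders, and `ssc` is an existing StreamingContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    // No long-running receivers: each batch's Kafka partitions are read by the executors directly.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))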

Re: Need some guidance

2015-04-13 Thread Dean Wampler
The problem with using collect is that it will fail for large data sets, as you'll attempt to copy the entire RDD to the memory of your driver program. The following works (Scala syntax, but similar to Python): scala> val i1 = input.distinct.groupByKey scala> i1.foreach(println) (1,CompactBuffer(b

RE: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread java8964
If it is really due to data skew, will the task hanging has much bigger Shuffle Write Size in this case? In this case, the shuffle write size for that task is 0, and the rest IO of this task is not much larger than the fast finished tasks, is that normal? I am also interested in this case, as fro

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Tim Chen
Linux OOM throws SIGTERM, but if I remember correctly JVM handles heap memory limits differently and throws OutOfMemoryError and eventually sends SIGINT. Not sure what happened but the worker simply received a SIGTERM signal, so perhaps the daemon was terminated by someone or a parent process. Jus

feature scaling in GeneralizedLinearAlgorithm.scala

2015-04-13 Thread Jianguo Li
Hi, In the GeneralizedLinearAlgorithm, which Logistic Regression relied on, it says "if userFeatureScaling is enabled, we will standardize the training features , and trained the model in the scaled space. Then we transform the coefficients from the scaled space to the original space ...". My

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
I can promise you that this is also a problem in the pig world :) not sure why it's not a problem for this data set, though... are you sure that the two are doing the exact same code? you should inspect your source data. Make a histogram for each and see what the data distribution looks like. If t
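One way to build that histogram (a sketch; `viEvents` is one of the RDD names from the original post, and countByValue assumes the distinct key count fits on the driver):

    val topItemIds = viEvents
      .map { case (itemId, _) => itemId }
      .countByValue()
      .toSeq.sortBy(-_._2)
      .take(20)
    topItemIds.foreach(println)   // look for a null/0/catch-all key dominating the counts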

Re: How to use multiple app jar files?

2015-04-13 Thread Marcelo Vanzin
You can copy the dependencies to all nodes in your cluster, and then use "spark.{executor,driver}.extraClassPath" to add them to the classpath of your executors / driver. On Mon, Apr 13, 2015 at 4:15 AM, Michael Weir wrote: > My app works fine with the single, "uber" jar file containing my app
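A sketch of what that looks like in spark-defaults.conf (paths and jar names are placeholders; the jars must already exist at these paths on every node):

    spark.executor.extraClassPath  /opt/myapp/lib/dep1.jar:/opt/myapp/lib/dep2.jar
    spark.driver.extraClassPath    /opt/myapp/lib/dep1.jar:/opt/myapp/lib/dep2.jar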

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread ๏̯͡๏
You mean there is a tuple in either RDD that has itemID = 0 or null? And what is a catch-all? Does that imply it is a good idea to run a filter on each RDD first? We do not do this using Pig on M/R. Is it required in the Spark world? On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney wrote: > My gue

Multipart upload to S3 fails with Bad Digest Exceptions

2015-04-13 Thread Eugen Cepoi
Hi, I am not sure my problem is relevant to spark, but perhaps someone else had the same error. When I try to write files that need multipart upload to S3 from a job on EMR I always get this error: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what

Need some guidance

2015-04-13 Thread Marco Shaw
**Learning the ropes** I'm trying to grasp the concept of using the pipeline in pySpark... Simplified example: >>> list=[(1,"alpha"),(1,"beta"),(1,"foo"),(1,"alpha"),(2,"alpha"),(2,"alpha"),(2,"bar"),(3,"foo")] Desired outcome: [(1,3),(2,2),(3,1)] Basically for each key, I want the number of un
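A worked sketch of the distinct + combineByKey approach suggested in this thread (Scala syntax, similar in Python; result order may vary):

    val input = sc.parallelize(Seq(
      (1, "alpha"), (1, "beta"), (1, "foo"), (1, "alpha"),
      (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")))

    val uniquesPerKey = input.distinct().combineByKey(
      (s: String) => 1,                       // first distinct value seen for a key
      (agg: Int, s: String) => agg + 1,       // another distinct value for the same key
      (agg1: Int, agg2: Int) => agg1 + agg2)  // merge per-partition counts

    uniquesPerKey.collect()  // Array((1,3), (2,2), (3,1))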

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread Jonathan Coveney
My guess would be data skew. Do you know if there is some item id that is a catch all? can it be null? item id 0? lots of data sets have this sort of value and it always kills joins 2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) : > Code: > > val viEventsWithListings: RDD[(Long, (DetailInputRecord, VIS

Help in transforming the RDD

2015-04-13 Thread Jeetendra Gangele
Hi All I have a JavaPairRDD where each Long key has 4 String values associated with it. I want to fire an HBase query to look up each String part of the RDD. This look-up will give a result of around 7K integers, so for each key I will have 7K values. Now my input RDD always already more tha

Re: "Could not compute split, block not found" in Spark Streaming Simple Application

2015-04-13 Thread Saiph Kappa
Whether I use 1 or 2 machines, the results are the same... Here follows the results I got using 1 and 2 receivers with 2 machines: 2 machines, 1 receiver: sbt/sbt "run-main Benchmark 1 machine1 1000" 2>&1 | grep -i "Total delay\|record" 15/04/13 16:41:34 INFO JobScheduler: Total delay: 0.15

What's the cleanest way to make spark aware of my custom scheduler?

2015-04-13 Thread Jonathan Coveney
I need to have my own scheduler to point to a proprietary remote execution framework. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2152 I'm looking at where it decides on the backend and it doesn't look like there is a hook. Of course I can

Spark Streaming Kafka Consumer, Confluent Platform, Avro & StorageLevel

2015-04-13 Thread Nicolas Phung
Hello, I'm trying to use a Spark Streaming (1.2.0-cdh5.3.2) consumer (via spark-streaming-kafka lib of the same version) with Kafka's Confluent Platform 1.0. I manage to make a Producer that produce my data and can check it via the avro-console-consumer : "./bin/kafka-avro-console-consumer --top

Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread ๏̯͡๏
Code: val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] = lstgItem.join(viEvents).map { case (itemId, (listing, viDetail)) => val viSummary = new VISummary viSummary.leafCategoryId = listing.getLeafCategId().toInt viSummary.itemSiteId = listi

Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Jonathan Coveney
I'm surprised that I haven't been able to find this via google, but I haven't... What is the setting that requests some amount of disk space for the executors? Maybe I'm misunderstanding how this is configured... Thanks for any help!

Re: Spark TeraSort source request

2015-04-13 Thread Tom Hubregtsen
Thank you for your response Ewan. I quickly looked yesterday and it was there, but today at work I tried to open it again to start working on it, but it appears to be removed. Is this correct? Thanks, Tom On 12 April 2015 at 06:58, Ewan Higgs wrote: > Hi all. > The code is linked from my repo

Re: Packaging Java + Python library

2015-04-13 Thread prabeesh k
Refer this post http://blog.prabeeshk.com/blog/2015/04/07/self-contained-pyspark-application/ On 13 April 2015 at 17:41, Punya Biswal wrote: > Dear Spark users, > > My team is working on a small library that builds on PySpark and is > organized like PySpark as well -- it has a JVM component (tha

Packaging Java + Python library

2015-04-13 Thread Punya Biswal
Dear Spark users, My team is working on a small library that builds on PySpark and is organized like PySpark as well -- it has a JVM component (that runs in the Spark driver and executor) and a Python component (that runs in the PySpark driver and executor processes). What's a good approach for

Re: Exception"Driver-Memory" while running Spark job on Yarn-cluster

2015-04-13 Thread ๏̯͡๏
Try this ./bin/spark-submit -v --master yarn-cluster --jars ./libs/mysql-connector-java-5.1.17.jar,./libs/log4j-1.2.17.jar --files datasource.properties,log4j.properties --num-executors 1 --driver-memory 4g --driver-java-options "-XX:MaxPermSize=1G" --executor-memory 2g --executor-cores 1 --cla

Re: How to use multiple app jar files?

2015-04-13 Thread ๏̯͡๏
I faced the exact same issue. The way I solved it was: 1. Copy the entire project. 2. Delete all the source, keeping only the dependencies in pom.xml. This will create a fat jar, without source but with deps only. 3. Keep the original project as is, now build it. This will create a JAR (no deps, by default). Now

Sqoop parquet file not working in spark

2015-04-13 Thread bipin
Hi I imported a table from mssql server with Sqoop 1.4.5 in parquet format. But when I try to load it from Spark shell, it throws error like : scala> val df1 = sqlContext.load("/home/bipin/Customer2") scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel comput

Re: MLlib : Gradient Boosted Trees classification confidence

2015-04-13 Thread mike
Thank you Peter. I just want to be sure: even if I use the "classification" setting, the GBT uses regression trees and not classification trees? I know the difference between the two (theoretically) is only in the loss and impurity functions, thus in case it uses classification trees doing what you

How to use multiple app jar files?

2015-04-13 Thread Michael Weir
My app works fine with the single, "uber" jar file containing my app and all its dependencies. However, it takes about 20 minutes to copy the 65MB jar file up to the node on the cluster, so my "code, compile, test" cycle has become a "core, compile, cooppp, test" cycle. I'd like to hav

Reading files from http server

2015-04-13 Thread Peter Rudenko
Hi, I want to play with the Criteo 1 TB dataset. Files are located on Azure storage. Here's a command to download them: curl -O http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_{`seq -s ',' 0 23`}.gz Is there any way to read files through the http protocol with spark without downloading

Re: Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-13 Thread Cheng Lian
Thanks Yijie! Also cc the user list. Cheng On 4/13/15 9:19 AM, Yijie Shen wrote: I opened a new Parquet JIRA ticket here: https://issues.apache.org/jira/browse/PARQUET-251 Yijie On April 12, 2015 at 11:48:57 PM, Cheng Lian (lian.cs@gmail.com ) wrote: Tha

Exception"Driver-Memory" while running Spark job on Yarn-cluster

2015-04-13 Thread sachin Singh
Hi , When I am submitting spark job as --master yarn-cluster with below command/options getting driver memory error- spark-submit --jars ./libs/mysql-connector-java-5.1.17.jar,./libs/log4j-1.2.17.jar --files datasource.properties,log4j.properties --master yarn-cluster --num-executors 1 --driver-m

Re: Problem getting program to run on 15TB input

2015-04-13 Thread Daniel Mahler
Sometimes a large number of partitions leads to memory problems. Something like val rdd1 = sc.textFile(file1).coalesce(500). ... val rdd2 = sc.textFile(file2).coalesce(500). ... may help. On Mon, Mar 2, 2015 at 6:26 PM, Arun Luthra wrote: > Everything works smoothly if I do the 99%-removal fi

Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zork Sail
Hi Zhan, Alas setting: -Dhdp.version=2.2.0.0–2041 Does not help. Still get the same error: 15/04/13 09:53:59 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1428918838408

Re: Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread Guillaume Pitel
Very likely to be this : http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2 Your worker ran out of memory => maybe you're asking for too much memory for the JVM, or something else is running on the worker Guillaume Any idea what this means, many thanks ==>

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-13 Thread Guillaume Pitel
Does it also cleanup spark local dirs ? I thought it was only cleaning $SPARK_HOME/work/ Guillaume I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=" On 11.04.2015, at 00:01, W

Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread James King
Any idea what this means, many thanks ==> logs/spark-.-org.apache.spark.deploy.worker.Worker-1-09.out.1 <== 15/04/13 07:07:22 INFO Worker: Starting Spark worker 09:39910 with 4 cores, 6.6 GB RAM 15/04/13 07:07:22 INFO Worker: Running Spark version 1.3.0 15/04/13 07:07:22 INFO Worke

Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-13 Thread ๏̯͡๏
You need to do a few more things or you will eventually run into these issues val conf = new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get) .set("spark.kryoserializer.buff

RE: Kryo exception : Encountered unregistered class ID: 13994

2015-04-13 Thread mehdisinger
Hello, Thank you for your answer. I'm already registering my classes as you're suggesting... Regards From: tsingfu [via Apache Spark User List] [mailto:ml-node+s1001560n22468...@n3.nabble.com] Sent: Monday, April 13, 2015 03:48 To: Mehdi Singer Subject: Re: Kryo exception : Encountered unregiste

Manning looking for a co-author for the GraphX in Action book

2015-04-13 Thread Reynold Xin
Hi all, Manning (the publisher) is looking for a co-author for the GraphX in Action book. The book currently has one author (Michael Malak), but they are looking for a co-author to work closely with Michael and improve the writings and make it more consumable. Early access page for the book: http

Re: regarding ZipWithIndex

2015-04-13 Thread Jeetendra Gangele
How about using mapToPair and exchanging the two? Will it be efficient? Below is the code; will it be efficient to convert like this? JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(new PairFunction<Tuple2<MatcherReleventData, Long>, Long, MatcherReleventData>() { @Override public Tuple2<Long, MatcherReleventData> call(Tuple2<MatcherReleventData, Long> t) throws Exception

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-13 Thread Marius Soutier
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=" > On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) > wrote: > > Does anybody have an answer for this? > > Thanks > Ningjun >
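For reference, a sketch of the full spark-env.sh setting (the interval and TTL values, 30 minutes and 7 days in seconds, are only examples and not from the thread):

    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"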