Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-28 Thread Sean Owen
It might kind of work, but you are effectively making all of your workers into mini, separate Spark drivers in their own right. This might cause snags down the line as this isn't the normal thing to do. On Tue, Oct 28, 2014 at 12:11 AM, Localhost shell wrote: > Hey lordjoe, > > Apologies for the

Re: Is Spark the right tool?

2014-10-28 Thread Akhil
You can use Spark Streaming to get the transactions from those TCP connections periodically, and you can push the data into HBase accordingly. Now, regarding the querying part, you can use a database like Redis, which actually does the key-value storing for you. You can use the RDDs to query (insert,

Why RDD is not cached?

2014-10-28 Thread shahab
Hi, I have a standalone Spark cluster, where each executor is set to have 6.3 G memory; as I am using two workers, in total there is 12.6 G memory and 4 cores. I am trying to cache an RDD with an approximate size of 3.2 G, but apparently it is not cached, as I can neither see " BlockManagerMasterActor: Adde

Re: Why RDD is not cached?

2014-10-28 Thread Jagat Singh
What setting are you using for persist() or cache()? http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence On Tue, Oct 28, 2014 at 6:18 PM, shahab wrote: > Hi, > > I have a standalone spark , where the executor is set to have 6.3 G memory > , as I am using two workers so in
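For reference, a minimal sketch of the persistence API the guide above describes (the input path is a placeholder; cache() is simply persist(StorageLevel.MEMORY_ONLY)):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///some/path")   // placeholder input
    rdd.persist(StorageLevel.MEMORY_AND_DISK)    // or rdd.cache() for MEMORY_ONLY
    rdd.count()                                  // nothing is cached until an action runs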

Re: sampling in spark

2014-10-28 Thread Chengi Liu
Oops, the reference for the above code: http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945 On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu wrote: > Hi, > I have three rdds.. X,y and p > X is matrix rdd (mXn), y is (mX1) dimension ve

sampling in spark

2014-10-28 Thread Chengi Liu
Hi, I have three RDDs: X, y and p. X is a matrix RDD (mXn), y is an (mX1)-dimension vector and p is an (mX1)-dimension probability vector. Now, I am trying to sample k rows from X and the corresponding entries in y based on the probability vector p. Here is the python implementation: import random; from bisect impo

Re: Why RDD is not cached?

2014-10-28 Thread Sean Owen
Did you just call cache()? By itself it does nothing, but once an action requires it to be computed, it should become cached. On Oct 28, 2014 8:19 AM, "shahab" wrote: > Hi, > > I have a standalone spark , where the executor is set to have 6.3 G memory > , as I am using two workers so in total there

Re: sampling in spark

2014-10-28 Thread Davies Liu
_cumm = [p[0]] for i in range(1, len(p)): _cumm.append(_cumm[-1] + p[i]) index = set([bisect(_cumm, random.random()) for i in range(k)]) chosed_x = X.zipWithIndex().filter(lambda (v, i): i in index).map(lambda (v, i): v) chosed_y = [v for i, v in

Re: sampling in spark

2014-10-28 Thread Chengi Liu
Is there an equivalent way of doing the following: a = [1,2,3,4]; reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:] ?? The issue with the above suggestion is that the population is a hefty data structure :-/ On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu wrote: > _cumm = [p[0]] > for i in r
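For what it's worth, the Python reduce above just builds a running (cumulative) sum of the list; a local Scala equivalent of the same computation would be scanLeft:

    val a = List(1, 2, 3, 4)
    val cumm = a.scanLeft(0)(_ + _).tail   // List(1, 3, 6, 10), same as the reduce(...)[1:] trick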

Singapore Meetup

2014-10-28 Thread Social Marketing
Dear Sir/Madam, This is Songtao, living in Singapore and doing some research on big data projects at NUS. I want to be an organiser for a Singapore Meetup. Thanks. Songtao - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: Why RDD is not cached?

2014-10-28 Thread shahab
I used cache followed by a "count" on the RDD to ensure that caching is performed. val rdd = srdd.flatMap(mapProfile_To_Sessions).cache val count = rdd.count //so at this point the RDD should be cached, right? On Tue, Oct 28, 2014 at 8:35 AM, Sean Owen wrote: > Did you just call cache()? By itself

Submiting Spark application through code

2014-10-28 Thread sivarani
Hi, I am submitting a Spark application in the following fashion: bin/spark-submit --class "NetworkCount" --master spark://abc.test.com:7077 try/simple-project/target/simple-project-1.0-jar-with-dependencies.jar But is there any other way to submit a Spark application through code? Like, for ex

Re: Spark Streaming Applications

2014-10-28 Thread sivarani
Hi tdas, is it possible to run Spark 24/7? I am using updateStateByKey and I am streaming 3 lakh records in 1/2 hr. I am not getting the correct result, and I am also not able to run Spark Streaming 24/7: after a few hrs I get an array out of bounds exception even if I am not streaming anything. Btw will

Re: Spark Streaming - How to remove state for key

2014-10-28 Thread sivarani
I am having the same issue. I am using updateStateByKey, and over a period a set of data will not change any more; I would like to save it and delete it from the state. Have you found the answer? Please share your views. Thanks for your time -- View this message in context: http://apache-spark-user-list.100156
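One approach worth knowing here, though it is not spelled out in this thread: the update function given to updateStateByKey can return None for a key, which removes that key's state. A hedged sketch (the types and the "finished" flag are illustrative):

    import org.apache.spark.streaming.StreamingContext._

    // pairs: DStream[(String, Long)] built earlier in the job (hypothetical)
    case class SessionState(total: Long, finished: Boolean)

    val updateFunc = (newValues: Seq[Long], state: Option[SessionState]) => {
      val current = state.getOrElse(SessionState(0L, finished = false))
      val updated = current.copy(total = current.total + newValues.sum)
      if (updated.finished) None    // returning None drops the key from the state stream
      else Some(updated)
    }

    val stateStream = pairs.updateStateByKey[SessionState](updateFunc)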

Re: Submiting Spark application through code

2014-10-28 Thread Akhil Das
How about directly running it? val ssc = new StreamingContext("local[2]","Network WordCount",Seconds(5), "/home/akhld/mobi/localclusterxx/spark-1") val lines=ssc.socketTextStream("localhost", 12345) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x
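The snippet above is cut off by the archive; a self-contained sketch of the same word-count idea (host, port, and batch interval are placeholders) looks roughly like:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(5))
    val lines = ssc.socketTextStream("localhost", 12345)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()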

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-28 Thread buring
Yes, I use standalone mode. I have set "spark.io.compression.codec" in code: conf.set("spark.io.compression.codec","org.apache.spark.io.LZ4CompressionCodec") It seems to have no influence on "saveAsSequenceFile", which still uses snappy compression internally. Thanks. -- Vi
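Worth noting: spark.io.compression.codec only governs Spark's internal compression (shuffle and serialized RDD blocks); what saveAsSequenceFile writes is governed by the Hadoop output-compression settings, and a codec can also be passed explicitly. A hedged sketch (RDD name and path are placeholders):

    import org.apache.hadoop.io.compress.DefaultCodec

    // data: RDD[(String, Int)] -- choose the output codec explicitly instead of the cluster default
    data.saveAsSequenceFile("hdfs:///tmp/out", Some(classOf[DefaultCodec]))

    // or turn output compression off entirely via the Hadoop configuration
    sc.hadoopConfiguration.set("mapred.output.compress", "false")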

Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
Hello, I am trying to reduce the number of java threads (about 80 on my system) to as few as possible. What settings can be done in spark-1.1.0/conf/spark-env.sh ? (or other places as well) I am also using hadoop for storing data on hdfs Thank you, Wanda

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-28 Thread Shixiong Zhu
I mean updating the spark conf not only in the driver, but also in the Spark Workers. Because the driver configurations cannot be read by the Executors, they still use the default spark.io.compression.codec to deserialize the tasks. Best Regards, Shixiong Zhu 2014-10-28 16:39 GMT+08:00 buring :

How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
I seem to recall there were some specific requirements on how to import the implicits. Here is the issue: scala> import org.apache.spark.mllib.rdd.RDDFunctions._ :10: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import org.apache.spark.

Re: what classes are needed to register in KryoRegistrator, e.g. Row?

2014-10-28 Thread Fengyun RAO
Although nobody has answered: as I tested, Row, MutableValue and their subclasses are not registered by default, which I think they should be, since they will absolutely show up in Spark SQL. 2014-10-26 23:43 GMT+08:00 Fengyun RAO : > In Tuning Spark ,
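For anyone hitting the same thing, a minimal registrator sketch; which concrete Row implementations are worth registering depends on your Spark SQL version, so the class names below are illustrative:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // register whatever Row/MutableValue implementations show up in your stack traces
        kryo.register(Class.forName("org.apache.spark.sql.catalyst.expressions.GenericRow"))
        kryo.register(classOf[Array[Any]])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")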

Re: Submiting Spark application through code

2014-10-28 Thread sivarani
Hi, I know we can create a streaming context with new JavaStreamingContext(master, appName, batchDuration, sparkHome, jarFile), but to run the application we will have to use spark-home/spark-submit --class NetworkCount. I want to skip submitting manually; I want to invoke this Spark app when a conditio

Re: Spark SQL reduce number of java threads

2014-10-28 Thread Prashant Sharma
What is the motivation behind this ? You can start with master as local[NO_OF_THREADS]. Reducing the threads at all other places can have unexpected results. Take a look at this. http://spark.apache.org/docs/latest/configuration.html. Prashant Sharma On Tue, Oct 28, 2014 at 2:08 PM, Wanda Hawk

Re: Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
I am trying to get a software trace and I need to get the number of active threads as low as I can in order to inspect the "active" part of the workload From: Prashant Sharma To: Wanda Hawk Cc: "user@spark.apache.org" Sent: Tuesday, October 28, 2014 11:17 A

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
Because org.apache.spark.mllib.rdd.RDDFunctions is an mllib-private class, it can only be used by functions inside mllib. 2014-10-28 17:09 GMT+08:00 Stephen Boesch : > I seem to recall there were some specific requirements on how to import > the implicits. > > Here is the issue: > > scala> impor

How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
Hi, I am currently struggling with how to properly set up Spark to perform only one map, flatMap, etc. at once. In other words, my map uses a multi-core algorithm, so I would like to have only one map running at a time so that it can use all of the machine's cores. Thank you in advance for advice and replies.  Jan 

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
HI Yanbo, That is not the issue: notice that importing the object is fine: scala> import org.apache.spark.mllib.rdd.RDDFunctions import org.apache.spark.mllib.rdd.RDDFunctions scala> import org.apache.spark.mllib.rdd.RDDFunctions._ :11: error: object RDDFunctions in package rdd cannot be acces

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-28 Thread Stephen Boesch
I had an offline discussion with Akhil, but this issue is still not resolved. 2014-10-24 0:18 GMT-07:00 Akhil Das : > Make sure the guava jar > is > present in the classpath. > > Thanks > Best Regards > > On Thu, Oct 23, 2014 at 2:13 PM, Stephe

SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
Hi, friends: I use Spark SQL (Spark 1.1) to operate on data in hive-0.12, and the job fails when the data is large. So how do I tune it? spark-defaults.conf: spark.shuffle.consolidateFiles true spark.shuffle.manager SORT spark.akka.threads 4 spark.sql.inMemoryColumnarStorage.compressed true

Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Mars Max
Currently we are using Hive in some products; however, it seems Spark SQL may be a better choice. Is there any official comparison between them? Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-There-Any-Benchmarks-Comparing-Spark-SQL-and-Hiv

Re: Spark Shell strange worker Exception

2014-10-28 Thread Saket Kumar
Hi Paolo, The custom classes and jars are distributed across the Spark cluster via an HTTP server on the master when the absolute path of the application fat jar is specified in the spark-submit script. The Advanced Dependency Management section on https://spark.apache.org/docs/latest/submittin

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Cheng Lian
Which version of Spark and Hadoop are you using? Could you please provide the full stack trace of the exception? On Tue, Oct 28, 2014 at 5:48 AM, Du Li wrote: > Hi, > > I was trying to set up Spark SQL on a private cluster. I configured a > hive-site.xml under spark/conf that uses a local met

Re: NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-28 Thread Sasi
Thank you Akhil. You are correct; it's about overlapping "thrift" libraries. We have taken reference from the http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3Cdf6cde12e07c47f58bc6829a7c2167d6%40CHCXEXCHMBX001.SEA.CORP.EXPECN.com%3E link and listed libraries in the following order - a) cassa

How many executor process does an application receives?

2014-10-28 Thread shahab
Hi, I am running a standalone Spark cluster, 2 workers each with 2 cores. I submit one Spark application to the cluster, and I monitor the execution process via the UI (both worker-ip:8081 and master-ip:4040). There I can see that the application is handled by many Executors; in my case one worker has

Re: Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Yanbo Liang
You can refer to comparisons between different SQL-on-Hadoop solutions such as Hive, Spark SQL, Shark, Impala and so on. There are two main works, which may not be entirely objective, for your reference: Cloudera benchmark: http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4

Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
Hi, I got the following exceptions when using Spray client to write to OpenTSDB using its REST API. Exception in thread "pool-10-thread-2" java.lang.NoSuchMethodError: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext; It worked locally in my Intellij but failed when I laun

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Arpit Kumar
Any help regarding this issue please? Regards, Arpit On Sat, Oct 25, 2014 at 8:56 AM, Arpit Kumar wrote: > Hi all, > I am using the GraphLoader class to load graphs from edge list files. But > then I need to change the storage level of the graph to something other > than MEMORY_ONLY. > > val g

newbie question quickstart example sbt issue

2014-10-28 Thread nl19856
Hi, I have downloaded the binary spark distribution. When building the package with sbt package I get the following: [root@nlvora157 ~]# sbt package [info] Set current project to Simple Project (in build file:/root/) [info] Updating {file:/root/}root... [info] Resolving org.apache.spark#spark-core_

Re: newbie question quickstart example sbt issue

2014-10-28 Thread Yanbo Liang
Maybe you have a wrong sbt proxy configuration. 2014-10-28 18:27 GMT+08:00 nl19856 : > Hi, > I have downloaded the binary spark distribution. > When building the package with sbt package I get the following: > [root@nlvora157 ~]# sbt package > [info] Set current project to Simple Project (in buil

Re: newbie question quickstart example sbt issue

2014-10-28 Thread nl19856
Sigh! Sorry I did not read the error message properly. 2014-10-28 11:39 GMT+01:00 Yanbo Liang [via Apache Spark User List] < ml-node+s1001560n17478...@n3.nabble.com>: > Maybe you had wrong configuration of sbt proxy. > > 2014-10-28 18:27 GMT+08:00 nl19856 <[hidden email] >

Re: newbie question quickstart example sbt issue

2014-10-28 Thread Akhil Das
Your proxy/dns could be blocking it. Thanks Best Regards On Tue, Oct 28, 2014 at 4:06 PM, Yanbo Liang wrote: > Maybe you had wrong configuration of sbt proxy. > > 2014-10-28 18:27 GMT+08:00 nl19856 : > >> Hi, >> I have downloaded the binary spark distribution. >> When building the package with

Re: NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-28 Thread Sasi
Add my message. On Tue, Oct 28, 2014 at 3:22 PM, Sasi [via Apache Spark User List] < ml-node+s1001560n17471...@n3.nabble.com> wrote: > Thank you Akhil. You are correct it's about overlapped "thrift" libraries. > We have taken reference from > http://mail-archives.apache.org/mod_mbox/spark-user/20

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
The number of tasks is decided by the number of input partitions. If you want only one map or flatMap at once, just call coalesce() or repartition() to consolidate the data into one partition. However, this is not recommended because it will not execute efficiently in parallel. 2014-10-28 17:27 GMT+08:00 : > H

Re: SparkSql OutOfMemoryError

2014-10-28 Thread Yanbo Liang
Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo : > Hi,friends: > > I use spark(spark 1.1) sql operate data in hive-0.12, and the job fails > when data is large. So how to tune it ? > > spark-defaults.conf: > > spark.shuffle.consolidateFiles true > spark.shuffle
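For reference, one way to raise the driver memory (the values are illustrative; since the setting applies when the driver JVM is launched, the flag or spark-defaults.conf is the usual route):

    # spark-defaults.conf
    spark.driver.memory   4g

    # or at submission time
    ./bin/spark-submit --driver-memory 4g --class MySqlJob myjob.jar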

Re: Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Mars Max
Got it, thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-There-Any-Benchmarks-Comparing-Spark-SQL-and-Hive-tp17469p17484.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
Hi Arpit, To try this: val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions, edgeStorageLevel = StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK) Best, Yifan LI On 28 Oct 2014, at 11:17, Arpit Kumar wrote: > Any help re

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
But I guess that this makes only one task over all the cluster's nodes. I would like to run several tasks, but I would like Spark to not run more than one map on each of my nodes at one time. That means I would like to, let's say, have 4 different tasks and 2 nodes where each node has 2 cores. Cur

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Arpit Kumar
Hi Yifan LI, I am currently working on Spark 1.0 in which we can't pass edgeStorageLevel as parameter. It implicitly caches the edges. So I am looking for a workaround. http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.graphx.GraphLoader$ Regards, Arpit On Tue, Oct 28, 201

sbt error building spark : [FATAL] Non-resolvable parent POM:

2014-10-28 Thread nl19856
Hi, I have cloned sparked as: git clone g...@github.com:apache/spark.git cd spark sbt/sbt compile Everything seems to go smooth until : [info] downloading https://repo1.maven.org/maven2/org/ow2/asm/asm-tree/5.0.3/asm-tree-5.0.3.jar ... [info] [SUCCESSFUL ] org.ow2.asm#asm-tree;5.0.3!asm-tree.jar

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
It's not very difficult to implement by properly setting the application's parameters. Some basic knowledge you should know: an application can have only one executor on each machine or container (YARN). So if you just set executor-cores to 1, each executor will run only one task at a time. 2014-10-28 1
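A sketch of what that looks like at submission time on YARN (flag names per the spark-submit help of that era; for the standalone mode discussed in this thread, the per-worker core count is instead set with SPARK_WORKER_CORES in spark-env.sh):

    # one core per executor => at most one running task per executor at a time
    ./bin/spark-submit --master yarn-cluster \
      --num-executors 2 --executor-cores 1 \
      --class MyJob myjob.jar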

Suitability for spark for master worker distributed patterns...

2014-10-28 Thread Sasha Kacanski
Hi, Did anyone try to replace a GigaSpaces implementation of master-worker with a Spark standalone or Hadoop-driven implementation? I guess I am looking to find out what the pros and cons are and whether people have tried it on the production side (grid or hadoop) Regards, -- Aleksandar Kacanski

Re: How many executor process does an application receives?

2014-10-28 Thread Yanbo Liang
An application can have only one executor on each machine or container (YARN). How many threads each executor has is determined by the parameter "executor-cores". There is also another way to set this: you can specify "total-executor-cores", and each executor's cores will be determin

Re: Re: SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
It works, thanks very much Zhanfeng Huo From: Yanbo Liang Date: 2014-10-28 18:50 To: Zhanfeng Huo CC: user Subject: Re: SparkSql OutOfMemoryError Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo : Hi,friends: I use spark(spark 1.1) sql operate data in hive-0.12, and

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
Yes, you can import org.apache.spark.mllib.rdd.RDDFunctions, but you cannot use any method in this class or even construct an object of this class. So I infer that if you import org.apache.spark.mllib.rdd.RDDFunctions._, it may call some method of that object. 2014-10-28 17:29 GMT+08:00 Stephen Boesch :

Re: Batch of updates

2014-10-28 Thread Kamal Banga
Hi Flavio, Doing batch += ... shouldn't work. It will create a new batch for each element in myRDD (also, val initializes an immutable variable; var is for mutable variables). You can use something like accumulators. val a

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
I am not sure if it can work on Spark 1.0, but give it a try. or, Maybe you can try: 1) to construct the edges and vertices RDDs respectively with desired storage level. 2) then, to obtain a graph by using Graph(verticesRDD, edgesRDD). Best, Yifan LI On 28 Oct 2014, at 12:10, Arpit Kumar wr
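A hedged sketch of that second suggestion: build the edge RDD yourself, persist it at the desired level, and construct the graph from it (the file format and attribute types are illustrative, and GraphX may still cache internally in 1.0, which is exactly the limitation this thread is about):

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.storage.StorageLevel

    val edges = sc.textFile(edgesFile).map { line =>
      val fields = line.split("\\s+")
      Edge(fields(0).toLong, fields(1).toLong, 1)   // srcId, dstId, edge attribute
    }.persist(StorageLevel.MEMORY_AND_DISK)

    val graph = Graph.fromEdges(edges, defaultValue = 1)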

How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread shahab
I am running a standalone Spark cluster, 2 workers each with 2 cores. Apparently, I am loading and processing a relatively large chunk of data, so I receive task failures. As I read from some posts and discussions in the mailing list, the failures could be related to the large size of process

GraphX StackOverflowError

2014-10-28 Thread Zuhair Khayyat
Dear All, I am using the connected components function of GraphX (on Spark 1.0.2) on some graph. However, for some reason it fails with a StackOverflowError. The graph is not too big; it contains 1 vertices and 50 edges. Can anyone help me avoid this error? Below is the output of Spark: 14/

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread Wanda Hawk
Is this what you are looking for? In Shark, the default reducer number is 1 and is controlled by the property mapred.reduce.tasks. Spark SQL deprecates this property in favor of spark.sql.shuffle.partitions, whose default value is 200. Users may customize this property via SET: SET spark.sql.shuffl

Deploying Spark on Stand alone cluster

2014-10-28 Thread TravisJ
I am trying to setup Apache-Spark on a small standalone cluster (1 Master Node and 8 Slave Nodes). I have installed the "pre-built" version of spark 1.1.0 built on top of Hadoop 2.4. I have set up the passwordless ssh between nodes and exported a few necessary environment variables. One of these va

Re: Is Spark the right tool?

2014-10-28 Thread Koert Kuipers
spark can definitely very quickly answer queries like "give me all transactions with property x". and you can put a http query server in front of it and run queries concurrently. but spark does not support inserts, updates, or fast random access lookups. this is because RDDs are immutable and desi

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread shahab
Thanks for the useful comment. But I guess this setting applies only when I use SparkSQL, right? Are there any similar settings for Spark? best, /Shahab On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk wrote: > Is this what are you looking for ? > > In Shark, default reducer number is 1 and is contr

Re: What executes on worker and what executes on driver side

2014-10-28 Thread Kamal Banga
Can you please elaborate, I didn't get what you intended for me to read in that link. Regards. On Mon, Oct 20, 2014 at 7:03 PM, Saurabh Wadhawan < saurabh.wadha...@guavus.com> wrote: > What about: > > > http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_

Streaming window operations not producing output

2014-10-28 Thread diogo
Hi there, I'm trying to use window operations on streaming, but every time I perform a windowed computation, I stop getting results. For example: val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() will print the output to stdout at the 'batch duration' interval. Now if I replace it wit
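For comparison, a sketch of the windowed form; both the window and slide durations must be multiples of the batch interval, and the variant that takes an inverse reduce function additionally requires ssc.checkpoint(...) to be set:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._

    // with a 10s batch interval: counts over the last 30s, emitted every 10s
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()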

Ending a job early

2014-10-28 Thread Jim Carroll
We have some very large datasets where the calculations converge on a result. Our current implementation allows us to track how quickly the calculations are converging and end the processing early. This can significantly speed up some of our processing. Is there a way to do the same thing in Spark

pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Csaba Ragany
Dear Spark Community, Is it possible to convert text files (.log or .txt files) into sequencefiles in Python? Using PySpark I can create a parallelized file with rdd=sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with rdd.saveAsSequenceFile(). But how can I put the whole cont

Re: Measuring Performance in Spark

2014-10-28 Thread mahsa
Thanks Akhil, So there is no tool that I can use right? My program is overloading some operators for some operation on images. I need to be accurate in the result. I try to work on your offered approach. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

install sbt

2014-10-28 Thread Pagliari, Roberto
Is there a repo or some kind of instruction about how to install sbt for centos? Thanks,

java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036842471144

2014-10-28 Thread Ruebenacker, Oliver A
Hello, I have a Spark app which I run with master "local[3]". When running without any persist calls, it seems to work fine, but as soon as I add persist calls (at default storage level), it fails at the first persist call with the message below. Unfortunately, I can't post the code. Po

Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Brett Antonides
Hello, Given the following example customers.json file: { "name": "Sherlock Holmes", "customerNumber": 12345, "address": { "street": "221b Baker Street", "city": "London", "zipcode": "NW1 6XE", "country": "United Kingdom" } }, { "name": "Big Bird", "customerNumber": 10001, "address": { "street": "
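Not an answer from the thread, but the way this usually looks with the SQL string syntax (the pure language-integrated DSL is less direct about nested fields), assuming Spark SQL 1.1 and the customers.json above:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val customers = sqlContext.jsonFile("customers.json")   // schema inference includes the nested address struct
    customers.registerTempTable("customers")

    // nested fields are addressed with dot notation in the SQL dialect
    val londoners = sqlContext.sql(
      "SELECT name, address.city FROM customers WHERE address.country = 'United Kingdom'")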

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Ilya Ganelin
Hi all - I've simplified the code so now I'm literally feeding in 200 million ratings directly to ALS.train. Nothing else is happening in the program. I've also tried with both the regular serializer and the KryoSerializer. With Kryo, I get the same ArrayIndex exceptions. With the regular serializ

Re: Batch of updates

2014-10-28 Thread Sean Owen
You should use foreachPartition, and take care to open and close your connection following the pattern described in: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E Within a partition, you iterate over elemen
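The shape of the pattern Sean is pointing to, roughly (the connection type and upsert helper are placeholders for whatever store you are writing to):

    // records: RDD[MyRecord]; createConnection/upsert are hypothetical helpers
    records.foreachPartition { iter =>
      val conn = createConnection()              // one connection per partition, not per element
      try {
        iter.grouped(500).foreach(batch => upsert(conn, batch))   // send updates in batches
      } finally {
        conn.close()
      }
    }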

Re: install sbt

2014-10-28 Thread Ted Yu
Have you read this ? http://lancegatlin.org/tech/centos-6-install-sbt On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto wrote: > Is there a repo or some kind of instruction about how to install sbt for > centos? > > > > Thanks, > > >

Re: install sbt

2014-10-28 Thread Nicholas Chammas
If you're just calling sbt from within the spark/sbt folder, it should download and install automatically. Nick On Tuesday, October 28, 2014, Ted Yu wrote: > Have you read this ? > http://lancegatlin.org/tech/centos-6-install-sbt > > On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto < > rpagli...@app

Re: install sbt

2014-10-28 Thread Soumya Simanta
sbt is just a jar file. So you really don't need to install anything. Once you run the jar file (sbt-launch.jar) it can download the required dependencies. I use an executable script called sbt that has the following contents. SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled -

Saving to Cassandra from Spark Streaming

2014-10-28 Thread Harold Nguyen
Hi all, I'm having trouble troubleshooting this particular block of code for Spark Streaming and saving to Cassandra: val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x,

JdbcRDD in Java

2014-10-28 Thread Ron Ayoub
The following line of code is indicating the constructor is not defined. The only examples I can find of usage of JdbcRDD is Scala examples. Does this work in Java? Is there any examples? Thanks. JdbcRDD rdd = new JdbcRDD(sp, () -> ods.getConnection(), sql, 1, 1783059, 1

Re: Saving to Cassandra from Spark Streaming

2014-10-28 Thread Gerard Maas
Looks like you're having some classpath issues. Are you providing your spark-cassandra-driver classes to your job? sparkConf.setJars(Seq(jars...)) ? On Tue, Oct 28, 2014 at 5:34 PM, Harold Nguyen wrote: > Hi all, > > I'm having trouble troubleshooting this particular block of code for Spark > S
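For completeness, the programmatic counterpart of passing --jars, which ships the listed jars to the executors (the path is a placeholder):

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .setAppName("SaveToCassandra")
      .setJars(Seq("/path/to/spark-cassandra-connector-assembly.jar"))   // distributed to the executors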

Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba, It sounds like the API you are looking for is sc.wholeTextFiles :) Cheers, Holden :) On Tuesday, October 28, 2014, Csaba Ragany wrote: > Dear Spark Community, > > Is it possible to convert text files (.log or .txt files) into > sequencefiles in Python? > > Using PySpark I can create

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread Ilya Ganelin
In Spark, certain functions have an optional parameter to determine the number of partitions (distinct, textFile, etc.). You can also use the coalesce() or repartition() functions to change the number of partitions for your RDD. Thanks. On Oct 28, 2014 9:58 AM, "shahab" wrote: > Thanks for the u
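A few hedged examples of those knobs (paths and numbers are illustrative):

    // ask for a minimum number of input partitions up front
    val data = sc.textFile("hdfs:///big/input", 64)

    // shrink the partition count without a shuffle, or rebalance with one
    val fewer = data.coalesce(16)
    val rebalanced = data.repartition(128)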

Re: Scala Spark IDE help

2014-10-28 Thread Matt Narrell
So, I'm using IntelliJ 13.x and Scala Spark jobs. Make sure you have singletons (objects, not classes), then simply debug the main function. You’ll need to set your master to some derivation of “local”, but that's it. Spark Streaming is kinda wonky when debugging, but data-at-rest behaves like

Re: Scala Spark IDE help

2014-10-28 Thread andy petrella
Also, I'm following two master's students at the University of Liège (one computing conditional probability densities on massive data and the other implementing a Markov Chain method on georasters). I proposed that they use the Spark-Notebook to learn the framework, and they're quite happy with it (so far at lea

Re: Keep state inside map function

2014-10-28 Thread Koert Kuipers
doing cleanup in an iterator like that assumes the iterator always gets fully read, which is not necessary the case (for example RDD.take does not). instead i would use mapPartitionsWithContext, in which case you can write a function of the form. f: (TaskContext, Iterator[T]) => Iterator[U] now
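A sketch of that suggestion (mapPartitionsWithContext is a developer API in Spark 1.x; setup/cleanup/process are placeholder functions):

    val result = rdd.mapPartitionsWithContext { (context, iter) =>
      val resource = setup()                                   // e.g. open a connection or parser
      context.addOnCompleteCallback(() => cleanup(resource))   // runs even if the iterator is not fully consumed
      iter.map(x => process(resource, x))
    }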

Re: Spark to eliminate full-table scan latency

2014-10-28 Thread Matt Narrell
I’ve been puzzled by this lately. I too would like to use the thrift server to provide JDBC style access to datasets via SparkSQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use SparkSQL to query those tables while the contex

Re: Submiting Spark application through code

2014-10-28 Thread Matt Narrell
Can this be done? Can I just spin up a SparkContext programmatically, point this to my yarn-cluster and this works like spark-submit?? Doesn’t (at least) the application JAR need to be distributed to the workers via HDFS or the like for the jobs to run? mn > On Oct 28, 2014, at 2:29 AM, Akhi

real-time streaming

2014-10-28 Thread ll
The Spark tutorial shows that we can create a stream that reads "new files" from a directory. That seems to have some lag time, as we have to write the data to a file first and then wait until the Spark stream picks it up. What is the best way to implement REAL 'REAL-TIME' streaming for analysis in r

Re: real-time streaming

2014-10-28 Thread jay vyas
A REAL TIME stream, by definition, delivers data every X seconds. You can easily do this with Spark. Roughly, here is the way to create a stream gobbler and attach a Spark app to read its data every X seconds - Write a Runnable thread which reads data from a source. Test that it works indepen
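If the source can push records over a socket or be polled directly, a custom Receiver is the usual streaming-API shape for this; a minimal hedged sketch (readFromSource is a stand-in for your feed):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class SourceReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {
      def onStart() {
        new Thread("source-reader") {
          override def run() {
            while (!isStopped) {
              store(readFromSource())   // hand each record to Spark Streaming as it arrives
            }
          }
        }.start()
      }
      def onStop() { }
      private def readFromSource(): String = { Thread.sleep(100); "record" }   // placeholder
    }

    // val stream = ssc.receiverStream(new SourceReceiver)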

Re: real-time streaming

2014-10-28 Thread ll
Thanks Jay. Do you think Spark is a good fit for handling streaming & analyzing videos in real time? In this case, we're streaming 30 frames per second, and each frame is an image (size: roughly 500K - 1MB). We need to analyze every frame and return the analysis result back instantly in real ti

Re: JdbcRDD in Java

2014-10-28 Thread Sean Owen
That declaration looks OK for Java 8, at least when I tried it just now vs master. The only thing I see wrong here is getInt throws an exception which means the lambda has to be more complicated than this. This is Java code here calling the constructor so yes it can work fine from Java (8). On Tue

Re: Spark Streaming and Storm

2014-10-28 Thread critikaled
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-Storm-tp9118p17530.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
Here's the answer I got from Akka's user ML. """ This looks like a binary incompatibility issue. As far as I know Spark is using a custom built Akka and Scala for various reasons. You should ask this on the Spark mailing list, Akka is binary compatible between major versions (2.3.6 is compatible

Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Harold Nguyen
Hi all, The following works fine when submitting dependency jars through Spark-Shell: ./bin/spark-shell --master spark://ip-172-31-38-112:7077 --jars /home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2

Re: Ending a job early

2014-10-28 Thread Patrick Wendell
Hey Jim, There are some experimental (unstable) API's that support running jobs which might short-circuit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126 This can be used for doing online aggregations like you are describing. And in one
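One related pattern on stable APIs, for what it's worth: process a few partitions at a time with SparkContext.runJob and stop once the tracked metric converges (take() is built on the same idea). A hedged sketch (hasConverged is a placeholder):

    // data: RDD[Double]
    var converged = false
    var part = 0
    while (!converged && part < data.partitions.length) {
      val partial = sc.runJob(data, (it: Iterator[Double]) => it.sum, Seq(part), allowLocal = false)
      converged = hasConverged(partial.head)
      part += 1
    }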

Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I haven't learned Scala yet, so as you might imagine I'm having challenges working with Spark from the Java API. For one thing, it seems very limited in comparison to Scala. I ran into a problem really quickly. I need to hydrate an RDD from JDBC/Oracle, and so I wanted to use the JdbcRDD. But that i

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
I am using Hadoop 2.5.0.3 and spark 1.1. My local hive version is 0.12.3 the hcatalog.jar of which is included in the path. The stack trace is as follows: 14/10/28 18:24:24 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlExceptio

Re: RDD to Multiple Tables SparkSQL

2014-10-28 Thread critikaled
Hi Oliver, thanks for the answer. I don't have the information for all keys beforehand. The reason I want to have multiple tables is that, based on my information on a known key, I will apply different queries to get the results for that particular key; I don't want to touch the unknown ones. I'll save that f

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
If I put all the jar files from my local hive in the front of the spark class path, a different error was reported, as follows: 14/10/28 18:29:40 ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: PLAIN auth failed: null at org.apache.hadoop.security.S

Re: Is Spark in Java a bad idea?

2014-10-28 Thread critikaled
Hi Ron, whatever API you have in Scala you can most likely use from Java. Scala is interoperable with Java and vice versa. Scala, being both object-oriented and functional, will make your job on the JVM easier, and it is more concise than Java. Take it as an opportunity and start learning Scala ;). -

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions guys?? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-JavaSchemaRDD-inherit-the-Hive-partitioning-of-data-tp17410p17539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Helena Edelson
Hi Harold, It seems like, based on your previous post, you are using one version of the connector as a dependency yet building the assembly jar from master? You were using 1.1.0-alpha3 (you can upgrade to alpha4, beta coming this week) yet your assembly is spark-cassandra-connector-assembly-1.2.

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has Java-fri
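For reference, the Scala-side usage such a wrapper would delegate to looks roughly like this (connection details are placeholders; the query must contain the two '?' bound-parameter placeholders):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc", "user", "pass"),
      "SELECT id, name FROM documents WHERE id >= ? AND id <= ?",
      1L, 1783059L, 10,                                    // lower bound, upper bound, partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))  // map each row to a tuple
    )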
