Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-28 Thread Sean Owen
It might kind of work, but you are effectively making all of your workers into mini, separate Spark drivers in their own right. This might cause snags down the line as this isn't the normal thing to do. On Tue, Oct 28, 2014 at 12:11 AM, Localhost shell wrote: > Hey lordjoe, > > Apologies for the

Re: Is Spark the right tool?

2014-10-28 Thread Akhil
You can use Spark Streaming to get the transactions from those TCP connections periodically, and you can push the data into HBase accordingly. Now, regarding the querying part, you can use a database like Redis, which actually does the key-value storing for you. You can use the RDDs to query (insert,

Why RDD is not cached?

2014-10-28 Thread shahab
Hi, I have a standalone Spark cluster, where each executor is set to have 6.3 G memory; as I am using two workers, in total there is 12.6 G memory and 4 cores. I am trying to cache an RDD with an approximate size of 3.2 G, but apparently it is not cached, as I can neither see " BlockManagerMasterActor: Adde

Re: Why RDD is not cached?

2014-10-28 Thread Jagat Singh
What setting are you using for persist() or cache()? http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence On Tue, Oct 28, 2014 at 6:18 PM, shahab wrote: > Hi, > > I have a standalone spark , where the executor is set to have 6.3 G memory > , as I am using two workers so in
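For reference, a minimal sketch of the persistence API the guide above describes (the input path is a placeholder; cache() is simply persist(StorageLevel.MEMORY_ONLY)):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///some/path")   // placeholder input
    rdd.persist(StorageLevel.MEMORY_AND_DISK)    // or rdd.cache() for MEMORY_ONLY
    rdd.count()                                  // nothing is cached until an action runs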

Re: sampling in spark

2014-10-28 Thread Chengi Liu
Oops, the reference for the above code: http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945 On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu wrote: > Hi, > I have three rdds.. X,y and p > X is matrix rdd (mXn), y is (mX1) dimension ve

sampling in spark

2014-10-28 Thread Chengi Liu
Hi, I have three RDDs: X, y and p. X is a matrix RDD (mXn), y is an (mX1)-dimension vector and p is an (mX1)-dimension probability vector. Now, I am trying to sample k rows from X and the corresponding entries in y based on the probability vector p. Here is the python implementation: import random; from bisect impo

Re: Why RDD is not cached?

2014-10-28 Thread Sean Owen
Did you just call cache()? By itself it does nothing, but once an action requires it to be computed, it should become cached. On Oct 28, 2014 8:19 AM, "shahab" wrote: > Hi, > > I have a standalone spark , where the executor is set to have 6.3 G memory > , as I am using two workers so in total there

Re: sampling in spark

2014-10-28 Thread Davies Liu
_cumm = [p[0]] for i in range(1, len(p)): _cumm.append(_cumm[-1] + p[i]) index = set([bisect(_cumm, random.random()) for i in range(k)]) chosed_x = X.zipWithIndex().filter(lambda (v, i): i in index).map(lambda (v, i): v) chosed_y = [v for i, v in

Re: sampling in spark

2014-10-28 Thread Chengi Liu
Is there an equivalent way of doing the following: a = [1,2,3,4]; reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:] ?? The issue with the above suggestion is that the population is a hefty data structure :-/ On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu wrote: > _cumm = [p[0]] > for i in r
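For what it's worth, the Python reduce above just builds a running (cumulative) sum of the list; a local Scala equivalent of the same computation would be scanLeft:

    val a = List(1, 2, 3, 4)
    val cumm = a.scanLeft(0)(_ + _).tail   // List(1, 3, 6, 10), same as the reduce(...)[1:] trick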

Singapore Meetup

2014-10-28 Thread Social Marketing
Dear Sir/Madam, This is Songtao, living in Singapore and doing some research on big data projects at NUS. I want to be an organiser for a Singapore Meetup. Thanks. Songtao - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: Why RDD is not cached?

2014-10-28 Thread shahab
I used cache followed by a "count" on the RDD to ensure that caching is performed. val rdd = srdd.flatMap(mapProfile_To_Sessions).cache val count = rdd.count //so at this point the RDD should be cached, right? On Tue, Oct 28, 2014 at 8:35 AM, Sean Owen wrote: > Did you just call cache()? By itself

Submiting Spark application through code

2014-10-28 Thread sivarani
Hi, I am submitting a Spark application in the following fashion: bin/spark-submit --class "NetworkCount" --master spark://abc.test.com:7077 try/simple-project/target/simple-project-1.0-jar-with-dependencies.jar But is there any other way to submit a Spark application through code? Like, for ex

Re: Spark Streaming Applications

2014-10-28 Thread sivarani
Hi tdas, is it possible to run Spark 24/7? I am using updateStateByKey and I am streaming 3 lakh records in 1/2 hr. I am not getting the correct result, and I am also not able to run Spark Streaming 24/7: after a few hrs I get an array out of bounds exception even if I am not streaming anything. Btw will

Re: Spark Streaming - How to remove state for key

2014-10-28 Thread sivarani
I am having the same issue. I am using updateStateByKey, and over a period a set of data will not change any more; I would like to save it and delete it from the state. Have you found the answer? Please share your views. Thanks for your time -- View this message in context: http://apache-spark-user-list.100156
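One approach worth knowing here, though it is not spelled out in this thread: the update function given to updateStateByKey can return None for a key, which removes that key's state. A hedged sketch (the types and the "finished" flag are illustrative):

    import org.apache.spark.streaming.StreamingContext._

    // pairs: DStream[(String, Long)] built earlier in the job (hypothetical)
    case class SessionState(total: Long, finished: Boolean)

    val updateFunc = (newValues: Seq[Long], state: Option[SessionState]) => {
      val current = state.getOrElse(SessionState(0L, finished = false))
      val updated = current.copy(total = current.total + newValues.sum)
      if (updated.finished) None    // returning None drops the key from the state stream
      else Some(updated)
    }

    val stateStream = pairs.updateStateByKey[SessionState](updateFunc)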

Re: Submiting Spark application through code

2014-10-28 Thread Akhil Das
How about directly running it? val ssc = new StreamingContext("local[2]","Network WordCount",Seconds(5), "/home/akhld/mobi/localclusterxx/spark-1") val lines=ssc.socketTextStream("localhost", 12345) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x
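The snippet above is cut off by the archive; a self-contained sketch of the same word-count idea (host, port, and batch interval are placeholders) looks roughly like:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(5))
    val lines = ssc.socketTextStream("localhost", 12345)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()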

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-28 Thread buring
Yes, I use standalone mode. I have set "spark.io.compression.codec" in code: conf.set("spark.io.compression.codec","org.apache.spark.io.LZ4CompressionCodec") It seems to have no influence on "saveAsSequenceFile", which still uses snappy compression internally. Thanks. -- Vi
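Worth noting: spark.io.compression.codec only governs Spark's internal compression (shuffle and serialized RDD blocks); what saveAsSequenceFile writes is governed by the Hadoop output-compression settings, and a codec can also be passed explicitly. A hedged sketch (RDD name and path are placeholders):

    import org.apache.hadoop.io.compress.DefaultCodec

    // data: RDD[(String, Int)] -- choose the output codec explicitly instead of the cluster default
    data.saveAsSequenceFile("hdfs:///tmp/out", Some(classOf[DefaultCodec]))

    // or turn output compression off entirely via the Hadoop configuration
    sc.hadoopConfiguration.set("mapred.output.compress", "false")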

Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
Hello, I am trying to reduce the number of java threads (about 80 on my system) to as few as possible. What settings can be done in spark-1.1.0/conf/spark-env.sh ? (or other places as well) I am also using hadoop for storing data on hdfs Thank you, Wanda

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-28 Thread Shixiong Zhu
I mean updating the spark conf not only in the driver, but also in the Spark Workers. Because the driver configurations cannot be read by the Executors, they still use the default spark.io.compression.codec to deserialize the tasks. Best Regards, Shixiong Zhu 2014-10-28 16:39 GMT+08:00 buring :

How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
I seem to recall there were some specific requirements on how to import the implicits. Here is the issue: scala> import org.apache.spark.mllib.rdd.RDDFunctions._ :10: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import org.apache.spark.

Re: what classes are needed to register in KryoRegistrator, e.g. Row?

2014-10-28 Thread Fengyun RAO
Although nobody has answered: as I tested, Row, MutableValue and their subclasses are not registered by default, which I think they should be, since they will absolutely show up in Spark SQL. 2014-10-26 23:43 GMT+08:00 Fengyun RAO : > In Tuning Spark ,
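For anyone hitting the same thing, a minimal registrator sketch; which concrete Row implementations are worth registering depends on your Spark SQL version, so the class names below are illustrative:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // register whatever Row/MutableValue implementations show up in your stack traces
        kryo.register(Class.forName("org.apache.spark.sql.catalyst.expressions.GenericRow"))
        kryo.register(classOf[Array[Any]])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")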

Re: Submiting Spark application through code

2014-10-28 Thread sivarani
Hi, I know we can create a streaming context with new JavaStreamingContext(master, appName, batchDuration, sparkHome, jarFile), but to run the application we will have to use spark-home/spark-submit --class NetworkCount. I want to skip submitting manually; I want to invoke this Spark app when a conditio

Re: Spark SQL reduce number of java threads

2014-10-28 Thread Prashant Sharma
What is the motivation behind this ? You can start with master as local[NO_OF_THREADS]. Reducing the threads at all other places can have unexpected results. Take a look at this. http://spark.apache.org/docs/latest/configuration.html. Prashant Sharma On Tue, Oct 28, 2014 at 2:08 PM, Wanda Hawk

Re: Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
I am trying to get a software trace and I need to get the number of active threads as low as I can in order to inspect the "active" part of the workload From: Prashant Sharma To: Wanda Hawk Cc: "user@spark.apache.org" Sent: Tuesday, October 28, 2014 11:17 A

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
Because org.apache.spark.mllib.rdd.RDDFunctions is an mllib-private class, it can only be used by functions inside mllib. 2014-10-28 17:09 GMT+08:00 Stephen Boesch : > I seem to recall there were some specific requirements on how to import > the implicits. > > Here is the issue: > > scala> impor

How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
Hi, I am currently struggling with how to properly set up Spark to perform only one map, flatMap, etc. at once. In other words, my map uses a multi-core algorithm, so I would like to have only one map running at a time so that it can use all of the machine's cores. Thank you in advance for advice and replies.  Jan 

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
HI Yanbo, That is not the issue: notice that importing the object is fine: scala> import org.apache.spark.mllib.rdd.RDDFunctions import org.apache.spark.mllib.rdd.RDDFunctions scala> import org.apache.spark.mllib.rdd.RDDFunctions._ :11: error: object RDDFunctions in package rdd cannot be acces

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-28 Thread Stephen Boesch
I had an offline discussion with Akhil, but this issue is still not resolved. 2014-10-24 0:18 GMT-07:00 Akhil Das : > Make sure the guava jar > is > present in the classpath. > > Thanks > Best Regards > > On Thu, Oct 23, 2014 at 2:13 PM, Stephe

SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
Hi, friends: I use Spark SQL (Spark 1.1) to operate on data in hive-0.12, and the job fails when the data is large. So how do I tune it? spark-defaults.conf: spark.shuffle.consolidateFiles true spark.shuffle.manager SORT spark.akka.threads 4 spark.sql.inMemoryColumnarStorage.compressed true

Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Mars Max
Currently we are using Hive in some products; however, it seems Spark SQL may be a better choice. Is there any official comparison between them? Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-There-Any-Benchmarks-Comparing-Spark-SQL-and-Hiv

Re: Spark Shell strange worker Exception

2014-10-28 Thread Saket Kumar
Hi Paolo, The custom classes and jars are distributed across the Spark cluster via an HTTP server on the master when the absolute path of the application fat jar is specified in the spark-submit script. The Advanced Dependency Management section on https://spark.apache.org/docs/latest/submittin

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Cheng Lian
Which version of Spark and Hadoop are you using? Could you please provide the full stack trace of the exception? On Tue, Oct 28, 2014 at 5:48 AM, Du Li wrote: > Hi, > > I was trying to set up Spark SQL on a private cluster. I configured a > hive-site.xml under spark/conf that uses a local met

Re: NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-28 Thread Sasi
Thank you Akhil. You are correct; it's about overlapping "thrift" libraries. We have taken reference from the http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3Cdf6cde12e07c47f58bc6829a7c2167d6%40CHCXEXCHMBX001.SEA.CORP.EXPECN.com%3E link and listed libraries in the following order - a) cassa

How many executor process does an application receives?

2014-10-28 Thread shahab
Hi, I am running a standalone Spark cluster, 2 workers each with 2 cores. I submit one Spark application to the cluster, and I monitor the execution process via the UI (both worker-ip:8081 and master-ip:4040). There I can see that the application is handled by many Executors; in my case one worker has

Re: Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Yanbo Liang
You can refer to comparisons between different SQL-on-Hadoop solutions such as Hive, Spark SQL, Shark, Impala and so on. There are two main works, which may not be entirely objective, for your reference: Cloudera benchmark: http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4

Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
Hi, I got the following exceptions when using Spray client to write to OpenTSDB using its REST API. Exception in thread "pool-10-thread-2" java.lang.NoSuchMethodError: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext; It worked locally in my Intellij but failed when I laun

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Arpit Kumar
Any help regarding this issue please? Regards, Arpit On Sat, Oct 25, 2014 at 8:56 AM, Arpit Kumar wrote: > Hi all, > I am using the GraphLoader class to load graphs from edge list files. But > then I need to change the storage level of the graph to something other > than MEMORY_ONLY. > > val g

newbie question quickstart example sbt issue

2014-10-28 Thread nl19856
Hi, I have downloaded the binary spark distribution. When building the package with sbt package I get the following: [root@nlvora157 ~]# sbt package [info] Set current project to Simple Project (in build file:/root/) [info] Updating {file:/root/}root... [info] Resolving org.apache.spark#spark-core_

Re: newbie question quickstart example sbt issue

2014-10-28 Thread Yanbo Liang
Maybe you have a wrong sbt proxy configuration. 2014-10-28 18:27 GMT+08:00 nl19856 : > Hi, > I have downloaded the binary spark distribution. > When building the package with sbt package I get the following: > [root@nlvora157 ~]# sbt package > [info] Set current project to Simple Project (in buil

Re: newbie question quickstart example sbt issue

2014-10-28 Thread nl19856
Sigh! Sorry I did not read the error message properly. 2014-10-28 11:39 GMT+01:00 Yanbo Liang [via Apache Spark User List] < ml-node+s1001560n17478...@n3.nabble.com>: > Maybe you had wrong configuration of sbt proxy. > > 2014-10-28 18:27 GMT+08:00 nl19856 <[hidden email] >

Re: newbie question quickstart example sbt issue

2014-10-28 Thread Akhil Das
Your proxy/dns could be blocking it. Thanks Best Regards On Tue, Oct 28, 2014 at 4:06 PM, Yanbo Liang wrote: > Maybe you had wrong configuration of sbt proxy. > > 2014-10-28 18:27 GMT+08:00 nl19856 : > >> Hi, >> I have downloaded the binary spark distribution. >> When building the package with

Re: NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-28 Thread Sasi
Add my message. On Tue, Oct 28, 2014 at 3:22 PM, Sasi [via Apache Spark User List] < ml-node+s1001560n17471...@n3.nabble.com> wrote: > Thank you Akhil. You are correct it's about overlapped "thrift" libraries. > We have taken reference from > http://mail-archives.apache.org/mod_mbox/spark-user/20

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
The number of tasks is decided by the number of input partitions. If you want only one map or flatMap at once, just call coalesce() or repartition() to consolidate the data into one partition. However, this is not recommended because it will not execute efficiently in parallel. 2014-10-28 17:27 GMT+08:00 : > H

Re: SparkSql OutOfMemoryError

2014-10-28 Thread Yanbo Liang
Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo : > Hi,friends: > > I use spark(spark 1.1) sql operate data in hive-0.12, and the job fails > when data is large. So how to tune it ? > > spark-defaults.conf: > > spark.shuffle.consolidateFiles true > spark.shuffle
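For reference, one way to raise the driver memory (the values are illustrative; since the setting applies when the driver JVM is launched, the flag or spark-defaults.conf is the usual route):

    # spark-defaults.conf
    spark.driver.memory   4g

    # or at submission time
    ./bin/spark-submit --driver-memory 4g --class MySqlJob myjob.jar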

Re: Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Mars Max
Got it, thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-There-Any-Benchmarks-Comparing-Spark-SQL-and-Hive-tp17469p17484.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
Hi Arpit, To try this: val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions, edgeStorageLevel = StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK) Best, Yifan LI On 28 Oct 2014, at 11:17, Arpit Kumar wrote: > Any help re

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
But I guess that this makes only one task over all the cluster's nodes. I would like to run several tasks, but I would like Spark to not run more than one map on each of my nodes at one time. That means I would like to, let's say, have 4 different tasks and 2 nodes where each node has 2 cores. Cur

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Arpit Kumar
Hi Yifan LI, I am currently working on Spark 1.0 in which we can't pass edgeStorageLevel as parameter. It implicitly caches the edges. So I am looking for a workaround. http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.graphx.GraphLoader$ Regards, Arpit On Tue, Oct 28, 201

sbt error building spark : [FATAL] Non-resolvable parent POM:

2014-10-28 Thread nl19856
Hi, I have cloned sparked as: git clone g...@github.com:apache/spark.git cd spark sbt/sbt compile Everything seems to go smooth until : [info] downloading https://repo1.maven.org/maven2/org/ow2/asm/asm-tree/5.0.3/asm-tree-5.0.3.jar ... [info] [SUCCESSFUL ] org.ow2.asm#asm-tree;5.0.3!asm-tree.jar

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
It's not very difficult to implement by properly setting the application's parameters. Some basic knowledge you should know: an application can have only one executor on each machine or container (YARN). So if you just set executor-cores to 1, each executor will run only one task at a time. 2014-10-28 1
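A sketch of what that looks like at submission time on YARN (flag names per the spark-submit help of that era; for the standalone mode discussed in this thread, the per-worker core count is instead set with SPARK_WORKER_CORES in spark-env.sh):

    # one core per executor => at most one running task per executor at a time
    ./bin/spark-submit --master yarn-cluster \
      --num-executors 2 --executor-cores 1 \
      --class MyJob myjob.jar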

Suitability for spark for master worker distributed patterns...

2014-10-28 Thread Sasha Kacanski
Hi, Did anyone try to replace a GigaSpaces implementation of master-worker with a Spark standalone or Hadoop-driven implementation? I guess I am looking to find out what the pros and cons are and whether people have tried it on the production side (grid or hadoop) Regards, -- Aleksandar Kacanski

Re: How many executor process does an application receives?

2014-10-28 Thread Yanbo Liang
An application can have only one executor on each machine or container (YARN). How many threads each executor has is determined by the parameter "executor-cores". There is also another way to set this: you can specify "total-executor-cores", and each executor's cores will be determin

Re: Re: SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
It works, thanks very much Zhanfeng Huo From: Yanbo Liang Date: 2014-10-28 18:50 To: Zhanfeng Huo CC: user Subject: Re: SparkSql OutOfMemoryError Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo : Hi,friends: I use spark(spark 1.1) sql operate data in hive-0.12, and

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
Yes, you can import org.apache.spark.mllib.rdd.RDDFunctions, but you cannot use any method in this class or even construct an object of this class. So I infer that if you import org.apache.spark.mllib.rdd.RDDFunctions._, it may call some method of that object. 2014-10-28 17:29 GMT+08:00 Stephen Boesch :

Re: Batch of updates

2014-10-28 Thread Kamal Banga
Hi Flavio, Doing batch += ... shouldn't work. It will create a new batch for each element in myRDD (also, val initializes an immutable variable; var is for mutable variables). You can use something like accumulators. val a

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
I am not sure if it can work on Spark 1.0, but give it a try. or, Maybe you can try: 1) to construct the edges and vertices RDDs respectively with desired storage level. 2) then, to obtain a graph by using Graph(verticesRDD, edgesRDD). Best, Yifan LI On 28 Oct 2014, at 12:10, Arpit Kumar wr
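A hedged sketch of that second suggestion: build the edge RDD yourself, persist it at the desired level, and construct the graph from it (the file format and attribute types are illustrative, and GraphX may still cache internally in 1.0, which is exactly the limitation this thread is about):

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.storage.StorageLevel

    val edges = sc.textFile(edgesFile).map { line =>
      val fields = line.split("\\s+")
      Edge(fields(0).toLong, fields(1).toLong, 1)   // srcId, dstId, edge attribute
    }.persist(StorageLevel.MEMORY_AND_DISK)

    val graph = Graph.fromEdges(edges, defaultValue = 1)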

How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread shahab
I am running a standalone Spark cluster, 2 workers each with 2 cores. Apparently, I am loading and processing a relatively large chunk of data, so I receive task failures. As I read from some posts and discussions in the mailing list, the failures could be related to the large size of process

GraphX StackOverflowError

2014-10-28 Thread Zuhair Khayyat
Dear All, I am using the connected components function of GraphX (on Spark 1.0.2) on some graph. However, for some reason it fails with a StackOverflowError. The graph is not too big; it contains 1 vertices and 50 edges. Can anyone help me avoid this error? Below is the output of Spark: 14/

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread Wanda Hawk
Is this what you are looking for? In Shark, the default reducer number is 1 and is controlled by the property mapred.reduce.tasks. Spark SQL deprecates this property in favor of spark.sql.shuffle.partitions, whose default value is 200. Users may customize this property via SET: SET spark.sql.shuffl

Deploying Spark on Stand alone cluster

2014-10-28 Thread TravisJ
I am trying to setup Apache-Spark on a small standalone cluster (1 Master Node and 8 Slave Nodes). I have installed the "pre-built" version of spark 1.1.0 built on top of Hadoop 2.4. I have set up the passwordless ssh between nodes and exported a few necessary environment variables. One of these va

Re: Is Spark the right tool?

2014-10-28 Thread Koert Kuipers
spark can definitely very quickly answer queries like "give me all transactions with property x". and you can put a http query server in front of it and run queries concurrently. but spark does not support inserts, updates, or fast random access lookups. this is because RDDs are immutable and desi

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread shahab
Thanks for the useful comment. But I guess this setting applies only when I use SparkSQL, right? Are there any similar settings for Spark? best, /Shahab On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk wrote: > Is this what are you looking for ? > > In Shark, default reducer number is 1 and is contr

Re: What executes on worker and what executes on driver side

2014-10-28 Thread Kamal Banga
Can you please elaborate, I didn't get what you intended for me to read in that link. Regards. On Mon, Oct 20, 2014 at 7:03 PM, Saurabh Wadhawan < saurabh.wadha...@guavus.com> wrote: > What about: > > > http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_

Streaming window operations not producing output

2014-10-28 Thread diogo
Hi there, I'm trying to use window operations on streaming, but every time I perform a windowed computation, I stop getting results. For example: val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() will print the output to stdout at the 'batch duration' interval. Now if I replace it wit
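For comparison, a sketch of the windowed form; both the window and slide durations must be multiples of the batch interval, and the variant that takes an inverse reduce function additionally requires ssc.checkpoint(...) to be set:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._

    // with a 10s batch interval: counts over the last 30s, emitted every 10s
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()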

Ending a job early

2014-10-28 Thread Jim Carroll
We have some very large datasets where the calculations converge on a result. Our current implementation allows us to track how quickly the calculations are converging and end the processing early. This can significantly speed up some of our processing. Is there a way to do the same thing in Spark

pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Csaba Ragany
Dear Spark Community, Is it possible to convert text files (.log or .txt files) into sequencefiles in Python? Using PySpark I can create a parallelized file with rdd=sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with rdd.saveAsSequenceFile(). But how can I put the whole cont

Re: Measuring Performance in Spark

2014-10-28 Thread mahsa
Thanks Akhil, So there is no tool that I can use right? My program is overloading some operators for some operation on images. I need to be accurate in the result. I try to work on your offered approach. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

install sbt

2014-10-28 Thread Pagliari, Roberto
Is there a repo or some kind of instruction about how to install sbt for centos? Thanks,

java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036842471144

2014-10-28 Thread Ruebenacker, Oliver A
Hello, I have a Spark app which I run with master "local[3]". When running without any persist calls, it seems to work fine, but as soon as I add persist calls (at default storage level), it fails at the first persist call with the message below. Unfortunately, I can't post the code. Po

Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Brett Antonides
Hello, Given the following example customers.json file: { "name": "Sherlock Holmes", "customerNumber": 12345, "address": { "street": "221b Baker Street", "city": "London", "zipcode": "NW1 6XE", "country": "United Kingdom" } }, { "name": "Big Bird", "customerNumber": 10001, "address": { "street": "
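Not an answer from the thread, but the way this usually looks with the SQL string syntax (the pure language-integrated DSL is less direct about nested fields), assuming Spark SQL 1.1 and the customers.json above:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val customers = sqlContext.jsonFile("customers.json")   // schema inference includes the nested address struct
    customers.registerTempTable("customers")

    // nested fields are addressed with dot notation in the SQL dialect
    val londoners = sqlContext.sql(
      "SELECT name, address.city FROM customers WHERE address.country = 'United Kingdom'")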

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Ilya Ganelin
Hi all - I've simplified the code so now I'm literally feeding in 200 million ratings directly to ALS.train. Nothing else is happening in the program. I've also tried with both the regular serializer and the KryoSerializer. With Kryo, I get the same ArrayIndex exceptions. With the regular serializ

Re: Batch of updates

2014-10-28 Thread Sean Owen
You should use foreachPartition, and take care to open and close your connection following the pattern described in: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E Within a partition, you iterate over elemen
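The shape of the pattern Sean is pointing to, roughly (the connection type and upsert helper are placeholders for whatever store you are writing to):

    // records: RDD[MyRecord]; createConnection/upsert are hypothetical helpers
    records.foreachPartition { iter =>
      val conn = createConnection()              // one connection per partition, not per element
      try {
        iter.grouped(500).foreach(batch => upsert(conn, batch))   // send updates in batches
      } finally {
        conn.close()
      }
    }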

Re: install sbt

2014-10-28 Thread Ted Yu
Have you read this ? http://lancegatlin.org/tech/centos-6-install-sbt On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto wrote: > Is there a repo or some kind of instruction about how to install sbt for > centos? > > > > Thanks, > > >

Re: install sbt

2014-10-28 Thread Nicholas Chammas
If you're just calling sbt from within the spark/sbt folder, it should download and install automatically. Nick On Tuesday, October 28, 2014, Ted Yu wrote: > Have you read this ? > http://lancegatlin.org/tech/centos-6-install-sbt > > On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto < > rpagli...@app

Re: install sbt

2014-10-28 Thread Soumya Simanta
sbt is just a jar file. So you really don't need to install anything. Once you run the jar file (sbt-launch.jar) it can download the required dependencies. I use an executable script called sbt that has the following contents. SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled -

Saving to Cassandra from Spark Streaming

2014-10-28 Thread Harold Nguyen
Hi all, I'm having trouble troubleshooting this particular block of code for Spark Streaming and saving to Cassandra: val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x,

JdbcRDD in Java

2014-10-28 Thread Ron Ayoub
The following line of code is indicating the constructor is not defined. The only examples I can find of usage of JdbcRDD is Scala examples. Does this work in Java? Is there any examples? Thanks. JdbcRDD rdd = new JdbcRDD(sp, () -> ods.getConnection(), sql, 1, 1783059, 1

Re: Saving to Cassandra from Spark Streaming

2014-10-28 Thread Gerard Maas
Looks like you're having some classpath issues. Are you providing your spark-cassandra-driver classes to your job? sparkConf.setJars(Seq(jars...)) ? On Tue, Oct 28, 2014 at 5:34 PM, Harold Nguyen wrote: > Hi all, > > I'm having trouble troubleshooting this particular block of code for Spark > S
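For completeness, the programmatic counterpart of passing --jars, which ships the listed jars to the executors (the path is a placeholder):

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .setAppName("SaveToCassandra")
      .setJars(Seq("/path/to/spark-cassandra-connector-assembly.jar"))   // distributed to the executors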

Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba, It sounds like the API you are looking for is sc.wholeTextFiles :) Cheers, Holden :) On Tuesday, October 28, 2014, Csaba Ragany wrote: > Dear Spark Community, > > Is it possible to convert text files (.log or .txt files) into > sequencefiles in Python? > > Using PySpark I can create

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread Ilya Ganelin
In Spark, certain functions have an optional parameter to determine the number of partitions (distinct, textFile, etc.). You can also use the coalesce() or repartition() functions to change the number of partitions for your RDD. Thanks. On Oct 28, 2014 9:58 AM, "shahab" wrote: > Thanks for the u
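A few hedged examples of those knobs (paths and numbers are illustrative):

    // ask for a minimum number of input partitions up front
    val data = sc.textFile("hdfs:///big/input", 64)

    // shrink the partition count without a shuffle, or rebalance with one
    val fewer = data.coalesce(16)
    val rebalanced = data.repartition(128)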

Re: Scala Spark IDE help

2014-10-28 Thread Matt Narrell
So, I'm using IntelliJ 13.x and Scala Spark jobs. Make sure you have singletons (objects, not classes), then simply debug the main function. You’ll need to set your master to some derivation of “local”, but that's it. Spark Streaming is kinda wonky when debugging, but data-at-rest behaves like

Re: Scala Spark IDE help

2014-10-28 Thread andy petrella
Also, I'm following two master's students at the University of Liège (one computing conditional probability densities on massive data and the other implementing a Markov Chain method on georasters). I proposed that they use the Spark-Notebook to learn the framework, and they're quite happy with it (so far at lea

Re: Keep state inside map function

2014-10-28 Thread Koert Kuipers
doing cleanup in an iterator like that assumes the iterator always gets fully read, which is not necessary the case (for example RDD.take does not). instead i would use mapPartitionsWithContext, in which case you can write a function of the form. f: (TaskContext, Iterator[T]) => Iterator[U] now
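A sketch of that suggestion (mapPartitionsWithContext is a developer API in Spark 1.x; setup/cleanup/process are placeholder functions):

    val result = rdd.mapPartitionsWithContext { (context, iter) =>
      val resource = setup()                                   // e.g. open a connection or parser
      context.addOnCompleteCallback(() => cleanup(resource))   // runs even if the iterator is not fully consumed
      iter.map(x => process(resource, x))
    }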

Re: Spark to eliminate full-table scan latency

2014-10-28 Thread Matt Narrell
I’ve been puzzled by this lately. I too would like to use the thrift server to provide JDBC style access to datasets via SparkSQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use SparkSQL to query those tables while the contex

Re: Submiting Spark application through code

2014-10-28 Thread Matt Narrell
Can this be done? Can I just spin up a SparkContext programmatically, point this to my yarn-cluster and this works like spark-submit?? Doesn’t (at least) the application JAR need to be distributed to the workers via HDFS or the like for the jobs to run? mn > On Oct 28, 2014, at 2:29 AM, Akhi

real-time streaming

2014-10-28 Thread ll
The Spark tutorial shows that we can create a stream that reads "new files" from a directory. That seems to have some lag time, as we have to write the data to a file first and then wait until the Spark stream picks it up. What is the best way to implement REAL 'REAL-TIME' streaming for analysis in r

Re: real-time streaming

2014-10-28 Thread jay vyas
A REAL TIME stream, by definition, delivers data every X seconds. You can easily do this with Spark. Roughly, here is the way to create a stream gobbler and attach a Spark app to read its data every X seconds - Write a Runnable thread which reads data from a source. Test that it works indepen
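If the source can push records over a socket or be polled directly, a custom Receiver is the usual streaming-API shape for this; a minimal hedged sketch (readFromSource is a stand-in for your feed):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class SourceReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {
      def onStart() {
        new Thread("source-reader") {
          override def run() {
            while (!isStopped) {
              store(readFromSource())   // hand each record to Spark Streaming as it arrives
            }
          }
        }.start()
      }
      def onStop() { }
      private def readFromSource(): String = { Thread.sleep(100); "record" }   // placeholder
    }

    // val stream = ssc.receiverStream(new SourceReceiver)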

Re: real-time streaming

2014-10-28 Thread ll
Thanks Jay. Do you think Spark is a good fit for handling streaming & analyzing videos in real time? In this case, we're streaming 30 frames per second, and each frame is an image (size: roughly 500K - 1MB). We need to analyze every frame and return the analysis result back instantly in real ti

Re: JdbcRDD in Java

2014-10-28 Thread Sean Owen
That declaration looks OK for Java 8, at least when I tried it just now vs master. The only thing I see wrong here is getInt throws an exception which means the lambda has to be more complicated than this. This is Java code here calling the constructor so yes it can work fine from Java (8). On Tue

Re: Spark Streaming and Storm

2014-10-28 Thread critikaled
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-Storm-tp9118p17530.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
Here's the answer I got from Akka's user ML. """ This looks like a binary incompatibility issue. As far as I know Spark is using a custom built Akka and Scala for various reasons. You should ask this on the Spark mailing list, Akka is binary compatible between major versions (2.3.6 is compatible

Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Harold Nguyen
Hi all, The following works fine when submitting dependency jars through Spark-Shell: ./bin/spark-shell --master spark://ip-172-31-38-112:7077 --jars /home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2

Re: Ending a job early

2014-10-28 Thread Patrick Wendell
Hey Jim, There are some experimental (unstable) API's that support running jobs which might short-circuit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126 This can be used for doing online aggregations like you are describing. And in one
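One related pattern on stable APIs, for what it's worth: process a few partitions at a time with SparkContext.runJob and stop once the tracked metric converges (take() is built on the same idea). A hedged sketch (hasConverged is a placeholder):

    // data: RDD[Double]
    var converged = false
    var part = 0
    while (!converged && part < data.partitions.length) {
      val partial = sc.runJob(data, (it: Iterator[Double]) => it.sum, Seq(part), allowLocal = false)
      converged = hasConverged(partial.head)
      part += 1
    }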

Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I haven't learned Scala yet, so as you might imagine I'm having challenges working with Spark from the Java API. For one thing, it seems very limited in comparison to Scala. I ran into a problem really quickly. I need to hydrate an RDD from JDBC/Oracle, and so I wanted to use the JdbcRDD. But that i

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
I am using Hadoop 2.5.0.3 and spark 1.1. My local hive version is 0.12.3 the hcatalog.jar of which is included in the path. The stack trace is as follows: 14/10/28 18:24:24 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlExceptio

Re: RDD to Multiple Tables SparkSQL

2014-10-28 Thread critikaled
Hi Oliver, thanks for the answer. I don't have the information for all keys beforehand. The reason I want to have multiple tables is that, based on my information on a known key, I will apply different queries to get the results for that particular key; I don't want to touch the unknown ones. I'll save that f

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
If I put all the jar files from my local hive in the front of the spark class path, a different error was reported, as follows: 14/10/28 18:29:40 ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: PLAIN auth failed: null at org.apache.hadoop.security.S

Re: Is Spark in Java a bad idea?

2014-10-28 Thread critikaled
Hi Ron, whatever API you have in Scala you can most likely use from Java. Scala is interoperable with Java and vice versa. Scala, being both object-oriented and functional, will make your job on the JVM easier, and it is more concise than Java. Take it as an opportunity and start learning Scala ;). -

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions guys?? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-JavaSchemaRDD-inherit-the-Hive-partitioning-of-data-tp17410p17539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Helena Edelson
Hi Harold, It seems like, based on your previous post, you are using one version of the connector as a dependency yet building the assembly jar from master? You were using 1.1.0-alpha3 (you can upgrade to alpha4, beta coming this week) yet your assembly is spark-cassandra-connector-assembly-1.2.

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has Java-fri
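For reference, the Scala-side usage such a wrapper would delegate to looks roughly like this (connection details are placeholders; the query must contain the two '?' bound-parameter placeholders):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc", "user", "pass"),
      "SELECT id, name FROM documents WHERE id >= ? AND id <= ?",
      1L, 1783059L, 10,                                    // lower bound, upper bound, partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))  // map each row to a tuple
    )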
