Re: spark streaming doubt

2015-05-20 Thread Akhil Das
Kafka >> Low Level Consumer API. >> >> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer >> >> >> Regards, >> Dibyendu >> >> On Tue, May 19, 2015 at 9:00 PM, Akhil Das >> wrote: >> >>> >>> On Tue, May 19, 2015

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
expected. It is when we are calling > collect() or toArray() methods, the exception is coming. > Something to do with Text class even though I haven't used it in the > program. > > Regards > Tapan > > On Tue, May 19, 2015 at 6:26 PM, Akhil Das > wrote: > >

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
at 5 sec it will consume it (you can also limit the rate with spark.streaming.kafka.maxRatePerPartition). Read more here https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md > > > On Wed, May 20, 2015 at 12:36 PM, Akhil Das > wrote: > >> One receiver
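A minimal sketch of that rate cap (the values and app name are assumptions; the property name is from the reply above):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("RateLimitedKafkaStream")                      // assumed app name
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // at most 1000 records/sec per Kafka partition
    val ssc = new StreamingContext(conf, Seconds(5))             // 5-second batches as discussed in the thread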

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
a sequence file:* > > http://stuartsierra.com/2008/04/24/a-million-little-files > > > Regards > > Tapan > > > > On Wed, May 20, 2015 at 12:42 PM, Akhil Das > wrote: > >> If you can share the complete code and a sample file, may be i can try to >> reproduce

Re: java program Get Stuck at broadcasting

2015-05-20 Thread Akhil Das
This is more like an issue with your HDFS setup, can you check in the datanode logs? Also try putting a new file in HDFS and see if that works. Thanks Best Regards On Wed, May 20, 2015 at 11:47 AM, allanjie wrote: > ​Hi All, > The variable I need to broadcast is just 468 MB. > > > When broadcas

Re: Spark users

2015-05-20 Thread Akhil Das
Yes, this is the user group. Feel free to ask your questions in this list. Thanks Best Regards On Wed, May 20, 2015 at 5:58 AM, Ricardo Goncalves da Silva < ricardog.si...@telefonica.com> wrote: > Hi > I'm learning spark focused on data and machine learning. Migrating from > SAS. > > There is a

Re: rdd.saveAsTextFile problem

2015-05-20 Thread Akhil Das
This thread happened a year back. Can you please share what issue you are facing? Which version of Spark are you using? What is your system environment? Exception stack-trace? Thanks Best Regards On Thu, May 21, 2015 at 12:19 PM, Keerthi wrote: > Hi , > > I had tried the workaround shared here,

Re: How to set the file size for parquet Part

2015-05-21 Thread Akhil Das
How many part files do you have? Did you try re-partitioning to a smaller number so that you end up with fewer, bigger files? Thanks Best Regards On Wed, May 20, 2015 at 3:06 AM, Richard Grossman wrote: > Hi > > I'm using spark 1.3.1 and now I can't set the size of the part generate
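A hedged sketch of that suggestion on Spark 1.3, assuming a DataFrame df and an output path; fewer partitions means fewer, larger part files:

    // 10 partitions -> roughly 10 part files (the count and path are assumptions)
    df.repartition(10).saveAsParquetFile("hdfs:///output/parquet")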

Re: Read multiple files from S3

2015-05-21 Thread Akhil Das
textFile does read all the files in a directory. We have modified the Spark Streaming code base to read nested files from S3; you can check this function
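For the simpler case, textFile already accepts comma-separated paths and glob patterns; a sketch with assumed S3 locations:

    // multiple directories and wildcards in one call (bucket and paths are assumptions)
    val logs = sc.textFile("s3n://my-bucket/2015/05/*,s3n://my-bucket/2015/06/*")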

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Akhil Das
> But why this is a hdfs issue, because I think spark broadcast the variable > in memory. > BTW, the datanode logs seems like I don't have any space to save the > storage. So the question comes back, originally I have 50GB for HDFS, why I > broadcast a variable and then that var

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
th* Variable to > add *bin* directory of *HADOOP_HOME* (say*C:\hadoop\bin*). > fix this issue in my env > > 2015-05-21 9:55 GMT+03:00 Akhil Das : > >> This thread happened a year back, can you please share what issue you are >> facing? which version of spark you are us

Re: java program got Stuck at broadcasting

2015-05-21 Thread Akhil Das
Can you try commenting out the saveAsTextFile and doing a simple count()? If it's a broadcast issue, then it would throw up the same error. On 21 May 2015 14:21, "allanjie" wrote: > Sure, the code is very simple. I think u guys can understand from the main > function. > > public class Test1 { > >

Re: Spark Memory management

2015-05-22 Thread Akhil Das
You can find the logic for offloading data from memory by looking at the ensureFreeSpace call. And dropFromMemory

Re: how to distributed run a bash shell in spark

2015-05-24 Thread Akhil Das
You mean you want to execute some shell commands from Spark? Here's something I tried a while back: https://github.com/akhld/spark-exploit Thanks Best Regards On Sun, May 24, 2015 at 4:53 PM, wrote: > hello there > > I am trying to run a app in which part of it needs to run a > shell.how

Re: Trying to connect to many topics with several DirectConnect

2015-05-24 Thread Akhil Das
I used to hit an NPE when I didn't add all the dependency jars to my context while running it in standalone mode. Can you try adding all these dependencies to your context? sc.addJar("/home/akhld/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.3.1.jar")

Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread Akhil Das
nmodifyshell) > val pipeModify = runmodifyshellRDD.pipe("sh > /opt/data/shellcompare/modify.sh") > > pipeModify.collect() > > > > //running on driver manager > val shellcompare = List("run","sort.sh") > val shellcompareRDD = sc.make

Re: How to use zookeeper in Spark Streaming

2015-05-25 Thread Akhil Das
If you want to be notified after every batch is completed, then you can simply implement the StreamingListener interface, which has methods like onBatchCompleted, onBatchStarted etc in which
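A minimal sketch of such a listener (it only prints the batch time; ssc is assumed to be your StreamingContext):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchNotifier extends StreamingListener {
      // invoked once every time a batch finishes processing
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        println("Completed batch for time " + batch.batchInfo.batchTime)
      }
    }

    ssc.addStreamingListener(new BatchNotifier)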

Re: Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Akhil Das
Try this way: object Holder extends Serializable { @transient lazy val log = Logger.getLogger(getClass.getName) } val someRdd = spark.parallelize(List(1, 2, 3)) someRdd.map { element => Holder.log.info(s"$element will be processed") element + 1

Re: IPv6 support

2015-05-25 Thread Akhil Das
Hi Kevin, Did you try adding a host name for the IPv6 address? I have a few IPv6 boxes; Spark failed for me when I used just the IPv6 addresses, but it works fine when I use the host names. Here's an entry in my /etc/hosts: 2607:5300:0100:0200::::0a4d hacked.work My spark-env.sh file: expo

Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread Akhil Das
some charactors > not useful. > > 4.run shellcompare.sh to compare chr1.txt and samplechr1.txt, get a > result1.txt. And looping it from 1 to 21 so that those 42 file are compared > and I can get 21 files like result1.txt,result2.txt...result21.txt. > > Sorry for not adding some comme

Re: Remove COMPLETED applications and shuffle data

2015-05-26 Thread Akhil Das
Try these: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size") .set("spark.executor.logs.rolling.size.maxBytes", "1024") .set("spark.executor.logs.rolling.maxRetainedFiles", "3") You can also

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Akhil Das
After submitting the job, if you do a ps aux | grep spark-submit then you can see all JVM params. Are you using the high-level consumer (receiver-based) for receiving data from Kafka? In that case if your throughput is high and the processing delay exceeds the batch interval then you will hit this memor

Re: How to give multiple directories as input ?

2015-05-27 Thread Akhil Das
How about creating two and unioning them [ sc.union(first, second) ]? Thanks Best Regards On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I have this piece > > sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, > AvroKeyInputFormat[GenericRecord]]( > "/a/b/c/d/exptsession/2015/05/2

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Akhil Das
ed about how to check the off-heap memory usage, there's a tool > called pmap, but I don't know how to interprete the results. > > On Wed, May 27, 2015 at 3:08 PM, Akhil Das > wrote: > >> After submitting the job, if you do a ps aux | grep spark-submit then you >>

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
SK_SER_2. Besides, the driver's memory is also growing. I > don't think Kafka messages will be cached in driver. > > > On Thu, May 28, 2015 at 12:24 AM, Akhil Das > wrote: > >> Are you using the createStream or createDirectStream api? If its the >> former, yo

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
I do it this way: - Launch a new instance by clicking on the slave instance and choosing launch more like this - Once it's launched, ssh into it and add the master's public key to .ssh/authorized_keys - Add the slave's internal IP to the master's conf/slaves file - do sbin/start-all.sh and it will sho

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
t it too aggressive? Let's say I have 20 > slaves up, and I want to add one more, why should we stop the entire > cluster for this? > > thanks, nizan > > On Thu, May 28, 2015 at 10:19 AM, Akhil Das > wrote: > >> I do this way: >> >> - Launch a new instan

Re: Get all servers in security group in bash(ec2)

2015-05-28 Thread Akhil Das
You can use the Python boto library for that; in fact the spark-ec2 script uses it underneath. Here's the call spark-ec2 is making to get all machines under a given security group. Thanks Best Regards On Thu, May 28, 2015 at 2:22 PM, niz

Re: why does "com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException" happen?

2015-05-28 Thread Akhil Das
Can you paste the piece of code at least? Not sure, but it seems you are reading/writing an object at the same time. You can try disabling kryo and that might give you a proper exception stack. Thanks Best Regards On Thu, May 28, 2015 at 2:45 PM, randylu wrote: > begs for your help > > > > -- >

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
} > > logs.filter(_.s_id > 0).count.foreachRDD { rdd => > rdd.foreachPartition { iter => > iter.foreach(count => logger.info(count.toString)) > } > } > > It receives messages from Kafka, parse the json, filter and count the > rec

Re: Spark streaming with kafka

2015-05-29 Thread Akhil Das
Just after receiving the data from Kafka, you can do a dstream.count().print() to confirm that Spark and Kafka are not the problem. After that, the next step is to identify where the problem is: you can do the same count and print on each of the dstreams that you are creating (by transforming), and finally,
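A sketch of that debugging approach, assuming kafkaStream is the DStream returned by KafkaUtils and identity stands in for your real transformation:

    kafkaStream.count().print()                  // non-zero counts here mean Kafka -> Spark is fine
    val transformed = kafkaStream.map(identity)  // replace with the transformation you actually apply
    transformed.count().print()                  // repeat after each stage to localize the problem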

Re: import CSV file using read.csv

2015-05-31 Thread Akhil Das
If it is Spark related, then something like this? csv = sc.textFile("hdfs:///stats/test.csv").map(myFunc) And create a myFunc in which you convert the String to a CSV record and do whatever you want with it. Thanks Best Regards On Sun, May 31, 2015 at 2:50 AM, sherine ahmed wrote:

Re: RDD boundaries and triggering processing using tags in the data

2015-06-01 Thread Akhil Das
Maybe you can make use of the Window operations. Another approach would be to keep your incoming data in an HBase/Redis/Cassandra kind of database and then, whenever you need to average it, just query the
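A small sketch of the window approach, assuming a DStream of (id, value) pairs and a 60-second window sliding every 10 seconds:

    import org.apache.spark.streaming.Seconds

    // keep a running (sum, count) per key over the last minute, then derive the average
    val sumCounts = pairs.mapValues(v => (v, 1L)).reduceByKeyAndWindow(
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
      Seconds(60), Seconds(10))
    val averages = sumCounts.mapValues { case (sum, count) => sum / count }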

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Akhil Das
This thread has various methods on accessing S3 from spark, it might help you. Thanks Best Regards On Sun, May 24, 2015 at 8:03 AM, ogoh wrote: > > Hello, > I am using Spark1.3 i

Re: Cassanda example

2015-06-01 Thread Akhil Das
Here's more detailed documentation from DataStax. You can also shoot an email directly to their mailing list since it's more related to their code. Thanks Bes

Re: flatMap output on disk / flatMap memory overhead

2015-06-01 Thread Akhil Das
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...), I think StorageLevel MEMORY_AND_DISK means spark will try to keep the data in memory and if there isn't sufficient space then it will be shipped to the disk. Thanks Best Regards On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea wro

Re: using pyspark with standalone cluster

2015-06-02 Thread Akhil Das
If you want to submit applications to a remote cluster where your port 7077 is opened publicly, then you would need to set spark.driver.host (the public IP of your laptop) and spark.driver.port (optional, if there's no firewall between your laptop and the remote cluster). Keeping you
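A hedged sketch of those two properties (the master URL, IP, and port below are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("spark://remote-master:7077")    // assumed remote standalone master
      .set("spark.driver.host", "203.0.113.10")   // public IP of the machine running the driver
      .set("spark.driver.port", "51000")          // optional: fix the port if a firewall needs a known value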

Re: Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-02 Thread Akhil Das
You can try to skip the tests, try with: mvn -Dhadoop.version=2.4.0 -Pyarn -DskipTests clean package Thanks Best Regards On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch wrote: > I downloaded the 1.3.1 distro tarball > > $ll ../spark-1.3.1.tar.gz > -rw-r-@ 1 steve staff 8500861 Apr 23 0

Re: HDFS Rest Service not available

2015-06-02 Thread Akhil Das
It says your namenode is down (connection refused on 8020). You can restart your HDFS by going into the hadoop directory and typing sbin/stop-dfs.sh and then sbin/start-dfs.sh Thanks Best Regards On Tue, Jun 2, 2015 at 5:03 AM, Su She wrote: > Hello All, > > A bit scared I did something stupid...I

Re: What is shuffle read and what is shuffle write ?

2015-06-02 Thread Akhil Das
I found an interesting presentation http://www.slideshare.net/colorant/spark-shuffle-introduction and go through this thread also http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html Thanks Best Regards On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:

Re: Shared / NFS filesystems

2015-06-02 Thread Akhil Das
You can run/submit your code from one of the workers which has access to the file system and it should be fine, I think. Give it a try. Thanks Best Regards On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar wrote: > Hello! > > I have Spark running in standalone mode, and there are a bunch of worker

Re: How to read sequence File.

2015-06-02 Thread Akhil Das
Basically, you need to convert it to a serializable format before doing the collect/take. You can fire up a spark shell and paste this: val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid").map(_._2.toString) sFile.take(5).foreach(println) Use t

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
You need to look into your executor/worker logs to see whats going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee wrote: > Hi, > > What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? > How can I fix it? > > Best, > Patcharee > >

Re: Spark Client

2015-06-03 Thread Akhil Das
Run it as a standalone application. Create an sbt project and do sbt run? Thanks Best Regards On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri < pavan.kolam...@gmail.com> wrote: > Hi guys , i am new to spark . I am using sparksubmit to submit spark jobs. > But for my use case i don't want i

Re: Scripting with groovy

2015-06-03 Thread Akhil Das
I think when you do a ssc.stop it will stop your entire application, and by "update a transformation function" do you mean modifying the driver program? In that case even if you checkpoint your application, it won't be able to recover from its previous state. A simpler approach would be to add certain

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.uti

Re: Spark Client

2015-06-03 Thread Akhil Das
em.exit(NON_ZERO) when there is a failure. Question is, Is there an > alternate api though which a spark application can be launched which can > return a exit status back to the caller as opposed to initiating JVM halt. > > On Wed, Jun 3, 2015 at 12:58 PM, Akhil Das > wrote: > >>

Re: StreamingListener, anyone?

2015-06-04 Thread Akhil Das
Hi Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183 Thanks Best Regards On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg wrote: > Hi, > > I've got a Spark Streaming driver job implemented and in it, I register a > streaming listener, like so: >

Re: Python Image Library and Spark

2015-06-04 Thread Akhil Das
Replace this line: img_data = sc.parallelize( list(im.getdata()) ) With: img_data = sc.parallelize( list(im.getdata()), 3 * <number of cores you have> ) Thanks Best Regards On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur wrote: > Hi all, > > I'm playing around with manipulating images via Pyth

Re: Adding new Spark workers on AWS EC2 - access error

2015-06-04 Thread Akhil Das
That's because you need to add the master's public key (~/.ssh/id_rsa.pub) to the newly added slave's ~/.ssh/authorized_keys. I add slaves this way: - Launch a new instance by clicking on the slave instance and choosing launch more like this - Once it's launched, ssh into it and add the master p

Re: Setting S3 output file grantees for spark output files

2015-06-05 Thread Akhil Das
You could try adding the configuration in the spark-defaults.conf file. And once you run the application you can actually check on the driver UI (runs on 4040) Environment tab to see if the configuration is set properly. Thanks Best Regards On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel wrote:

Re: Saving calculation to single local file

2015-06-05 Thread Akhil Das
You can simply do rdd.repartition(1).saveAsTextFile(...), but it might not be efficient if your output data is huge since one task will be doing all the writing. Thanks Best Regards On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo wrote: > Hi all > > I'm running spark in a single local machine, no h

Re: Accumulator map

2015-06-07 Thread Akhil Das
Another approach would be to use ZooKeeper. If you have ZooKeeper running somewhere in the cluster you can simply create a path like /dynamic-list in it and then write objects/values to it; you can even create/access nested objects. Thanks Best Regards On Fri, Jun 5, 2015 at 7:06 PM, Cosmin

Re: Monitoring Spark Jobs

2015-06-07 Thread Akhil Das
It could be a CPU, IO, or network bottleneck; you need to figure out where exactly it's choking. You can use certain monitoring utilities (like top) to understand it better. Thanks Best Regards On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti wrote: > Hi All, > > I have a Spark SQL application to fetch

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Akhil Das
Which consumer are you using? If you can paste the complete code then may be i can try reproducing it. Thanks Best Regards On Sun, Jun 7, 2015 at 1:53 AM, EH wrote: > And here is the Thread Dump, where seems every worker is waiting for > Executor > #6 Thread 95: sparkExecutor-akka.actor.default

Re: Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-08 Thread Akhil Das
Can you look in your worker logs for a more detailed stack-trace? If it's about winutils.exe you can look at these links to get it resolved. - http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7 - https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Mon, Jun 8,

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
Can you do *ssh -v 192.168.1.16* from the Master machine and make sure its able to login without password? Thanks Best Regards On Mon, Jun 8, 2015 at 2:51 PM, James King wrote: > I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker) > > These two hosts have exchanged public keys so t

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
me straight in. > > On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das > wrote: > >> Can you do *ssh -v 192.168.1.16* from the Master machine and make sure >> its able to login without password? >> >> Thanks >> Best Regards >> >> On Mon, Jun 8, 20

Re: Spark error "value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]"

2015-06-08 Thread Akhil Das
Try this way: scala>val input1 = sc.textFile("/test7").map(line => line.split(",").map(_.trim)); scala>val input2 = sc.textFile("/test8").map(line => line.split(",").map(_.trim)); scala>val input11 = input1.map(x=>((x(0) + x(1)),x(2),x(3))) scala>val input22 = input2.map(x=>((x(0) + x(1)),x(2)

Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Maybe you should check in your driver UI and see if there's any GC time involved etc. Thanks Best Regards On Mon, Jun 8, 2015 at 5:45 PM, wrote: > hi there > > I am trying to descrease my app's running time in worker node. I > checked the log and found the most time-wasting part is below

Re: Saving compressed textFiles from a DStream in Scala

2015-06-09 Thread Akhil Das
Like this? myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", codec)) Thanks Best Regards On Mon, Jun 8, 2015 at 8:06 PM, Bob Corsaro wrote: > It looks like saveAsTextFiles doesn't support the compression parameter of > RDD.saveAsTextFile. Is there a way to add the functionality in
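A sketch with a concrete codec class (GzipCodec is an assumption; any Hadoop CompressionCodec works):

    import org.apache.hadoop.io.compress.GzipCodec

    myDStream.foreachRDD { rdd =>
      // one compressed output directory per batch (timestamped to avoid overwriting)
      rdd.saveAsTextFile("/sigmoid/" + System.currentTimeMillis, classOf[GzipCodec])
    }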

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
Once you submit the application, you can check in the driver UI (running on port 4040) Environment tab to see whether the jars you added got shipped or not. If they are shipped and you are still getting NoClassDef exceptions then it means you have a jar conflict which you can resolve

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
> I can see from log that the driver downloaded the application jar but not > the other jars specified by “—jars”. > > > > Or I misunderstand the usage of “--jars”, and the jars should be already > in every worker, driver will not download them? > > Is there some useful

Re: Spark error "value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]"

2015-06-09 Thread Akhil Das
; > > On Tue, Jun 9, 2015 at 1:07 PM, amit tewari > wrote: > >> Thanks Akhil, as you suggested, I have to go keyBy(route) as need the >> columns intact. >> But wil keyBy() take accept multiple fields (eg x(0), x(1))? >> >> Thanks >> Amit >> >

Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Best Regards On Tue, Jun 9, 2015 at 2:09 PM, wrote: > Only 1 minor GC, 0.07s. > > > > > Thanks&Best regards! > San.Luo > > - Original Message - > From: Akhil Das > To: 罗辉 > Cc: user > Subject: Re: How to decrease the time of storing

Re: Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
> /opt/data/shellcompare/data/user" + j + "/pgs/intermediateResult/result" + > i + ".txt 600") > > pipeModify2.collect() > > sc.stop() > } > } > > > > > Thanks&Best regards! > San.Luo

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-10 Thread Akhil Das
; \jars [*empty*] >> >> \files [*empty*] >> >> >> >> So I guess the files and jars and not properly downloaded from HDFS to >> these folders? >> >> >> >> I

Re: How to use Apache spark mllib Model output in C++ component

2015-06-10 Thread Akhil Das
Hopefully SWIG and JNA can help for accessing C++ libraries from Java. Thanks Best Regards On Wed, Jun 10, 2015 at 11:50 AM, mahesht wrote: > > There is C++ component which uses some model which we want to replace it by > spark model

Re: Spark's Scala shell killing itself

2015-06-10 Thread Akhil Das
Maybe you should update your Spark version to the latest one. Thanks Best Regards On Wed, Jun 10, 2015 at 11:04 AM, Chandrashekhar Kotekar < shekhar.kote...@gmail.com> wrote: > Hi, > > I have configured Spark to run on YARN. Whenever I start spark shell using > 'spark-shell' command, it automat

Re: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Akhil Das
RDDs are immutable; why not join two DStreams? Not sure, but you can try something like this also: kvDstream.foreachRDD(rdd => { val file = ssc.sparkContext.textFile("/sigmoid/") val kvFile = file.map(x => (x.split(",")(0), x)) rdd.join(kvFile) }) Thanks Best Regards

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
4040 is your driver port; you need to run some application. Log in to your cluster, start a spark-shell and try accessing 4040. Thanks Best Regards On Wed, Jun 10, 2015 at 3:51 PM, mrm wrote: > Hi, > > I am using Spark 1.3.1 standalone and I have a problem where my cluster is > working fine, I ca

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
Opening your 4040 manually or ssh tunneling (ssh -L 4040:127.0.0.1:4040 master-ip, and then open localhost:4040 in your browser) will work for you then. Thanks Best Regards On Wed, Jun 10, 2015 at 5:10 PM, mrm wrote: > Hi Akhil, > > Thanks for your reply! I still cannot see port 4040 in my machine

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Akhil Das
Delete the checkpoint directory, you might have modified your driver program. Thanks Best Regards On Wed, Jun 10, 2015 at 9:44 PM, Ashish Nigam wrote: > Hi, > If checkpoint data is already present in HDFS, driver fails to load as it > is performing lookup on previous application directory. As t

Re: Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread Akhil Das
Looks like the libphp version is 5.6 now; which version of Spark are you using? Thanks Best Regards On Thu, Jun 11, 2015 at 3:46 AM, barmaley wrote: > Launching using spark-ec2 script results in: > > Setting up ganglia > RSYNC'ing /etc/ganglia to slaves... > <...> > Shutting down GANGLIA gmond:

Re: Spark standalone mode and kerberized cluster

2015-06-10 Thread Akhil Das
This might help http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_installing-kerb-spark-quickstart.html Thanks Best Regards On Wed, Jun 10, 2015 at 6:49 PM, kazeborja wrote: > Hello all. > > I've been reading some old mails and notice that the use o

Re: Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-12 Thread Akhil Das
Looks like your Spark is not able to pick up the HADOOP_CONF. To fix this, you can add jets3t-0.9.0.jar to the classpath (sc.addJar("/path/to/jets3t-0.9.0.jar")). Thanks Best Regards On Thu, Jun 11, 2015 at 6:44 PM, shahab wrote: > Hi, > > I tried to read a csv file from amazon s3, but I

Re: spark stream and spark sql with data warehouse

2015-06-12 Thread Akhil Das
This is a good start, if you haven't read it already http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations Thanks Best Regards On Thu, Jun 11, 2015 at 8:17 PM, 唐思成 wrote: > Hi all: >We are trying to using spark to do some real time data proces

Re: Limit Spark Shuffle Disk Usage

2015-06-12 Thread Akhil Das
You can disable shuffle spill (spark.shuffle.spill) if you have enough memory to hold that much data. Otherwise, I believe adding more resources would be your only choice. Thanks Best Regards On Thu, Jun 11, 2015 at 9:46 PM, Al

Re: Optimizing Streaming from Websphere MQ

2015-06-12 Thread Akhil Das
How many cores are you allocating for your job? And how many receivers are you having? It would be good if you can post your custom receiver code, it will help people to understand it better and shed some light. Thanks Best Regards On Fri, Jun 12, 2015 at 12:58 PM, Chaudhary, Umesh < umesh.chaudh

Re: --jars not working?

2015-06-12 Thread Akhil Das
You can verify if the jars are shipped properly by looking at the driver UI (running on 4040) Environment tab. Thanks Best Regards On Sat, Jun 13, 2015 at 12:43 AM, Jonathan Coveney wrote: > Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos > 0.19.0)... > > Regardless, I'm r

Re: Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread Akhil Das
Yes, if you have enabled WAL and checkpointing then after the store, you can simply delete the SQS Messages from your receiver. Thanks Best Regards On Sat, Jun 13, 2015 at 6:14 AM, Michal Čizmazia wrote: > I would like to have a Spark Streaming SQS Receiver which deletes SQS > messages only aft

Re: How to split log data into different files according to severity

2015-06-13 Thread Akhil Das
Are you looking for something like filter? See a similar example here https://spark.apache.org/examples.html Thanks Best Regards On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang wrote: > Hi, > > I have a bunch of large log files on Hadoop. Each line contains a log and > its severity. Is there a way th
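A minimal filter sketch, assuming each log line contains its severity as plain text and the paths below are placeholders:

    val logs = sc.textFile("hdfs:///logs/app.log").cache()   // cache so the three passes don't re-read the file
    logs.filter(_.contains("ERROR")).saveAsTextFile("hdfs:///logs/split/error")
    logs.filter(_.contains("WARN")).saveAsTextFile("hdfs:///logs/split/warn")
    logs.filter(_.contains("INFO")).saveAsTextFile("hdfs:///logs/split/info")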

Re: Are there ways to restrict what parameters users can set for a Spark job?

2015-06-13 Thread Akhil Das
I think the straight answer would be No, but yes you can actually hardcode these parameters if you want. Look in the SparkContext.scala where all these properties are being initi

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Akhil Das
Have a look here https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng wrote: > Hi, Spark Experts > > I have played with Spark several weeks, after some time testing, a reduce > operation of DataFrame cost 40s on a cluster with 5 d

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Akhil Das
I think it should be fine, that's the whole point of check-pointing (in case of driver failure etc). Thanks Best Regards On Mon, Jun 15, 2015 at 6:54 AM, Haopu Wang wrote: > Hi, can someone help to confirm the behavior? Thank you! > > -Original Message- > From: Haopu Wang > Sent: Friday

Re: How to set up a Spark Client node?

2015-06-15 Thread Akhil Das
I'm assuming by spark-client you mean the spark driver program. In that case you can pick any machine (say Node 7), create your driver program in it and use spark-submit to submit it to the cluster, or if you create the SparkContext within your driver program (specifying all the properties) then you

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this? val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0))) val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x)) val joined_data = huge_data.join(gender_data) joined_data.take(1000) It's Scala btw, Python a

Re: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Akhil Das
0 receivers (i.e. one receiver for each core), then I am > not experiencing any performance benefit from it. > > Is it something related to the bottleneck of MQ or Reliable Receiver? > > > > *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] > *Sent:* Saturday, June 13, 2015

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-16 Thread Akhil Das
gt; > > > Please correct me if the understanding is wrong, thanks again! > > > -- > > *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] > *Sent:* Monday, June 15, 2015 3:48 PM > *To:* Haopu Wang > *Cc:* user > *Subject:* Re: If n

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning. Thanks Best Regards On Mon, Jun 15, 2015 at 10:28 PM, Rex X wrote: > Thanks very much, Akhil. > > That solved my problem. > > Best, > Rex > > > > On Mon, Jun 15, 2015

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Akhil Das
Did you look inside all logs? Mesos logs and executor logs? Thanks Best Regards On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden wrote: > My Mesos cluster has 1.5 CPU and 17GB free. If I set: > > conf.set("spark.mesos.coarse", "true"); > conf.set("spark.cores.max", "1"); > > in the SparkConf object

Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Akhil Das
Whats in your executor (that .tgz file) conf/spark-default.conf file? Thanks Best Regards On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden wrote: > I'm loading these settings from a properties file: > spark.executor.memory=256M > spark.cores.max=1 > spark.shuffle.consolidateFiles=true > spark.task.c

Re: Spark History Server pointing to S3

2015-06-16 Thread Akhil Das
Not quite sure, but try pointing the spark.history.fs.logDirectory to your S3

Re: how to maintain the offset for spark streaming if HDFS is the source

2015-06-16 Thread Akhil Das
With Spark Streaming, when you use fileStream or textFileStream it will always pick up files from the directory whose timestamp is newer than the current timestamp, and if you have checkpointing enabled then it will start from the last read timestamp. So you may not need to maintain the line number. Tha
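A short sketch of that behaviour (the directory, checkpoint path, and batch interval are assumptions):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    ssc.checkpoint("hdfs:///checkpoints/hdfs-stream")    // lets a restarted job resume from the last read time
    val lines = ssc.textFileStream("hdfs:///incoming/")  // only files newer than the stream's clock are picked up
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()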

Re: Shuffle produces one huge partition

2015-06-17 Thread Akhil Das
Can you try repartitioning the RDD after creating the K,V pairs? Also, while calling rdd1.join(rdd2, ...), pass the number-of-partitions argument too. Thanks Best Regards On Wed, Jun 17, 2015 at 12:15 PM, Al M wrote: > I have 2 RDDs I want to Join. We will call them RDD A and RDD B. RDD A > has > 1 billio
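A sketch of both suggestions together (the key field and the partition count are assumptions):

    val left  = rddA.map(r => (r.id, r)).repartition(400)  // hypothetical id field used as the join key
    val right = rddB.map(r => (r.id, r))
    val joined = left.join(right, 400)                     // pass the partition count to the join as well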

Re: ClassNotFound exception from closure

2015-06-17 Thread Akhil Das
Not sure why spark-submit isn't shipping your project jar (maybe try with --jars). You can also do a sc.addJar("/path/to/your/project.jar"); it should solve it. Thanks Best Regards On Wed, Jun 17, 2015 at 6:37 AM, Yana Kadiyska wrote: > Hi folks, > > running into a pretty strange issue -- I have

Re: Machine Learning on GraphX

2015-06-18 Thread Akhil Das
This might give you a good start http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html; it's a bit old though. Thanks Best Regards On Thu, Jun 18, 2015 at 2:33 PM, texol wrote: > Hi, > > I'm new to GraphX and I'd like to use Machine Learning algorithms on top of >

Re: Web UI vs History Server Bugs

2015-06-18 Thread Akhil Das
You could possibly open up a JIRA and shoot an email to the dev list. Thanks Best Regards On Wed, Jun 17, 2015 at 11:40 PM, jcai wrote: > Hi, > > I am running this on Spark stand-alone mode. I find that when I examine the > web UI, a couple bugs arise: > > 1. There is a discrepancy between the

Re: understanding on the "waiting batches" and "scheduling delay" in Streaming UI

2015-06-18 Thread Akhil Das
Which version of Spark, and what is your data source? For some reason, your processing delay is exceeding the batch duration. And it's strange that you are not seeing any scheduling delay. Thanks Best Regards On Thu, Jun 18, 2015 at 7:29 AM, Mike Fang wrote: > Hi, > > > > I have a spark streamin

Re: connect mobile app with Spark backend

2015-06-18 Thread Akhil Das
Why not something like this: your mobile app pushes data to your webserver, which pushes the data to Kafka or Cassandra or any other database, and a Spark Streaming job runs all the time operating on the incoming data and pushes the calculated values back. This way, you don't have to start a spark
