Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
Please check out the Spark source from GitHub, and look here: https://github.com/apache/spark/tree/master/examples/src/main On Fri, Jul 24, 2015 at 8:43 PM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > Hi. > Can I know how to get such folder/code for spark implementation? > > On Sa

Re: Mesos + Spark

2015-07-24 Thread Dean Wampler
You can certainly start jobs without Chronos, but to automatically restart finished jobs or to run jobs at specific times or periods, you'll want something like Chronos. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) T

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
I'd recommend starting with a few of the code examples to get a sense of Spark usage (in the examples/ folder when you check out the code). Then, you can work through the Spark methods they call, tracing as deep as needed to understand the component you are interested in. You can also find an int

Small-cluster deployment modes

2015-07-24 Thread Edmon Begoli
Hey folks, I want to set up a single machine or a small cluster to run our Spark-based exploration lab. Does anyone have suggestions or metrics on the feasibility of running Spark standalone, without a resource manager, on a machine with a good amount of RAM (64GB) and SSDs? I expect one or two users a
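
For the simplest variant of this - one big box, no cluster manager at all - a minimal sketch (memory values are illustrative for a 64GB machine, and note the driver heap must be set at launch, not in code):

    import org.apache.spark.{SparkConf, SparkContext}

    // local[*] uses every core on the machine. The driver heap cannot be
    // changed from code once the JVM is running; pass it at launch instead,
    // e.g. spark-submit --driver-memory 48g, leaving headroom for the OS.
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("exploration-lab")
    val sc = new SparkContext(conf)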

Re: Parquet writing gets progressively slower

2015-07-24 Thread Cheng Lian
The time is probably spent by ParquetOutputFormat.commitJob. While committing a successful write job, Parquet writes a pair of summary files, containing metadata like schema, user-defined key-value metadata, and Parquet row group information. To gather all the necessary information, Parquet sca
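
A hedged sketch of one common mitigation from this era, assuming the standard parquet-hadoop switch is honored by the Parquet version in use: turn off summary-file generation so commitJob no longer has to aggregate every file's footer.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))

    // Ask Parquet not to write the _metadata/_common_metadata summary files.
    // How much of the commitJob cost this avoids depends on the Parquet version.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")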

Stop condition Spark reading from Kafka with ReliableKafkaReceiver

2015-07-24 Thread cas...@163.com
Hi All, I am writing test code for reading messages from a Kafka topic, using the ReliableKafkaReceiver. However, when I create the KafkaInputDStream and store all received messages in a foreachRDD action, I don't know when all the messages have been consumed completely. In other w

Re: Mesos + Spark

2015-07-24 Thread boci
Thanks. Mesos will show that the Spark driver is running, but what happens when my batch job finishes? How can I reschedule it without Chronos? Can I submit a job without starting it? Thanks b0c1

Re: Mesos + Spark

2015-07-24 Thread Dean Wampler
When running Spark in Mesos cluster mode, the driver program runs in one of the cluster nodes, like the other Spark processes that are spawned. You won't need a special node for this purpose. I'm not very familiar with Chronos, but its UI or the regular Mesos UI should show you where the driver is

Re: Mesos + Spark

2015-07-24 Thread boci
Thanks, but something is not clear... I have the Mesos cluster. - I want to submit my application and schedule it with Chronos. - For cluster mode I need a dispatcher; is this another container (a machine in the real world)? What will it do? Is it needed when I'm using Chronos? - How can I access my

Parquet writing gets progressively slower

2015-07-24 Thread Michael Kelly
Hi, We are converting some csv log files to parquet but the job is getting progressively slower the more files we add to the parquet folder. The parquet files are being written to s3, we are using a spark standalone cluster running on ec2 and the spark version is 1.4.1. The parquet files are par

Re: spark classpath issue duplicate jar with diff versions

2015-07-24 Thread Marcelo Vanzin
(bcc: user@spark, cc: cdh-user@cloudera) This is a CDH issue, so I'm moving it to the CDH mailing list. We're taking a look at how we're packaging dependencies so that these issues happen less when running on CDH. But in the meantime, instead of using "--jars", you could add the newer jar
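
The message is cut off before the concrete alternative; one plausible reading (an assumption, not necessarily what Marcelo went on to say) is to prepend the newer jar via the extraClassPath settings so it wins over the copy CDH ships:

    // Placeholder paths; the jar must exist at the same location on every
    // node, and in practice these are passed with spark-submit --conf, since
    // the driver classpath must be set before the driver JVM launches.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.driver.extraClassPath", "/opt/libs/httpcore-4.4.1.jar")
      .set("spark.executor.extraClassPath", "/opt/libs/httpcore-4.4.1.jar")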

spark classpath issue duplicate jar with diff versions

2015-07-24 Thread Shushant Arora
Hi, I am running a Spark Streaming app on YARN, using Apache httpasyncclient 4.1. This client jar internally depends on httpcore-4.4.1.jar. An older version of that jar, httpcore-4.2.5.jar, is also present in the classpath and has higher priority (coming earlier

50% performance decrease when using local file vs hdfs

2015-07-24 Thread Tom Hubregtsen
Hi, When running two experiments with the same application, we see a 50% performance difference between using HDFS and files on disk, both using the textFile/saveAsTextFile call. Almost all performance loss is in Stage 1. Input (in Stage 0): The file is read in using val input = sc.textFile(inpu

Re: Writing binary files in Spark

2015-07-24 Thread Oren Shpigel
Sorry, I didn't mention I'm using the Python API, which doesn't have the saveAsObjectFile method. Is there any alternative from Python? Also, I want to write the raw bytes of my object to files on disk, not use some serialization format to be read back into Spark. Is it possible? Any

How to maintain Windows of data along with maintaining session state using UpdateStateByKey

2015-07-24 Thread swetha
Hi, We have a requirement to maintain user session state and to maintain/update metrics at minute, hour, and day granularities for a user session in our Streaming job. How do we maintain the metrics over a window while also maintaining the session state using updateStateByKey in Spark Stream
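
No answer appears in this digest, but a minimal sketch of how the two pieces can coexist looks like the following; the Session type, the socket input, and all names are illustrative, not from the original thread.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SessionSketch {
      // Hypothetical session record.
      case class Session(eventCount: Long, lastSeen: Long)

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("sessions"), Seconds(10))
        ssc.checkpoint("/tmp/checkpoints") // required by updateStateByKey

        // Assumed input: "userId,timestamp" lines arriving on a socket.
        val events = ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(user, ts) = line.split(",")
          (user, ts.toLong)
        }

        // Fold each batch's events for a key into that key's session state.
        val sessions = events.updateStateByKey[Session] {
          (batch: Seq[Long], state: Option[Session]) =>
            if (batch.isEmpty) state
            else {
              val prev = state.getOrElse(Session(0L, 0L))
              Some(Session(prev.eventCount + batch.size, batch.max))
            }
        }
        sessions.print()

        // Windowed metrics run alongside the state: events per minute here;
        // hour/day granularities would use larger windows.
        events.countByWindow(Seconds(60), Seconds(10)).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }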

Re: Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Brandon White
Thanks. Sorry, the last section was supposed to be: streams.par.foreach { nameAndStream => nameAndStream._2.foreachRDD { rdd => val df = sqlContext.jsonRDD(rdd) df.insertInto(nameAndStream._1) } } ssc.start() On Fri, Jul 24, 2015 at 10:39 AM, Dean Wampler wrote: > You don't need the "par" (paral

Fwd: want to contribute to apache spark

2015-07-24 Thread shashank kapoor
Hi guys, I am new to Apache Spark and I want to start contributing to the project. But before that I need to understand the basic code flow. I read "How to contribute to Apache Spark" but I couldn't find a way to start reading the code and understanding how it flows. Can anyone tell me

Re: Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Dean Wampler
You don't need the "par" (parallel) versions of the Scala collections, actually. Recall that you are building a pipeline in the driver; it doesn't start running cluster tasks until ssc.start() is called, at which point Spark will figure out the task parallelism. In fact, you might as well do th
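
A minimal sketch of the sequential wiring being described, with placeholder (table, directory) pairs standing in for the real 500:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("many-streams"), Minutes(10))
    val sqlContext = new SQLContext(ssc.sparkContext)

    val paths = Seq(("events_a", "/data/a"), ("events_b", "/data/b"))

    // Plain sequential wiring: this only builds the DStream graph on the driver.
    paths.foreach { case (table, dir) =>
      ssc.textFileStream(dir).foreachRDD { rdd =>
        sqlContext.jsonRDD(rdd).insertInto(table)
      }
    }

    ssc.start() // only now does Spark schedule cluster tasks, in parallel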

Re: How to restart Twitter spark stream

2015-07-24 Thread Zoran Jeremic
Hi Akhil, That's exactly what I needed. You saved my day :) Thanks a lot, Best, Zoran On Fri, Jul 24, 2015 at 12:28 AM, Akhil Das wrote: > Yes, that is correct, sorry for confusing you. But i guess this is what > you are looking for, let me know if that doesn't help: > > val filtered_statuses

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Stevo Slavić
Sorry, wrong ML. On Fri, Jul 24, 2015 at 7:07 PM, Cody Koeninger wrote: > Are you intending to be mailing the spark list or the kafka list? > > On Fri, Jul 24, 2015 at 11:56 AM, Stevo Slavić wrote: > >> Hello Cody, >> >> I'm not sure we're talking about same thing. >> >> Since you're mentioning

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Cody Koeninger
Are you intending to be mailing the spark list or the kafka list? On Fri, Jul 24, 2015 at 11:56 AM, Stevo Slavić wrote: > Hello Cody, > > I'm not sure we're talking about same thing. > > Since you're mentioning streams I guess you were referring to current high > level consumer, while I'm talkin

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Stevo Slavić
Hello Cody, I'm not sure we're talking about same thing. Since you're mentioning streams I guess you were referring to current high level consumer, while I'm talking about new yet unreleased high level consumer. Poll I was referring to is https://github.com/apache/kafka/blob/trunk/clients/src/ma

Re: New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Cody Koeninger
Well... there are only 2 hard problems in computer science: naming things, cache invalidation, and off-by-one errors. The direct stream implementation isn't asking you to "commit" anything. It's asking you to provide a starting point for the stream on startup. Because offset ranges are inclusive
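
A sketch of the convention Cody describes, written against the 0.9-style new consumer API (which was still in flux on trunk at the time, so treat signatures as approximate): after processing the record at offset n, the offset to commit is n + 1, the position of the next record to read.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("group.id", "example-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("topic"))

    val records = consumer.poll(1000).iterator()
    while (records.hasNext) {
      val record = records.next()
      // ... process record.value() ...
      // Commit the position of the *next* record, not the one just processed.
      consumer.commitSync(Collections.singletonMap(
        new TopicPartition(record.topic, record.partition),
        new OffsetAndMetadata(record.offset + 1)))
    }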

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-24 Thread Arun Ahuja
Thanks for the additional info. I tried to follow that and went ahead and added netlib directly to my application POM/JAR - that should be sufficient to make it work? And it is at least definitely on the executor classpath? Still got the same warning, so I'm not sure where else to take it. Thanks
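
One quick diagnostic (a common netlib-java check, not something from this thread): ask netlib which BLAS implementation it actually loaded on the executor.

    import com.github.fommil.netlib.BLAS

    // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when the native
    // libraries were found, or com.github.fommil.netlib.F2jBLAS when it fell
    // back to pure Java - the case in which Spark logs the warning above.
    println(BLAS.getInstance().getClass.getName)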

Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Brandon White
Hello, So I have about 500 Spark Streams and I want to know the fastest and most reliable way to process each of them. Right now, I am creating and processing them in a list: val ssc = new StreamingContext(sc, Minutes(10)) val streams = paths.par.map { nameAndPath => (nameAndPath._1, ssc.textFileStream

New consumer - offset one gets in poll is not offset one is supposed to commit

2015-07-24 Thread Stevo Slavić
Hello Apache Kafka community, Say there is only one topic with a single partition and a single message on it. Calling poll with the new consumer will return a ConsumerRecord for that message, with an offset of 0. After processing the message, the current KafkaConsumer implementation expects o

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-24 Thread Yana Kadiyska
That is pretty odd -- toMap not being there would be from Scala... but what is even weirder is that toMap is positively executed on the driver machine, which is the same whether you use spark-shell or spark-submit... does it work if you run with --master local[*]? Also, you can try to put a set -x in b

Spark: configuration file 'metrics.properties'

2015-07-24 Thread allonsy
Hi, Spark's configuration file (useful for retrieving metrics), namely conf/metrics.properties, states the following: Within an instance, a "source" specifies a particular set of grouped metrics. There are two kinds of sources: 1. Spark internal sources, like MasterSource, WorkerSource, etc,
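
For reference, a minimal sketch in the same format as the conf/metrics.properties.template shipped with Spark (values are illustrative):

    # Enable the CSV sink for all instances, polling every 10 seconds,
    *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    *.sink.csv.period=10
    *.sink.csv.directory=/tmp/spark-metrics
    # and expose JVM source metrics on the master and worker instances.
    master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource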

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Ajay Singal
Hi Joji, To my knowledge, Spark does not offer any such function. I agree, defining a function to find an open (random) port would be a good option. However, in order to invoke the corresponding SparkUI one needs to know this port number. Thanks, Ajay On Fri, Jul 24, 2015 at 10:19 AM, Joji Jo

Re: Facing problem in Oracle VM Virtual Box

2015-07-24 Thread Ajay Singal
Hi Chintan, This is more of an Oracle VirtualBox virtualization issue than a Spark issue. VT-x is hardware-assisted virtualization, and it is required by Oracle VirtualBox for all 64-bit guests. The error message indicates that either your processor does not support VT-x (but your VM is configure

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Joji John
Thanks Ajay. The way we wrote our Spark application is that we have generic Python code, multiple instances of which can be called with different parameters. Does Spark offer any function to bind it to an available port? I guess the other option is to define a function to find an open port and

Broadcast HashMap much slower than Array

2015-07-24 Thread huanglr
Hi, When I try to broadcast a HashMap, it runs much slower than the same data broadcast as an Array. It hangs at "SparkContext: Created broadcast 0" for a few seconds (~30s), while an Array does not. The broadcast dataset is about 1G. best! huanglr

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Ajay Singal
Hi Joji, I guess there is no hard limit on the number of Spark applications running in parallel. However, you need to ensure that you do not use the same (e.g., default) port numbers for each application. In your specific case, for example, if you try using the default SparkUI port "4040" for more than
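
One partial workaround worth noting (my addition, not from the thread): the UI port can be set per application, and in the Spark versions I have used a value of 0 asks the OS for a free ephemeral port; the "16 retries" in the error corresponds to spark.port.maxRetries.

    val conf = new org.apache.spark.SparkConf()
      .setAppName("second-app")
      // A distinct fixed port per application, or "0" to let the OS pick one.
      .set("spark.ui.port", "4041")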

getting Error while Running SparkPi program

2015-07-24 Thread Jeetendra Gangele
While running the command below, I'm getting the following error in the YARN log. Has anybody hit this issue? ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster lib/spark-examples-1.4.1-hadoop2.6.0.jar 10 2015-07-24 12:06:10,846 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.

Performance questions regarding Spark 1.3 standalone mode

2015-07-24 Thread Khaled Ammar
Hi all, I have a standalone spark cluster setup on EC2 machines. I did the setup manually without the ec2 scripts. I have two questions about Spark/GraphX performance: 1) When I run the PageRank example, the storage tab does not show that all RDDs are cached. Only one RDD is 100% cached, but the

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-24 Thread Cody Koeninger
It's really a question of whether you need access to the MessageAndMetadata, or just the key / value from the message. If you just need the key/value, dstream map is fine. In your case, since you need to be able to control a possible failure when deserializing the message from the MessageAndMetad
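
A sketch of the messageHandler variant for the Spark 1.4-era direct stream; parse, Record, ssc, kafkaParams, and fromOffsets are assumed from the surrounding application, and only the wiring is shown.

    import kafka.message.MessageAndMetadata
    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Deserialization failures become None instead of killing the task.
    val messageHandler = (mmd: MessageAndMetadata[Array[Byte], Array[Byte]]) =>
      try Some(parse(mmd.message())) catch { case _: Exception => None }

    val stream = KafkaUtils.createDirectStream[
        Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder, Option[Record]](
      ssc, kafkaParams, fromOffsets, messageHandler)

    // Corrupt messages arrive as None and can simply be dropped downstream.
    val good = stream.flatMap(_.toSeq)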

suggest coding platform

2015-07-24 Thread Saif.A.Ellafi
Hi all, I tried the Zeppelin notebook (Apache incubator), but I am not completely happy with it. What do you use for coding? Anything with auto-complete, proper warning logs, and perhaps some colored syntax. My platform is Linux, so anything with a notebook studio, or perhaps a Windows IDE with

spark-ec2 credentials using aws_security_token

2015-07-24 Thread jan.zikes
Hi, I would like to ask whether it is currently possible to use the spark-ec2 script with credentials that consist not only of aws_access_key_id and aws_secret_access_key but also contain an aws_security_token. When I try to run the script I get the following error message:

Re: Spark - Eclipse IDE - Maven

2015-07-24 Thread Naveen Madhire
You can use IntelliJ for Scala. There are many articles online which you can refer to for setting up IntelliJ and the Scala plugin. Thanks On Friday, July 24, 2015, Siva Reddy wrote: > I want to program in scala for spark. > > > > -- > View this message in context: > http://apache-spark-user-list.10

Spark Accumulator Issue - java.io.IOException: java.lang.StackOverflowError

2015-07-24 Thread Jadhav Shweta
Hi, I am trying a transformation that calls a Scala method returning a MutableList[AvroObject]: def processRecords(id: String, list1: Iterable[(String, GenericRecord)]): scala.collection.mutable.MutableList[AvroObject] Hence, the output of the transformation is RDD[MutableList[AvroOb

How do I query a DSE Cassandra table using Spark Job Server

2015-07-24 Thread rsaggere
I am trying to use the Spark Job Server to run a query against a DSE Cassandra table. Can anyone please help me with instructions? I am *NOT* a Java person. What I have done so far: 1. I have a single-node DataStax Cassandra ver 4.6 running on a CentOS 6.6 VM 2. Tested "dse spark" to query table

ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Joji John
Hi, I am getting this error for some of my Spark applications. I have multiple Spark applications running in parallel. Is there a limit on the number of Spark applications that I can run in parallel? ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service '

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-24 Thread Rok Roskar
Hi Akhil, the namenode is definitely configured correctly, otherwise the job would not start at all. It registers with YARN and starts up, but once the nodes try to communicate to each other it fails. Note that a hadoop MR job using the identical configuration executes without any problems. The dr

Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
Your guess is partly right. Firstly, Spark SQL doesn’t have an equivalent data type to Parquet BINARY (ENUM), and always falls back to normal StringType. So in your case, Spark SQL just see a StringType, which maps to Parquet BINARY (UTF8), but the underlying data type is BINARY (ENUM). Secon

Encryption on RDDs or in-memory on Apache Spark

2015-07-24 Thread IASIB1
I am currently working on the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory (similarly to Altibase's HDB: http://altibase.com/in-memory-database-computing-solutions/security/

Re: Spark - Eclipse IDE - Maven

2015-07-24 Thread Siva Reddy
I want to program in Scala for Spark.

Re: writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-24 Thread Cheng Lian
I don't think this is a bug either. For an empty JSON array [], there's simply no way to infer its actual data type, and in this case Spark SQL just tries to fill in the "safest" type, which is StringType, because basically you can cast any data type to StringType. In general, schema inf
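
A small illustration of the fallback (sc and sqlContext assumed; the printed schema is approximate and may vary by Spark version):

    val df = sqlContext.jsonRDD(sc.parallelize(Seq("""{"a": []}""")))
    df.printSchema()
    // root
    //  |-- a: array (nullable = true)
    //  |    |-- element: string (containsNull = true)
    // The element type is string only because nothing better could be
    // inferred from an empty array.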

Re: Performance issue with Spak's foreachpartition method

2015-07-24 Thread Bagavath
Try using insert instead of merge. Typically we use insert append to do bulk inserts to Oracle. On Thu, Jul 23, 2015 at 1:12 AM, diplomatic Guru wrote: > Thanks Robin for your reply. > > I'm pretty sure that writing to Oracle is taking longer, as writing to > HDFS only takes ~5 minutes.

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-24 Thread Nicolas Phung
Hello, I managed to read all my data back by skipping the offset that contains a corrupt message. I have one more question regarding best practices for the messageHandler method vs dstream.foreachRDD.map vs dstream.map.foreachRDD. I'm using a function to read the serialized message from Kafka and convert it

Re: What if request cores are not satisfied

2015-07-24 Thread Akhil Das
I guess it would wait for some time and throw up something like this: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory Thanks Best Regards On Thu, Jul 23, 2015 at 7:53 AM, bit1...@163.com wrote: > Hi, > Assume a

Re: How to restart Twitter spark stream

2015-07-24 Thread Akhil Das
Yes, that is correct, sorry for confusing you. But I guess this is what you are looking for; let me know if that doesn't help: val filtered_statuses = stream.transform(rdd =>{ //Instead of hardcoding, you can fetch these from MySQL or a file or whatever val sampleHashTags = Array("#