Aggregate by timestamp from json message

2015-08-06 Thread vchandra
Hi team, I am very new to Spark; actually, today is my first day. I have a nested JSON string which contains a timestamp and a lot of other details. I have JSON messages from which I need to write multiple aggregations, but for now I need to write one aggregation. If code structu
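
A minimal sketch of one such aggregation (not from the thread; the nested field name payload.timestamp and the input path are hypothetical placeholders - adjust them to the actual schema):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JsonAggregation {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("json-agg"))
        val sqlContext = new SQLContext(sc)

        // Spark SQL infers the nested schema directly from the JSON messages.
        val df = sqlContext.read.json("hdfs:///path/to/json/messages")

        // One aggregation to start with: number of messages per timestamp.
        val counts = df.groupBy("payload.timestamp").count()
        counts.show()
      }
    }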

Re: No Twitter Input from Kafka to Spark Streaming

2015-08-06 Thread Akhil Das
You just pasted your Twitter credentials, consider changing them. :/ Thanks Best Regards On Wed, Aug 5, 2015 at 10:07 PM, narendra wrote: > Thanks Akash for the answer. I added endpoint to the listener and now it is > working. > > > > -- > View this message in context: > http://apache-spark-user-

Re: subscribe

2015-08-06 Thread Akhil Das
Welcome aboard! Thanks Best Regards On Thu, Aug 6, 2015 at 11:21 AM, Franc Carter wrote: > subscribe >

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Philip Weaver
I built spark from the v1.5.0-snapshot-20150803 tag in the repo and tried again. The initialization time is about 1 minute now, which is still pretty terrible. On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver wrote: > Absolutely, thanks! > > On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian wrote: > >>

Multiple Thrift servers on one Spark cluster

2015-08-06 Thread Bojan Kostic
Hi, Is there a way to instantiate multiple Thrift servers on one Spark Cluster? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Thrift-servers-on-one-Spark-cluster-tp24148.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Enum values in custom objects mess up RDD operations

2015-08-06 Thread Warfish
Hi everyone, I have been working with Spark for a little while now and have encountered a very strange behaviour that caused me a lot of headaches: I have written my own POJOs to encapsulate my data, and this data is held in some JavaRDDs. Part of these POJOs is a member variable of a custom enum type.

Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Igor Berman
An enum's hashCode is JVM-instance specific (i.e. different JVMs will give you different values), so you can use the ordinal in the hashCode computation, or use the hashCode of the enum's ordinal as part of the hashCode computation. On 6 August 2015 at 11:41, Warfish wrote: > Hi everyone, > > I was working with Spark for a

Re: subscribe

2015-08-06 Thread Ted Yu
See http://spark.apache.org/community.html Cheers > On Aug 5, 2015, at 10:51 PM, Franc Carter > wrote: > > subscribe

Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Sebastian Kalix
Thanks a lot Igor, the following hashCode function is stable: @Override public int hashCode() { int hash = 5; hash = 41 * hash + this.myEnum.ordinal(); return hash; } For anyone having the same problem: http://tech.technoflirt.com/2014/08/21/issues-with-java-e

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-08-06 Thread Ted Yu
What is the JIRA number if a JIRA has been logged for this ? Thanks > On Jan 20, 2015, at 11:30 AM, Cheng Lian wrote: > > Hey Yi, > > I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like > to investigate this issue later. Would you please open an JIRA for it? Thanks

Re: Pause Spark Streaming reading or sampling streaming data

2015-08-06 Thread Dimitris Kouzis - Loukas
Hi - yes - it's great that you wrote it yourself - it means you have more control. I have the feeling that the most efficient point to discard as much data as possible - or even to modify your subscription protocol so that your Spark input source does not even receive the other 50 seconds of data - is the mo

Re: Pause Spark Streaming reading or sampling streaming data

2015-08-06 Thread Dimitris Kouzis - Loukas
Re-reading your description - I guess you could potentially make your input source connect for 10 seconds, pause for 50 and then reconnect. On Thu, Aug 6, 2015 at 10:32 AM, Dimitris Kouzis - Loukas wrote: > Hi, - yes - it's great that you wrote it yourself - it means you have more > control.

How can I know currently supported functions in Spark SQL

2015-08-06 Thread Netwaver
Hi All, I am using Spark 1.4.1, and I want to know how I can find the complete list of functions supported in Spark SQL; currently I only know 'sum', 'count', 'min', 'max'. Thanks a lot.

Re: spark job not accepting resources from worker

2015-08-06 Thread Kushal Chokhani
Any inputs? In case of the following message, is there a way to check through some logs which resource is not sufficient? [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered

Re: Re: Real-time data visualization with Zeppelin

2015-08-06 Thread jun
Hi andy, Is there any method to convert an IPython notebook file (.ipynb) to a Spark Notebook file (.snb) or vice versa? BR Jun At 2015-07-13 02:45:57, "andy petrella" wrote: Heya, You might be looking for something like this I guess: https://www.youtube.com/watch?v=kB4kRQRFAVc. The Spark-No

Re: spark job not accepting resources from worker

2015-08-06 Thread Kushal Chokhani
Figured out the root cause. The master was randomly assigning a port to the worker for communication. Because of the firewall on the master, the worker couldn't send messages to the master (such as resource details). Oddly, the worker didn't even bother to throw any error. On 8/6/2015 3:24 PM, Kushal Chokhan

Re: Re: Real-time data visualization with Zeppelin

2015-08-06 Thread andy petrella
Yep, most of the things will work just by renaming it :-D You can even use nbconvert afterwards On Thu, Aug 6, 2015 at 12:09 PM jun wrote: > Hi andy, > > Is there any method to convert ipython notebook file(.ipynb) to spark > notebook file(.snb) or vice versa? > > BR > Jun > > At 2015-07-13 02:

how to stop twitter-spark streaming

2015-08-06 Thread Sadaf
Hi All, I am working with Spark Streaming and Twitter's user API. I used this code to stop streaming: ssc.addStreamingListener(new StreamingListener{ var count=1 override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) { count += 1 if

Re: Memory allocation error with Spark 1.5

2015-08-06 Thread Alexis Seigneurin
Works like a charm. Thanks Reynold for the quick and efficient response! Alexis 2015-08-05 19:19 GMT+02:00 Reynold Xin : > In Spark 1.5, we have a new way to manage memory (part of Project > Tungsten). The default unit of memory allocation is 64MB, which is way too > high when you have 1G of mem

Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Ted Yu
Have you looked at this? http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$ > On Aug 6, 2015, at 2:52 AM, Netwaver wrote: > > Hi All, > I am using Spark 1.4.1, and I want to know how can I find the > complete function list supported in Spark SQL,

Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Todd Nist
They are covered here in the docs: http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wrote: > Hi All, > I am using Spark 1.4.1, and I want to know how can I find the > complete function list supported in Sp
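
A tiny usage sketch of that functions object (the DataFrame df and its column names here are made up for illustration):

    import org.apache.spark.sql.functions._

    // Beyond sum/count/min/max there are avg, countDistinct, string,
    // date and math functions, all listed on the page above.
    val stats = df.groupBy("department").agg(
      sum("salary"),
      avg("salary"),
      min("salary"),
      max("salary"),
      countDistinct("employee_id"))
    stats.show()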

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Daniel Haviv
Thank you Todd, How is the sparkstreaming-sql project different from starting a thrift server on a streaming app ? Thanks again. Daniel On Thu, Aug 6, 2015 at 1:53 AM, Todd Nist wrote: > Hi Danniel, > > It is possible to create an instance of the SparkSQL Thrift server, > however seems like th

Spark-submit not finding main class and the error reflects different path to jar file than specified

2015-08-06 Thread Stephen Boesch
Given the following command line to spark-submit: bin/spark-submit --verbose --master local[2]--class org.yardstick.spark.SparkCoreRDDBenchmark /shared/ysgood/target/yardstick-spark-uber-0.0.1.jar Here is the output: NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahea

Re: Combining Spark Files with saveAsTextFile

2015-08-06 Thread MEETHU MATHEW
Hi, try using coalesce(1) before calling saveAsTextFile(). Thanks & Regards, Meethu M On Wednesday, 5 August 2015 7:53 AM, Brandon White wrote: What is the best way to make saveAsTextFile save as only a single file?
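
A minimal sketch of that suggestion (the RDD and output path are placeholders; coalesce(1) funnels all data through a single task, so it only makes sense when the result fits comfortably on one executor):

    // Collapse to a single partition so saveAsTextFile writes one part file.
    val singleFile = rdd.coalesce(1)
    singleFile.saveAsTextFile("hdfs:///tmp/output-single-file")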

Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread venkatesh b
Hi, here I have two things to ask. FIRST: In our project we use Hive. We get new data daily. We need to process this new data only once, and send the processed data to an RDBMS. In the processing we mostly use many complex queries with joins, where conditions and grouping functions. There are ma

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Todd Nist
Well the creation of a thrift server would be to allow external access to the data from JDBC / ODBC type connections. The sparkstreaming-sql leverages a standard spark sql context and then provides a means of converting an incoming dstream into a row, look at the MessageToRow trait in KafkaSource

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread Jörn Franke
Yes, you should use ORC; it is much faster and more compact. Additionally you can apply compression (snappy) to increase performance. Your data processing pipeline seems to be not very optimized. You should use the newest Hive version, enabling storage indexes and bloom filters on appropriate columns.
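
If the processing happens to be driven from Spark rather than plain Hive, a rough sketch of writing a result as ORC (the query, table and path are hypothetical; ORC output needs a HiveContext):

    import org.apache.spark.sql.hive.HiveContext

    // ORC support in Spark lives in the Hive module, hence the HiveContext.
    val hiveContext = new HiveContext(sc)
    val processed = hiveContext.sql(
      "SELECT id, SUM(amount) AS total FROM staging_table GROUP BY id")
    processed.write.format("orc").save("hdfs:///warehouse/processed_orc")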

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread Jörn Franke
Additionally, it is of key importance to use the right data types for the columns. Use int for ids, and int, decimal, float or double etc. for numeric values. A bad data model using varchars and strings where not appropriate is a significant bottleneck. Furthermore include partition columns in

Re: Upgrade of Spark-Streaming application

2015-08-06 Thread Shushant Arora
Also, in the fromOffsets API, is the last saved offset fetched twice? Does the fromOffsets API start from the Map's Long value or from LongValue+1? If it starts from the Long value, it will be fetched twice - once in the last application's run before the crash and once in the first run after the crash. On Thu, Aug 6, 2015 at 9:05 AM, Shushant

SparkR -Graphx

2015-08-06 Thread smagadi
I wanted to use GraphX from SparkR; is there a way to do it? I think as of now it is not possible. I was wondering if one can write a wrapper in R that can call the Scala GraphX libraries. Any thoughts on this, please. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabbl

SparkException: Yarn application has already ended

2015-08-06 Thread Clint McNeil
Hi I am trying to launch a Spark application on a CM cluster and I get the following error. Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master. at org.apache.spark.scheduler.cluster.Yarn

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread venkatesh b
I'm really sorry, I posted to the Spark mailing list by mistake. Jörn Franke, thanks for your reply. I have many joins, many complex queries, and all are table scans. So I think HBase will not work for me. On Thursday, August 6, 2015, Jörn Franke wrote: > Additionally it is of key importance to use th

Re: Reliable Streaming Receiver

2015-08-06 Thread Cody Koeninger
You should be able to recompile the streaming-kafka project against 1.2; let me know if you run into any issues. From a usability standpoint, the only relevant thing I can think of that was added after 1.2 was being able to get the partitionId off of the task context... you can just use mapPartit

Re: Upgrade of Spark-Streaming application

2015-08-06 Thread Cody Koeninger
Do the cast to HasOffsetRanges before calling any other methods on the direct stream. This is covered in the documentation: http://spark.apache.org/docs/latest/streaming-kafka-integration.html If you want to use fromOffsets, you can also just grab the highest available offsets from Kafka and pro
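
The documented pattern looks roughly like this (a sketch, not the poster's code; ssc, kafkaParams and topics are assumed to be defined elsewhere):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // Cast before any transformation changes the RDD type.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd, then persist offsetRanges to your own store ...
      offsetRanges.foreach(o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
    }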

Temp file missing when training logistic regression

2015-08-06 Thread Cat
Hello, I am using the Python API to perform a grid search and train models using LogisticRegressionWithSGD. I am using r3.xl machines in EC2, running on top of YARN in cluster mode. The training RDD is persisted in memory and on disk. Some of the models train successfully, but then at some poi

How to binarize data in spark

2015-08-06 Thread Adamantios Corais
I have a set of data based on which I want to create a classification model. Each row has the following form: user1,class1,product1 > user1,class1,product2 > user1,class1,product5 > user2,class1,product2 > user2,class1,product5 > user3,class2,product1 > etc There are about 1M users, 2 classes, a

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Cheng Lian
Would you mind providing the driver log? On 8/6/15 3:58 PM, Philip Weaver wrote: I built spark from the v1.5.0-snapshot-20150803 tag in the repo and tried again. The initialization time is about 1 minute now, which is still pretty terrible. On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver

Terminate streaming app on cluster restart

2015-08-06 Thread Alexander Krasheninnikov
Hello, everyone! I have a case when running a standalone cluster: when stop-all.sh/start-all.sh are invoked on the master, the streaming app loses all its executors but does not terminate. Since it is a streaming app, expected to deliver its results ASAP, any downtime is undesirable. Is there any workaround to

Error while using ConcurrentHashMap in Spark Streaming

2015-08-06 Thread UMESH CHAUDHARY
The scenario is: - I have a map with country-code as key and count as value (initially the count is 0) - In DStream.foreachRDD I need to update the count for the country in the map with the new aggregated value. I am doing: transient Map aggregationMap=new ConcurrentHashMap(); Integer requestCountPe

RE: Execption while using kryo with broadcast

2015-08-06 Thread Shuai Zheng
Hi, I have exactly the same issue on Spark 1.4.1 (on EMR, latest default AMI 4.0), run as YARN client. After wrapping with another Java HashMap, the exception disappears. But may I know what the right solution is? Has any JIRA ticket been created for this? I want to monitor it, even if it could be bypass b

Re: SparkR -Graphx

2015-08-06 Thread Shivaram Venkataraman
+Xiangrui I am not sure exposing the entire GraphX API would make sense as it contains a lot of low level functions. However we could expose some high level functions like PageRank etc. Xiangrui, who has been working on similar techniques to expose MLLib functions like GLM might have more to add.

Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-06 Thread SK
Hi, I need to access data on S3 from another account and I have been given the IAM role information to access that S3 bucket. From what I understand, AWS allows us to attach a role to a resource at the time it is created. However, I don't see an option for specifying the role using the spark_ec2.p

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Philip Weaver
With DEBUG, the log output was over 10MB, so I opted for just INFO output. The (sanitized) log is attached. The driver is essentially this code: info("A") val t = System.currentTimeMillis val df = sqlContext.read.parquet(dir).select(...).cache val elapsed = System.currentTimeMil

Re: Error while using ConcurrentHashMap in Spark Streaming

2015-08-06 Thread Ted Yu
bq. aggregationMap.put(countryCode,requestCountPerCountry+1); If NPE came from the above line, maybe requestCountPerCountry was null ? Cheers On Thu, Aug 6, 2015 at 8:54 AM, UMESH CHAUDHARY wrote: > Scenario is: > >- I have a map of country-code as key and count as value (initially >co
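
A null-safe variant of that update, as a sketch (the names follow the thread; types are added for illustration):

    import java.util.concurrent.ConcurrentHashMap

    val aggregationMap = new ConcurrentHashMap[String, Integer]()

    def increment(countryCode: String, delta: Int): Unit = {
      // get() returns null for an absent key; never unbox it directly.
      val current = aggregationMap.get(countryCode)
      val base = if (current == null) 0 else current.intValue()
      aggregationMap.put(countryCode, base + delta)
    }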

Re: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-06 Thread Steve Loughran
There's no support for IAM roles in the s3n:// client code in Apache Hadoop (HADOOP-9384); Amazon's modified EMR distro may have it. The s3a filesystem adds it; this is ready for production use in Hadoop 2.7.1+ (implicitly HDP 2.3; CDH 5.4 has cherry-picked the relevant patches). I don't k
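
A sketch of what reading through s3a looks like from Spark, assuming the hadoop-aws (s3a) jars are on the classpath and the EC2 instances carry the IAM instance profile, so no access keys are configured in code (bucket and path are placeholders):

    // With an instance profile attached, s3a picks up credentials from the
    // EC2 instance metadata service instead of explicit keys.
    val df = sqlContext.read.parquet("s3a://other-account-bucket/path/to/data")
    println(df.count())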

Removing empty partitions before we write to HDFS

2015-08-06 Thread gpatcham
Is there a way to filter out empty partitions before I write to HDFS, other than using repartition and coalesce? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Removing-empty-partitions-before-we-write-to-HDFS-tp24156.html Sent from the Apache Spark User Lis

Spark Kinesis Checkpointing/Processing Delay

2015-08-06 Thread phibit
Hi! I'm using Spark + Kinesis ASL to process and persist stream data to ElasticSearch. For the most part it works nicely. There is a subtle issue I'm running into about how failures are handled. For example's sake, let's say I am processing a Kinesis stream that produces 400 records per second, c

Re: Spark Kinesis Checkpointing/Processing Delay

2015-08-06 Thread Patanachai Tangchaisin
Hi, I actually ran into the same problem, although our endpoint is not ElasticSearch. When the Spark job dies, we lose some data because the Kinesis checkpoint is already beyond the last point that Spark has processed. Currently, our workaround is to use Spark's checkpoint mechanism with write
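
A rough sketch of that workaround (app name, path and batch interval are placeholders): enable the receiver write-ahead log and checkpointing so records pulled from Kinesis survive a crash between receipt and processing:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kinesis-to-es")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/kinesis-to-es")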

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Patanachai Tangchaisin
Currently, I use rdd.isEmpty() Thanks, Patanachai On 08/06/2015 12:02 PM, gpatcham wrote: Is there a way to filter out empty partitions before I write to HDFS other than using reparition and colasce ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Rem
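
For reference, a sketch of that guard (outputPath is a placeholder); note that rdd.isEmpty() checks whether the whole RDD has any elements, so this skips an entirely empty write rather than dropping individual empty partitions:

    // Only write when there is something to write.
    if (!rdd.isEmpty()) {
      rdd.saveAsTextFile(outputPath)
    }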

Re: log4j.xml bundled in jar vs log4.properties in spark/conf

2015-08-06 Thread mlemay
I'm having the same problem here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/log4j-xml-bundled-in-jar-vs-log4-properties-in-spark-conf-tp23923p24158.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-06 Thread mlemay
Hi, I'm trying to use a custom log4j appender in my log4j.properties. It was working perfectly under Spark 1.3.1 but it is now broken. The appender is shaded/bundled in my fat-jar. Note: I've seen that Spark 1.3.1 is using a different class loader. See my SO post: http://stackoverflow.com/ques

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
Well, I tried this approach, and still have issues. Apparently TestHive cannot delete the Hive metastore directory. The complete error that I have is: 15/08/06 15:01:29 ERROR Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metada

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Richard Marscher
Not that I'm aware of. We ran into a similar "issue" where we didn't want to keep accumulating all these empty part files in storage on S3 or HDFS. There didn't seem to be any performance-free way to do it with an RDD, so we just run a non-Spark post-batch operation to delete empty files from the
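
A hypothetical version of such a post-batch cleanup, using the Hadoop FileSystem API (the output directory is a placeholder, and the zero-length heuristic would need adjusting for compressed or non-text outputs):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.listStatus(new Path("hdfs:///tmp/output"))
      .filter(s => s.getPath.getName.startsWith("part-") && s.getLen == 0)
      .foreach(s => fs.delete(s.getPath, false))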

Spark-Grid Engine light integration writeup

2015-08-06 Thread David Chin
Hello, all: I was able to get Spark 1.4.1 and 1.2.0 standalone to run within a Univa Grid Engine cluster, with some modification to the appropriate sbin scripts. My write-up is at: http://linuxfollies.blogspot.com/2015/08/apache-spark-integration-with-grid.html I'll be glad to get comments from

Spark Job Failed (Executor Lost & then FS closed)

2015-08-06 Thread ๏̯͡๏
Code: import java.text.SimpleDateFormat import java.util.Calendar import java.sql.Date import org.apache.spark.storage.StorageLevel def extract(array: Array[String], index: Integer) = { if (index < array.length) { array(index).replaceAll("\"", "") } else { "" } } case class GuidSes

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
Well, I managed to solve that issue after running my tests on a Linux system instead of Windows (which I was originally using). However, now I have an error when I try to reset the Hive context using hc.reset(). It tries to create a file inside the directory /user/my_user_name instead of the usual linu

All masters are unresponsive! Giving up.

2015-08-06 Thread Jeff Jones
I wrote a very simple Spark 1.4.1 app that I can run through a local driver program just fine using setMaster("local[*]"). The app is as follows: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.rdd.RDD object

Spark-submit fails when jar is in HDFS

2015-08-06 Thread Alan Braithwaite
Hi All, We're trying to run spark with mesos and docker in client mode (since mesos doesn't support cluster mode) and load the application Jar from HDFS. The following is the command we're running: /usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf spark.mesos.executor.d

Re: Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Netwaver
Thanks for your kind help At 2015-08-06 19:28:10, "Todd Nist" wrote: They are covered here in the docs: http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wrote: Hi All, I am using Spark 1.4.1,

Re: How to binarize data in spark

2015-08-06 Thread praveen S
Use StringIndexer in MLlib 1.4: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais < adamantios.cor...@gmail.com> wrote: > I have a set of data based on which I want to create a classification > model. Each

Re: Spark MLib v/s SparkR

2015-08-06 Thread praveen S
I am starting off with classification models: Logistic Regression, Random Forest. Basically I wanted to learn machine learning. Since I have a Java background I started off with MLlib, but later heard R works as well (with scaling issues, though). So, with SparkR, I was wondering whether the scaling issue would be resolved

stopping spark stream app

2015-08-06 Thread Shushant Arora
Hi, I am using Spark Streaming 1.3 and using a custom checkpoint to save Kafka offsets. 1. Is doing Runtime.getRuntime().addShutdownHook(new Thread() { @Override public void run() { jssc.stop(true, true); System.out.println("Inside Add Shutdown Hook"); } }); to handle stop safe? 2. And

Re: How to binarize data in spark

2015-08-06 Thread Yanbo Liang
I think you want to flatten the 1M products to a vector of 1M elements, most of which are of course zero. It looks like HashingTF can help you. 2015-08-07 11:02 GMT+08:00 praveen S : > Use StringIndexer in MLib1.4 : > > htt
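
A sketch of the HashingTF idea applied to the data described earlier in the thread (the input path and numFeatures value are assumptions; rows look like "user1,class1,product1"):

    import org.apache.spark.mllib.feature.HashingTF

    val lines = sc.textFile("hdfs:///data/user_class_product.csv")
    val perUser = lines.map(_.split(","))
      .map { case Array(user, clazz, product) => ((user, clazz), product) }
      .groupByKey()

    // One sparse vector of hashed product "terms" per (user, class) pair.
    val hashingTF = new HashingTF(numFeatures = 1 << 20)
    val vectors = perUser.map { case ((user, clazz), products) =>
      (user, clazz, hashingTF.transform(products))
    }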

Re: Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Pedro Rodriguez
Worth noting that Spark 1.5 is extending that list of Spark SQL functions quite a bit. Not sure where in the docs they would be yet, but the JIRA is here: https://issues.apache.org/jira/browse/SPARK-8159 On Thu, Aug 6, 2015 at 7:27 PM, Netwaver wrote: > Thanks for your kindly help > > > > > > >

Out of memory with twitter spark streaming

2015-08-06 Thread Pankaj Narang
Hi, I am running an application using Activator where I am retrieving tweets and storing them to a MySQL database using the code below. I get an OOM error after 5-6 hours with -Xmx1048M. If I increase the memory, the OOM is only delayed. Can anybody give me a clue? Here is the code: var tweetStream = Twi

Spark-submit fails when jar is in HDFS

2015-08-06 Thread abraithwaite
Hi All, We're trying to run spark with mesos and docker in client mode (since mesos doesn't support cluster mode) and load the application Jar from HDFS. The following is the command we're running: We're getting the following warning before an exception from that command: Before I debug furthe

Re: All masters are unresponsive! Giving up.

2015-08-06 Thread Sonal Goyal
There seems to be a version mismatch somewhere. You can try and find out the cause with debug serialization information. I think the jvm flag -Dsun.io.serialization.extendedDebugInfo=true should help. Best Regards, Sonal Founder, Nube Technologies Check out Reifier at Spa
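
One way to pass the suggested flag to the executors is through SparkConf, as in this sketch (the driver JVM is usually already running by then, so for the driver the flag typically goes on the spark-submit command line via --driver-java-options instead):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("debug-serialization")
      .set("spark.executor.extraJavaOptions",
        "-Dsun.io.serialization.extendedDebugInfo=true")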

StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-06 Thread praveen S
Is StringIndexer + VectorAssembler equivalent to HashingTF while converting the document for analysis?