Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Mike Trienis
test ... test ... On Tue, Aug 25, 2015 at 2:10 PM, Mike Trienis wrote: Hello, I a

How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Mike Trienis
Hello, I am using sbt and created a unit test where I create a `HiveContext`, execute a query, and then return. Each time I run the unit test the JVM increases its memory usage until I get the error: Internal error when running tests: java.lang.OutOfMemoryError: PermGen space Exception
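A common workaround in sbt is to fork the test JVM and give it more PermGen space, so the classes loaded by repeated HiveContext creation are reclaimed when the forked JVM exits. A minimal build.sbt sketch, assuming sbt 0.13 and a pre-Java-8 JVM (PermGen was removed in Java 8); the sizes are illustrative:

    // fork tests into a separate JVM so PermGen is freed when tests finish
    fork in Test := true
    javaOptions in Test ++= Seq(
      "-Xmx2g",                  // heap for the forked test JVM (illustrative)
      "-XX:MaxPermSize=512m"     // extra PermGen for Hive/Spark class loading
    )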

Spark SQL window functions (RowsBetween)

2015-08-20 Thread Mike Trienis
Hi All, I would like some clarification regarding window functions for Apache Spark 1.4.0 - https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html In particular, the "rowsBetween" method: val w = Window.partitionBy("name").orderBy("id") df.se
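For reference, a minimal sketch of a rowsBetween frame in the spirit of that blog post, assuming Spark 1.4 and a DataFrame df with name, id and value columns (the column names are illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.avg

    // frame spans the previous row, the current row and the next row
    val w = Window.partitionBy("name").orderBy("id").rowsBetween(-1, 1)

    // moving average of value over that three-row window
    df.select(df("name"), df("id"), avg(df("value")).over(w).as("moving_avg")).show()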

Optimal way to implement a small lookup table for identifiers in an RDD

2015-08-10 Thread Mike Trienis
Hi All, I have an RDD of case class objects. scala> case class Entity( | value: String, | identifier: String | ) defined class Entity scala> Entity("hello", "id1") res25: Entity = Entity(hello,id1) During a map operation, I'd like to return a new RDD that contains all of
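For a small lookup table, the usual pattern is a broadcast variable: each executor receives one read-only copy instead of the table being serialized into every task closure. A minimal sketch, assuming the Entity case class above and a hypothetical identifier-to-label map:

    case class Entity(value: String, identifier: String)

    // hypothetical lookup table, small enough to broadcast
    val lookup = sc.broadcast(Map("id1" -> "label one", "id2" -> "label two"))

    val entities = sc.parallelize(Seq(Entity("hello", "id1"), Entity("world", "id2")))

    // resolve each identifier against the executor-local broadcast copy
    val resolved = entities.map { e =>
      (e, lookup.value.getOrElse(e.identifier, "unknown"))
    }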

Re: Data frames select and where clause dependency

2015-07-20 Thread Mike Trienis
ow() ... Mohammed ... From: Harish Butani [mailto:rhbutani.sp...@gmail.com] Sent: Monday, July 20, 2015 5:37 PM To: Mohammed Guller Cc: Michael Armbrust; Mike Trienis; user@spark.apache.org Subject: Re: Data frames select and where clause

Data frames select and where clause dependency

2015-07-17 Thread Mike Trienis
I'd like to understand why the where field must exist in the select clause. For example, the following select statement works fine - df.select("field1", "filter_field").filter(df("filter_field") === "value").show() However, the next one fails with the error "in operator !Filter (filter_fie
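One way to keep filter_field out of the final output is to filter first on the full DataFrame and project afterwards; a minimal sketch with the same column names:

    // filter before projecting, so filter_field need not survive the select
    df.filter(df("filter_field") === "value")
      .select("field1")
      .show()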

Aggregating metrics using Cassandra and Spark streaming

2015-06-24 Thread Mike Trienis
Hello, I'd like to understand how other people have been aggregating metrics using Spark Streaming and a Cassandra database. Currently I have designed some data models that will store the rolled-up metrics. There are two models that I am considering: CREATE TABLE rollup_using_counters ( metric_1
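For the non-counter variant, the rollups can be computed in Spark and written with the spark-cassandra-connector; a minimal sketch, assuming the connector is on the classpath and a hypothetical metrics.rollup_hourly table with metric_name, hour_bucket and value columns:

    import com.datastax.spark.connector._

    // hypothetical pre-aggregated (metric_name, hour_bucket, value) rows
    val rollups = sc.parallelize(Seq(
      ("metric_1", "2015-06-24 10:00", 42L),
      ("metric_2", "2015-06-24 10:00", 7L)
    ))

    // write the computed rollups; with the counter model, Cassandra itself
    // would do the addition on write instead
    rollups.saveToCassandra("metrics", "rollup_hourly",
      SomeColumns("metric_name", "hour_bucket", "value"))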

Re: Managing spark processes via supervisord

2015-06-05 Thread Mike Trienis
since they are usually foreground processes; with master it's a bit more complicated, ./sbin/start-master.sh goes background, which is not good for supervisor, but anyway I think it's doable (going to set it up too in a few days). On 3 June 2015 at 21:46, Mike Trieni

Managing spark processes via supervisord

2015-06-03 Thread Mike Trienis
Hi All, I am curious to know if anyone has successfully deployed a spark cluster using supervisord? - http://supervisord.org/ Currently I am using the cluster launch scripts, which are working great; however, every time I reboot my VM or development environment I need to re-launch the cluste
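One approach that fits supervisord's foreground-process model is to skip the daemonizing launch scripts and run the master and worker classes directly through spark-class; a hypothetical config sketch (paths, host and port are illustrative):

    ; /etc/supervisord.d/spark.conf -- hypothetical paths and master URL
    [program:spark-master]
    command=/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    autostart=true
    autorestart=true

    [program:spark-worker]
    command=/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master-host:7077
    autostart=true
    autorestart=true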

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-23 Thread Mike Trienis
cutor is simply a JVM instance and as such it can be granted any number of cores and RAM. So check how many cores you have per executor. Sent from Samsung Mobile. Original message From: Mike Trienis Date: 2015/05/22 21:51 (GMT+00:

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
I guess each receiver occupies an executor, so there was only one executor available for processing the job. On Fri, May 22, 2015 at 1:24 PM, Mike Trienis wrote: Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data from K

Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data from Kinesis at 15-second intervals using two streams (i.e. receivers). The job simply grabs the latest batch and pushes it to MongoDB. I believe that the problem is that all tasks are execu
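With one core per worker, the two Kinesis receivers pin two of the available cores for the lifetime of the job, starving the processing. The usual advice is to ensure there are more cores than receivers and to union the receiver streams into one DStream; a sketch, assuming the Spark 1.3-era kinesis-asl API (stream name and endpoint are illustrative):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("consumer"), Seconds(15))

    // each receiver occupies one core for as long as the job runs
    val streams = (1 to 2).map { _ =>
      KinesisUtils.createStream(ssc, "myStream",
        "https://kinesis.us-east-1.amazonaws.com", Seconds(15),
        InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
    }

    // union into a single DStream so the remaining cores share the processing
    val unioned = ssc.union(streams)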

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
evant spark streaming logs that are generated when you do this? I saw a lot of "lease not owned by this Kinesis Client" type of errors, from what I remember. Lemme know! -Chris On May 8, 2015, at 4:36 PM, Mike Trienis wrote:

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
you see errors, you may need to manually delete the DynamoDB table. On Fri, May 8, 2015 at 2:06 PM, Mike Trienis wrote: Hi All, I am submitting the assembled fat jar file by the command: bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.

Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
Hi All, I am submitting the assembled fat jar file with the command: bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.0.jar --class com.xxx.Consumer -0.1-SNAPSHOT.jar It reads the data from Kinesis using the stream name defined in a configuration file. It turns out that it re
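The Kinesis Client Library keeps its checkpoints in a DynamoDB table named after the Kinesis application name (not the stream name), so a redeployed consumer can resume from stale lease state; deleting that table resets it. A sketch, assuming the AWS CLI and a hypothetical application name of Consumer:

    # table name follows the KCL application name, hypothetical here
    aws dynamodb delete-table --table-name Consumer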

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
'lib' directory has an uber jar spark-assembly-1.3.0-hadoop1.0.4.jar. At one point in Spark 1.2 I found a conflict between httpclient versions that my uber jar pulled in for AWS libraries and the one bundled in the spark uber jar. I hand patched the spark uber jar to rem

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
roduce an "uber jar." Fyi, I've been having trouble consuming data out of Kinesis with Spark with no success :( Would be curious to know if you got it working. Vadim On Apr 13, 2015, at 9:36 PM, Mike Trienis wrote: Hi All, I

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
with no success :( Would be curious to know if you got it working. Vadim On Apr 13, 2015, at 9:36 PM, Mike Trienis wrote: Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.

sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn] Merging 'META-INF/NOTICE' with strategy 'rename' [warn] Merging 'META-INF/LICENSE.txt' with strategy 'rename' [warn] Merging 'META-INF/LICENSE' with strat
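Those META-INF collisions are usually resolved with an explicit merge strategy in build.sbt; a minimal sketch, assuming a 2015-era sbt-assembly:

    assemblyMergeStrategy in assembly := {
      // duplicate notice/license files from different jars can be discarded
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }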

Re: Cannot run unit test.

2015-04-08 Thread Mike Trienis
It's because your tests are running in parallel and you can only have one context running at a time. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-run-unit-test-tp14459p22429.html Sent from the Apache Spark User List mailing list archive at Nabble.
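In sbt the quick fix is to run the suites sequentially, so only one SparkContext is alive at a time; a one-line build.sbt setting:

    // run test suites one at a time to avoid overlapping SparkContexts
    parallelExecution in Test := false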

Re: Spark Streaming S3 Performance Implications

2015-04-01 Thread Mike Trienis
ch interval). This goes for any spark streaming implementation - not just Kinesis. Lemme know if that works for you. Thanks! -Chris From: Mike Trienis Sent: Wednesday, March 18, 2015 2:45 PM Subject: S

Spark Streaming S3 Performance Implications

2015-03-18 Thread Mike Trienis
Hi All, I am pushing data from Kinesis stream to S3 using Spark Streaming and noticed that during testing (i.e. master=local[2]) the batches (1 second intervals) were falling behind the incoming data stream at about 5-10 events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking at
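One commonly suggested mitigation is to cut the number of S3 writes per batch, for example by coalescing each RDD before saving (and, if the data allows, using a longer batch interval); a sketch with a hypothetical output path:

    // one output file per batch instead of one per partition; fewer,
    // larger S3 PUTs usually beat many tiny ones (path is hypothetical)
    dstreamData.foreachRDD { (rdd, time) =>
      rdd.coalesce(1).saveAsTextFile(s"s3n://bucket/events-${time.milliseconds}")
    }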

Re: Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
Please ignore my question; you can simply specify the root directory and it looks like Redshift takes care of the rest. copy mobile from 's3://BUCKET_NAME/' credentials json 's3://BUCKET_NAME/jsonpaths.json' On Thu, Mar 5, 2015 at 3:33 PM, Mike Trienis wrote: Hi

Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
Hi All, I am receiving data from AWS Kinesis using Spark Streaming and am writing the data collected in the dstream to S3 using the output function: dstreamData.saveAsTextFiles("s3n://XXX:XXX@/") After running the application for several seconds, I end up with a sequence of directories in S3 tha
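For reference, saveAsTextFiles writes one directory per batch, named prefix-<batch time in ms>[.suffix], so the folder names are reconstructible from the prefix and the batch interval; a sketch with a hypothetical bucket and prefix:

    // each batch lands in s3n://bucket/events-<batchTimeMs>.txt/part-NNNNN
    dstreamData.saveAsTextFiles("s3n://bucket/events", "txt")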

Pushing data from AWS Kinesis -> Spark Streaming -> AWS Redshift

2015-03-01 Thread Mike Trienis
Hi All, I am looking at integrating a data stream from AWS Kinesis into AWS Redshift, and since I am already ingesting the data through Spark Streaming, it seems convenient to also push that data to AWS Redshift at the same time. I have taken a look at the AWS Kinesis connector, although I am not sur

Integrating Spark Streaming with Reactive Mongo

2015-02-26 Thread Mike Trienis
Hi All, I have Spark Streaming set up to write data to a replicated MongoDB database and would like to understand if there would be any issues using the Reactive Mongo library to write directly to MongoDB. My stack is Apache Spark sitting on top of Cassandra for the datastore, so my thinking is

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Mike Trienis
Happy hacking, Chris. From: Franc Carter Date: Wednesday, 11 February 2015 10:03 To: Paolo Platter Cc: Mike Trienis, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: Datastore HDFS v

Datastore HDFS vs Cassandra

2015-02-10 Thread Mike Trienis
Hi, I am considering implementing Apache Spark on top of a Cassandra database after listening to a related talk and reading through the slides from DataStax. It seems to fit well with our time-series data and reporting requirements. http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-fo