Re: Specify log4j properties file

2016-03-09 Thread Matt Narrell
You can also use --files, which doesn't require the file scheme. On Wed, Mar 9, 2016 at 11:20 AM Ashic Mahtab wrote: > Found it. > > You can pass in the JVM parameter log4j.configuration. The following works: > > -Dlog4j.configuration=file:path/to/log4j.properties > > It doesn't work without the
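
For reference, a minimal sketch of both approaches (the application class, jar name, and file paths are placeholders, not from the thread):

    # Driver-side: point the driver JVM at a local log4j file via a system property
    spark-submit \
      --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
      --class com.example.MyApp myapp.jar

    # Ship the file with the job and reference it by plain name in the executor JVMs
    spark-submit \
      --files /path/to/log4j.properties \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --class com.example.MyApp myapp.jar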

Re: laziness in textFile reading from HDFS?

2015-10-03 Thread Matt Narrell
Is there any more information or best-practice guidance here? I have the exact same issues when reading large data sets from HDFS (larger than available RAM) and I cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER, and using nearly all the cluster resources. Should I repartit
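
For context, a minimal Scala sketch of the persistence setting in question (assumes an existing SparkContext named sc; the HDFS path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // Spill serialized partitions to local disk when they do not fit in memory,
    // instead of recomputing them from HDFS on every pass.
    val lines = sc.textFile("hdfs:///data/large-dataset")
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(lines.count())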

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
reason for caching the RDD? How many passes do you make > over the dataset? > > Mohammed > > -Original Message----- > From: Matt Narrell [mailto:matt.narr...@gmail.com] > Sent: Saturday, October 3, 2015 9:50 PM > To: Mohammed Guller > Cc: davidkl; user@spark.a

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
> save operation, I don't see how caching would help. > > Mohammed > > > -Original Message- > From: Matt Narrell [mailto:matt.narr...@gmail.com] > Sent: Tuesday, October 6, 2015 3:32 PM > To: Mohammed Guller > Cc: davidkl; user@spark.apache.org >

[Spark ML] HasInputCol, etc.

2015-07-28 Thread Matt Narrell
Hey, Our ML ETL pipeline has several complex steps that I’d like to address with custom Transformers in an ML Pipeline. Looking at the Tokenizer and HashingTF transformers, I see these handy traits (HasInputCol, HasLabelCol, HasOutputCol, etc.), but they have strict access modifiers. How can I
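
Since those shared Param traits are private to the org.apache.spark.ml package, one common workaround is to declare equivalent Params on your own Transformer. A rough Scala sketch against the Spark 1.4-era ML API; the class name, column handling, and cleaning logic are illustrative only:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.{Param, ParamMap}
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Declares its own inputCol/outputCol Params instead of the private[ml] Has* traits.
    class CleanTextTransformer(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("cleanText"))

      val inputCol  = new Param[String](this, "inputCol", "input column name")
      val outputCol = new Param[String](this, "outputCol", "output column name")
      def setInputCol(value: String): this.type  = set(inputCol, value)
      def setOutputCol(value: String): this.type = set(outputCol, value)

      override def transform(df: DataFrame): DataFrame = {
        val clean = udf { s: String => s.trim.toLowerCase }   // placeholder ETL step
        df.withColumn($(outputCol), clean(df($(inputCol))))
      }

      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField($(outputCol), StringType, nullable = true))

      override def copy(extra: ParamMap): CleanTextTransformer = defaultCopy(extra)
    }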

Re: Spark RDD join with CassandraRDD

2015-08-25 Thread Matt Narrell
I would suggest converting your RDDs to DataFrames (or SchemaRDDs, depending on your version) and performing a native join. mn > On Aug 25, 2015, at 9:22 AM, Priya Ch wrote: > > Hi All, > > I have the following scenario: > > There exists a booking table in Cassandra, which holds the field
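
A rough Scala sketch of that approach, assuming the Spark 1.3+ DataFrame API and an existing SQLContext named sqlContext; the case classes and the source RDDs (bookingsRDD, paymentsRDD) are hypothetical stand-ins for the poster's data:

    import sqlContext.implicits._

    case class Booking(bookingId: String, room: String)
    case class Payment(bookingId: String, amount: Double)

    // bookingsRDD could come from the Cassandra connector (sc.cassandraTable),
    // paymentsRDD from anywhere else in the application.
    val bookingsDF = bookingsRDD.toDF()
    val paymentsDF = paymentsRDD.toDF()

    // A native DataFrame join instead of RDD.join on keyed pairs
    val joined = bookingsDF.join(paymentsDF, bookingsDF("bookingId") === paymentsDF("bookingId"))
    joined.show()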

Spark Streaming on YARN with loss of application master

2015-03-30 Thread Matt Narrell
I’m looking at various HA scenarios with Spark Streaming. We’re currently running a Spark Streaming job that is intended to be long-lived, 24/7. We see that if we kill node managers that are hosting Spark workers, new node managers assume execution of the jobs that were running on the stopped

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Matt Narrell
I’d suggest something like Apache YARN, or Apache Mesos with Marathon (or something similar), to manage the driver and, in particular, restart it on failure. mn On Aug 13, 2014, at 7:15 PM, Tobias Pfeiffer wrote: > Hi, > > On Thu, Aug 14, 2014 at 5:49 AM, salemi wrote: > what is the best way to make
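
Whatever restarts the driver process, the checkpoint-based recovery pattern is worth pairing with it so the restarted driver picks up where it left off. A minimal Scala sketch, assuming a durable (e.g., HDFS) checkpoint directory; the path, app name, and batch interval are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-stream"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("resilient-stream")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)        // durable metadata for driver recovery
      // build the DStream graph here before returning
      ssc
    }

    // Rebuilds the context from the checkpoint after a restart; creates it fresh otherwise.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()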

spark-submit with HA YARN

2014-08-18 Thread Matt Narrell
Hello, I have an HA-enabled YARN cluster with two resource managers. When submitting jobs via “spark-submit --master yarn-cluster”, it appears that the driver is looking explicitly for the "yarn.resourcemanager.address” property rather than round-robining through the resource managers via the
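
For context, the YARN ResourceManager HA settings in play live in yarn-site.xml and look roughly like the following (the rm ids and hostnames are placeholders); the question in this thread is whether Spark's YARN client consults these rather than the single yarn.resourcemanager.address:

    <!-- yarn-site.xml, abbreviated -->
    <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
    <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
    <property><name>yarn.resourcemanager.hostname.rm1</name><value>rm1.example.com</value></property>
    <property><name>yarn.resourcemanager.hostname.rm2</name><value>rm2.example.com</value></property>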

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Matt, > > I checked in the YARN code and I don't see any references to > yarn.resourcemanager.address. Have you made sure that your YARN client > configuration on the node you're launching from contains the right configs? > > -Sandy > > > On Mon, Aug

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell wrote: >> However, now the Spark jobs running in the ApplicationMaster on a given node >> fail to find the active resourcemanager. Below is a log excerpt from one >> of the assigned nodes. As all the jobs fail, eventually YARN will move thi

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
OK Marcelo, thanks for the quick and thorough replies. I’ll keep an eye on these tickets and the mailing list to see how things move along. mn On Aug 20, 2014, at 1:33 PM, Marcelo Vanzin wrote: > Hi, > > On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell wrote: >> Specifying t

Re: Spark on YARN question

2014-09-02 Thread Matt Narrell
I’ve put my Spark JAR into HDFS and set the SPARK_JAR variable to point to the HDFS location of the jar. I’m not using any specialized configuration files (like spark-env.sh), but rather setting things either by environment variable per node, passing application arguments to the job, or ma
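
A minimal sketch of that setup using the Spark 1.x-era SPARK_JAR variable (the HDFS path and assembly name are placeholders):

    # On the submitting machine, before invoking spark-submit
    export SPARK_JAR=hdfs:///libs/spark-assembly-1.1.0-hadoop2.5.1.jar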

Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
Hello, I’m using Spark Streaming to aggregate data from a Kafka topic in sliding windows. Usually we want to persist this aggregated data to a MongoDB cluster, or republish to a different Kafka topic. When I include these third-party drivers, I usually get a NotSerializableException due to the
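
A common way around this (a rough Scala sketch; it assumes windowedStream is a DStream of (String, Long) pairs, and the Mongo host, database, and collection names are made up) is to construct the non-serializable client inside foreachPartition, so it lives only on the executors and is never shipped from the driver:

    import com.mongodb.{BasicDBObject, MongoClient}

    windowedStream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Created on the executor for each partition, so it is never serialized.
        val client = new MongoClient("mongo-host", 27017)
        val collection = client.getDB("metrics").getCollection("aggregates")
        records.foreach { case (key, count) =>
          collection.insert(new BasicDBObject(key, count))
        }
        client.close()
      }
    }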

Re: Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
deal with things > like drivers in an efficient way that doesn't trip over serialization. > > On Tue, Sep 2, 2014 at 5:45 PM, Matt Narrell wrote: >> Hello, >> >> I’m using Spark streaming to aggregate data from a Kafka topic in sliding >> windows. Usually we want to pe

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-09-08 Thread Matt Narrell
I came across this: https://github.com/xerial/sbt-pack Until I found this, I was simply using the sbt-assembly plugin (sbt clean assembly). mn On Sep 4, 2014, at 2:46 PM, Aris wrote: > Thanks for answering Daniil - > > I have SBT version 0.13.5, is that an old version? Seems pretty up-to-da

Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
Hey, Spark 1.1.0 Kafka 0.8.1.1 Hadoop (YARN/HDFS) 2.5.1 I have a five-partition Kafka topic. I can create a single Kafka receiver via KafkaUtils.createStream with five threads in the topic map and consume messages fine. Sifting through the user list and Google, I see that it's possible to spl
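
For reference, a rough Scala sketch of the multi-receiver-plus-union approach being described (assumes an existing StreamingContext named ssc; the ZooKeeper quorum, consumer group, and topic name are placeholders). Each createStream call is a separate receiver and occupies one core:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 5
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unioned = ssc.union(kafkaStreams)   // one logical stream over all five receivers
    unioned.count().print()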

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
at 2:55 PM, Tim Smith wrote: > Posting your code would be really helpful in figuring out gotchas. > > On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell wrote: >> Hey, >> >> Spark 1.1.0 >> Kafka 0.8.1.1 >> Hadoop (YARN/HDFS) 2.5.1 >> >> I have a fi

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
Collections.singletonMap(topic, >> 5), >> // StorageLevel.MEMORY_ONLY_SER()); >> >> final JavaPairDStream<String, Integer> tuples = stream.mapToPair( >> new PairFunction<Tuple2<String, String>, String, Integer>() { >> @Override >> publ

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Matt Narrell
and look at Spark or YarnContainer logs > (depending on who's doing RM), you should be able to see if the > receiver has any errors when trying to talk to kafka. > > > > On Tue, Sep 23, 2014 at 3:21 PM, Matt Narrell wrote: >> To my eyes, these are functionally eq

Re: Does Spark Driver works with HDFS in HA mode

2014-09-24 Thread Matt Narrell
Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines mn On Sep 24, 2014, at 5:35 AM, Petr Novak wrote: > Hello, > if our Hadoop cluster is configured with HA and "fs.defaultFS" points to a > namespace instead of a namenode hostname - hdfs:/// - then > our Spark job

Re: Spark with YARN

2014-09-24 Thread Matt Narrell
This just shows the driver. Click the Executors tab in the Spark UI. mn On Sep 24, 2014, at 11:25 AM, Raghuveer Chanda wrote: > Hi, > > I'm new to Spark and facing a problem with running a job in cluster using YARN. > > Initially I ran jobs using the Spark master as --master spark://dml2:7077 and

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
and JavaPairReceiverInputDStream? > > On Wed, Sep 24, 2014 at 7:46 AM, Matt Narrell wrote: >> The part that works is the commented out, single receiver stream below the >> loop. It seems that when I call KafkaUtils.createStream more than once, I >> don’t receive any messages. >>

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
, there are none left to do the processing. If I drop the number of partitions/receivers down while still having multiple unioned receivers, I see messages. mn On Sep 25, 2014, at 10:18 AM, Matt Narrell wrote: > I suppose I have other problems as I can’t get the Scala example to work >

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
Additionally, if I dial up/down the number of executor cores, this does what I want. Thanks for the extra eyes! mn On Sep 25, 2014, at 12:34 PM, Matt Narrell wrote: > Tim, > > I think I understand this now. I had a five-node Spark cluster and a five > partition topic, and I

Re: SPARK UI - Details post job processiong

2014-09-25 Thread Matt Narrell
How does this work with a cluster manager like YARN? mn On Sep 25, 2014, at 2:23 PM, Andrew Or wrote: > Hi Harsha, > > You can turn on `spark.eventLog.enabled` as documented here: > http://spark.apache.org/docs/latest/monitoring.html. Then, if you are running > standalone mode, you can acces

Re: SPARK UI - Details post job processiong

2014-09-26 Thread Matt Narrell
specified place instead of to local disk on a random box on the cluster. > > On Thu, Sep 25, 2014 at 1:38 PM, Matt Narrell wrote: > How does this work with a cluster manager like YARN? > > mn > > On Sep 25, 2014, at 2:23 PM, Andrew Or wrote: > >> Hi Harsha, >>
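
That "specified place" is typically a shared directory (e.g., on HDFS) that the Spark history server also reads after the application finishes. A rough sketch of the relevant settings, with placeholder paths:

    # spark-defaults.conf
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///spark-events
    spark.history.fs.logDirectory    hdfs:///spark-events

    # Then start the history server and browse its UI (port 18080 by default)
    ./sbin/start-history-server.sh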

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Matt Narrell
http://spark.apache.org/docs/latest/streaming-programming-guide.html foreachRDD is executed on the driver. mn > On Oct 20, 2014, at 3:07 AM, Gerard Maas wrote: > > Pinging TD -- I'm sure you know :-) > > -kr, Gerard. >
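
A small Scala sketch of the distinction being made (assumes an existing DStream named stream; the output logic is illustrative): the body of foreachRDD runs on the driver once per batch, while operations invoked on the RDD inside it run on the executors.

    stream.foreachRDD { rdd =>
      // Runs on the driver, once per batch interval.
      println(s"batch contained ${rdd.count()} records")

      rdd.foreachPartition { records =>
        // Runs on the executors, once per partition.
        records.foreach(record => println(record))
      }
    }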

Re: Scala Spark IDE help

2014-10-28 Thread Matt Narrell
So, I'm using IntelliJ 13.x and Scala Spark jobs. Make sure you have singletons (objects, not classes), then simply debug the main function. You’ll need to set your master to some variant of “local”, but that's it. Spark Streaming is kinda wonky when debugging, but data-at-rest behaves like
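
A minimal sketch of the kind of debuggable entry point being described (the object name, app name, and input path are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object DebugJob {                        // an object, so main is directly runnable from the IDE
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("debug-job")
          .setMaster("local[*]")             // run in-process so IDE breakpoints are hit
        val sc = new SparkContext(conf)
        val counts = sc.textFile("src/test/resources/sample.txt")
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println)
        sc.stop()
      }
    }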

Re: Spark to eliminate full-table scan latency

2014-10-28 Thread Matt Narrell
I’ve been puzzled by this lately. I too would like to use the Thrift server to provide JDBC-style access to datasets via Spark SQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use Spark SQL to query those tables while the contex
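
One approach worth looking at (a rough Scala sketch, assuming a build with Hive/Thrift-server support and the Spark 1.3+ DataFrame API; the RDD, case class, and table name are made up) is to register data as a temp table and start the Thrift server against that same context, so JDBC clients can query it for as long as the context lives:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    case class Order(id: Long, amount: Double)

    val hiveContext = new HiveContext(sc)          // sc: an existing SparkContext
    import hiveContext.implicits._

    val ordersDF = ordersRDD.toDF()                // ordersRDD: an RDD[Order] built elsewhere
    ordersDF.registerTempTable("orders")

    // Exposes this context's temp tables over JDBC (port 10000 by default)
    HiveThriftServer2.startWithContext(hiveContext)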

Re: Submiting Spark application through code

2014-10-28 Thread Matt Narrell
Can this be done? Can I just spin up a SparkContext programmatically, point it at my YARN cluster, and have it work like spark-submit? Doesn’t (at least) the application JAR need to be distributed to the workers via HDFS or the like for the jobs to run? mn > On Oct 28, 2014, at 2:29 AM, Akhi

Re: spark-submit and logging

2014-11-20 Thread Matt Narrell
How do I configure the files to be uploaded to YARN containers? So far, I’ve only seen "--conf spark.yarn.jar=hdfs://….” which allows me to specify the HDFS location of the Spark JAR, but I’m not sure how to specify other files for uploading (e.g., spark-env.sh). mn > On Nov 20, 2014, at 4:0
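
For reference, a sketch of shipping extra files with the job via --files (the paths, class, and jar names are placeholders); each listed file is uploaded into every container's working directory and can be referenced there by its plain name:

    spark-submit \
      --master yarn-cluster \
      --conf spark.yarn.jar=hdfs:///libs/spark-assembly.jar \
      --files /etc/spark/log4j.properties,/etc/spark/app.conf \
      --class com.example.MyApp myapp.jar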

Re: Execute Spark programs from local machine on Yarn-hadoop cluster

2014-11-23 Thread Matt Narrell
I think this IS possible. You must set the HADOOP_CONF_DIR variable on the machine where you run the Java process that creates the SparkContext. The Hadoop configuration specifies the YARN ResourceManager IPs, and Spark will use that configuration. mn > On Nov 21, 2014, at 8:10 AM, Prannoy
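
A rough Scala sketch of what that looks like (the app name, jar path, and input path are placeholders), assuming HADOOP_CONF_DIR on the launching machine points at the cluster's YARN/HDFS client configs:

    import org.apache.spark.{SparkConf, SparkContext}

    // yarn-client mode: the driver runs locally and asks the ResourceManager
    // (located via HADOOP_CONF_DIR) for executors on the cluster.
    val conf = new SparkConf()
      .setAppName("remote-yarn-job")
      .setMaster("yarn-client")
      .setJars(Seq("target/myapp-assembly.jar"))   // shipped to the executors

    val sc = new SparkContext(conf)
    println(sc.textFile("hdfs:///data/input").count())
    sc.stop()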

Re: Spark Job submit

2014-12-01 Thread Matt Narrell
Or setting the HADOOP_CONF_DIR property. Either way, you must have the YARN configuration available to the submitting application to allow for the use of “yarn-client” or “yarn-cluster”. The attached stack trace below doesn’t provide any information as to why the job failed. mn > On Nov 27, 20

Re: Is there a way to force spark to use specific ips?

2014-12-06 Thread Matt Narrell
It's much easier if you access your nodes by name. If you’re using Vagrant, use the hosts provisioner: https://github.com/adrienthebo/vagrant-hosts mn > On Dec 6, 2014, at 8:37 AM, Ashic Mahtab wrote: > > Hi, > It appears that Spark is always at