Python: Streaming Question

2014-12-21 Thread Samarth Mailinglist
I’m trying to run the stateful network word count at https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py using the command: ./bin/spark-submit examples/src/main/python/streaming/stateful_network_wordcount.py localhost I am also running
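The running count in that example is maintained by `updateStateByKey`, whose heart is a small update function called once per key per batch. A minimal sketch of that function in plain Python (independent of Spark, so it can be tried standalone; the name `update_func` is illustrative):

```python
def update_func(new_values, last_sum):
    # Called once per key per batch: new_values holds this batch's
    # counts for the key, last_sum holds the previous running total
    # (None the first time the key is seen).
    return sum(new_values) + (last_sum or 0)

# In the streaming job, Spark applies this per key each batch, e.g.:
# running_counts = pairs.updateStateByKey(update_func)
```

Note that stateful streaming also requires a checkpoint directory to be set (`ssc.checkpoint(...)`), which is a common reason this example fails to start.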

Probability in Naive Bayes

2014-11-17 Thread Samarth Mailinglist
I am trying to use Naive Bayes for a project of mine in Python and I want to obtain the probability value after having built the model. Suppose I have two classes - A and B. Currently there is an API to find which class a sample belongs to (predict). Now, I want to find the probability of it be
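A Naive Bayes model scores each class as log P(class) plus the summed feature log-likelihoods; `predict` just returns the argmax. If the per-class log scores are available, turning them into posterior probabilities is a normalisation step. A sketch in plain Python (the function name and the two-class setup are illustrative, not part of the MLlib API):

```python
import math

def class_probabilities(log_scores):
    # log_scores: one log score per class, e.g. [score_A, score_B].
    # Exponentiate and normalise, shifting by the max for numeric
    # stability, to get posterior probabilities that sum to 1.
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, log scores of `log(0.3)` and `log(0.7)` normalise back to probabilities 0.3 and 0.7 for classes A and B.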

Re: spark-submit question

2014-11-17 Thread Samarth Mailinglist
4 4:59 AM, "Samarth Mailinglist" <mailinglistsama...@gmail.com> wrote: > >> I am trying to run a job written in python with the following command: >> >> bin/spark-submit --master spark://localhost:7077 >> /path/spark_solution_basic.py --py-files /path/*.py

Re: Functions in Spark

2014-11-16 Thread Samarth Mailinglist
Check this video out: https://www.youtube.com/watch?v=dmL0N3qfSc8&list=UURzsq7k4-kT-h3TDUBQ82-w On Mon, Nov 17, 2014 at 9:43 AM, Deep Pradhan wrote: > Hi, > Is there any way to know which of my functions perform better in Spark? In > other words, say I have achieved same thing using two differen
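Beyond profiling talks, the simplest way to compare two implementations of the same result is to wall-clock each one. A minimal sketch with Python's `timeit` (plain Python; in Spark you would wrap an action such as `rdd.count()`, since transformations are lazy and take no time on their own):

```python
import timeit

def time_best_of(fn, repeat=3):
    # Best-of-N wall-clock time for a zero-argument callable.
    # For a Spark job, fn should trigger an action (e.g. lambda:
    # rdd.count()) so the lazy pipeline actually executes.
    return min(timeit.repeat(fn, number=1, repeat=repeat))

impl_a = lambda: sum(range(1000))
impl_b = lambda: sum([i for i in range(1000)])
```

Comparing `time_best_of(impl_a)` against `time_best_of(impl_b)` gives a rough answer; for Spark jobs the web UI's stage timings are usually more informative.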

spark-submit question

2014-11-16 Thread Samarth Mailinglist
I am trying to run a job written in python with the following command: bin/spark-submit --master spark://localhost:7077 /path/spark_solution_basic.py --py-files /path/*.py --files /path/config.properties I always get an exception that config.properties is not found: INFO - IOError: [Errno 2] No
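One likely culprit in the command above: `spark-submit` treats everything after the application script as arguments *to the script*, so `--py-files` and `--files` placed after `/path/spark_solution_basic.py` never reach `spark-submit` itself; the flags have to come before the script path. Once `--files` actually ships `config.properties`, it lands in each task's working directory (resolvable via `SparkFiles.get` on the driver). Parsing the Java-style `key=value` file is then plain Python; a sketch (the helper name is hypothetical):

```python
def load_properties(path):
    # Minimal parser for a Java-style .properties file:
    # skips blanks and '#'/'!' comments, splits on the first '='.
    props = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(("#", "!")):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props
```

So the corrected submit order would be: options first, then the script, i.e. `bin/spark-submit --master ... --py-files ... --files /path/config.properties /path/spark_solution_basic.py`.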

Re: Scala vs Python performance differences

2014-11-12 Thread Samarth Mailinglist
I was about to ask this question. On Wed, Nov 12, 2014 at 3:42 PM, Andrew Ash wrote: > Jeremy, > > Did you complete this benchmark in a way that's shareable with those > interested here? > > Andrew > > On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >>

Re: Read a HDFS file from Spark source code

2014-11-11 Thread Samarth Mailinglist
Instead of a file path, use an HDFS URI. For example (in Python): data = sc.textFile("hdfs://localhost/user/someuser/data") On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek wrote: > Hi > > I am trying to access a file in HDFS from spark "source code". Basically, > I am tweaking the spark
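The URI has the usual `scheme://host[:port]/path` shape; a quick sketch with Python's `urllib` just to show how the pieces break down (the host and path are the example values from above; an explicit NameNode port, e.g. `hdfs://localhost:8020/...`, goes after the host):

```python
from urllib.parse import urlparse

uri = "hdfs://localhost/user/someuser/data"
parts = urlparse(uri)
# parts.scheme -> which filesystem Spark should use ("hdfs")
# parts.netloc -> the NameNode host (and optional port)
# parts.path   -> the file or directory within HDFS
```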

Re: Using mongo with PySpark

2014-06-04 Thread Samarth Mailinglist
p[i]] = values[i] > return json > > doc_ids = data.mapPartitions(mapper) > > On Mon, May 19, 2014 at 8:00 AM, Samarth Mailinglist <mailinglistsama...@gmail.com> wrote: > >> db = MongoClient()['spark_test_db'] >> collec

Re: Using mongo with PySpark

2014-05-18 Thread Samarth Mailinglist
db = MongoClient()['spark_test_db']
collec = db['programs']

def mapper(val):
    asc = val.encode('ascii', 'ignore')
    json = convertToJSON(asc, indexMap)
    collec.insert(json)  # this is not working

def convertToJSON(string, indexMap):
    values = string.strip().split(",")
    json = {}

Using mongo with PySpark

2014-05-17 Thread Samarth Mailinglist
Hi all, I am trying to store the results of a reduce into mongo. I want to share the variable "collection" in the mappers. Here's what I have so far (I'm using pymongo): db = MongoClient()['spark_test_db'] collec = db['programs'] def
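The usual fix (and the one the later reply in this thread converges on) is not to share the client at all: a MongoClient holds sockets and cannot be pickled from the driver, so it has to be created inside the function that runs on each partition, via `mapPartitions`. A sketch under those assumptions; `convert_to_json` is a pure helper, and the pymongo call (`insert_one` is the pymongo 3+ name for the older `insert`) is untested here since it needs a running mongod:

```python
def convert_to_json(line, index_map):
    # Pure helper: map the i-th CSV value to the field name index_map[i].
    values = line.strip().split(",")
    return {index_map[i]: values[i] for i in range(len(values))}

def save_partition(lines, index_map):
    # Runs on the worker: open one MongoClient per partition instead
    # of shipping a shared `collec` from the driver (it won't pickle).
    from pymongo import MongoClient  # assumes pymongo is installed on workers
    collec = MongoClient()['spark_test_db']['programs']
    count = 0
    for line in lines:
        collec.insert_one(convert_to_json(line, index_map))
        count += 1
    yield count  # mapPartitions expects an iterator back
```

Usage from the driver would then look like `data.mapPartitions(lambda part: save_partition(part, index_map))`, followed by an action to force execution.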