create table in hive from spark-sql

2015-09-23 Thread Mohit Singh
Probably a noob question, but I am trying to create a Hive table using spark-sql. Here is what I am trying to do: hc = HiveContext(sc) hdf = hc.parquetFile(output_path) data_types = hdf.dtypes schema = "(" + ", ".join(map(lambda x: x[0] + " " + x[1], data_types)) + ")" hc.sql("CREATE TABLE IF
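A minimal sketch of where that snippet appears to be headed, assuming a hypothetical table name my_table and that output_path points at existing Parquet data:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hc = HiveContext(sc)
    hdf = hc.parquetFile(output_path)  # output_path: hypothetical Parquet directory

    # Build a "(col1 type1, col2 type2, ...)" clause from the DataFrame's dtypes
    schema = "(" + ", ".join(col + " " + dtype for col, dtype in hdf.dtypes) + ")"
    hc.sql("CREATE TABLE IF NOT EXISTS my_table " + schema)

Note that the type names dtypes returns (e.g. 'bigint', 'string') mostly line up with Hive column types, but not all of them do, so the mapping may need touch-ups.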

Re: Spark installation

2015-02-10 Thread Mohit Singh
For a local machine, I don't think there is anything to install. Just unzip and go to $SPARK_DIR/bin/spark-shell and that will open up a REPL... On Tue, Feb 10, 2015 at 3:25 PM, King sami wrote: > Hi, > > I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04). > Could you help me plea

Re: Need a partner

2015-02-10 Thread Mohit Singh
I would be interested too. On Tue, Feb 10, 2015 at 9:41 AM, Kartik Mehta wrote: > Hi Sami and fellow Spark friends, > > I too am looking for joint learning, online. > > I have set up Spark but need to do it on multiple nodes on my home server. > Can we form a group and do group learning? > > Thanks, >

Re: ImportError: No module named pyspark, when running pi.py

2015-02-09 Thread Mohit Singh
I think you have to run it using $SPARK_HOME/bin/pyspark /path/to/pi.py instead of plain "python pi.py". On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar wrote: > *Command:* > sudo python ./examples/src/main/python/pi.py > > *Error:* > Traceback (most recent call last): > File "./examples/src/m

is there a master for spark cluster in ec2

2015-01-28 Thread Mohit Singh
Hi, Probably a naive question, but I am creating a Spark cluster on EC2 using the EC2 scripts bundled with Spark. Is there a master param I need to set: ./bin/pyspark --master [ ] ?? I don't yet fully understand the EC2 concepts, so I just wanted to confirm this. Thanks -- Mohit "When you want succ
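The spark-ec2 launch script starts a standalone master on the cluster; a sketch of pointing pyspark at it, where the hostname is a stand-in for whatever the script reports (spark:// plus port 7077 is the standalone default):

    from pyspark import SparkConf, SparkContext

    # ec2-master-hostname is hypothetical; substitute the master the launch script prints
    conf = SparkConf().setMaster("spark://ec2-master-hostname:7077")
    sc = SparkContext(conf=conf)

The same URL can equally be passed on the command line as ./bin/pyspark --master spark://ec2-master-hostname:7077.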

Using third party libraries in pyspark

2015-01-22 Thread Mohit Singh
Hi, I might be asking something very trivial, but what's the recommended way of using third-party libraries? I am using PyTables to read hdf5 format files. And here is the error trace: print rdd.take(2) File "/tmp/spark/python/pyspark/rdd.py", line , in take res = self.context.runJob(s
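Pure-Python dependencies can be shipped to the workers with SparkContext.addPyFile; native-extension libraries such as PyTables or h5py generally have to be installed on every worker node instead. A sketch, where deps.zip and mymodule are hypothetical:

    # Ship an archive of pure-Python modules to every worker
    sc.addPyFile("/path/to/deps.zip")  # hypothetical archive of .py modules

    def transform(record):
        import mymodule  # hypothetical module inside deps.zip; import inside the task
        return mymodule.process(record)

    result = rdd.map(transform).collect()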

Re: How to create Track per vehicle using spark RDD

2014-10-14 Thread Mohit Singh
Perhaps it's just me, but the "lag" function isn't familiar to me.. Have you tried configuring Spark appropriately? http://spark.apache.org/docs/latest/configuration.html On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar wrote: > Hi, > I have an RDD containing Vehicle Number, timestamp, Position.
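Without a lag/window function, the usual RDD approach to the quoted question is to group the points by vehicle and sort each group by timestamp; a sketch, assuming rows of (vehicle, timestamp, position):

    # rdd: hypothetical RDD of (vehicle, timestamp, position) tuples
    tracks = (rdd.map(lambda r: (r[0], (r[1], r[2])))    # key each point by vehicle
                 .groupByKey()                           # collect each vehicle's points
                 .mapValues(lambda pts: sorted(pts)))    # order the track by timestamp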

Setting up jvm in pyspark from shell

2014-09-10 Thread Mohit Singh
Hi, I am using the pyspark shell and am trying to create an RDD from a numpy matrix: rdd = sc.parallelize(matrix) I am getting the following error: JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2014/09/10 22:41:44 - please wait. JVMDUMP032I JVM requested Heap dump
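sc.parallelize pushes the whole matrix through the driver JVM in one go, which is a plausible trigger for that heap dump. One workaround sketch is to write the matrix out and let the workers read it instead (the shared path is hypothetical and must be visible to every node):

    import numpy as np

    # Write the matrix to a shared location rather than serializing it via the driver
    np.savetxt("/shared/path/matrix.txt", matrix)  # hypothetical shared filesystem path
    rdd = (sc.textFile("/shared/path/matrix.txt")
             .map(lambda line: np.array(line.split(), dtype=float)))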

Personalized Page rank in graphx

2014-08-20 Thread Mohit Singh
Hi, I was wondering if the Personalized PageRank algorithm is implemented in GraphX. If the talks and presentations are to be believed (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) it is, but I can't find the algo code (https://github.com/amplab/graphx/tree

Re: Question on mappartitionwithsplit

2014-08-17 Thread Mohit Singh
Building on what Davies Liu said, how about something like: def indexing(splitIndex, iterator, offset_lists): count = 0 offset = sum(offset_lists[:splitIndex]) if splitIndex else 0 indexed = [] for i, e in enumerate(iterator): index = count + offset + i for j, ele in enumerate(e
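A runnable sketch of the same idea, assuming the goal is a global running index across partitions (the helper names are illustrative, and mapPartitionsWithIndex is the non-deprecated spelling of mapPartitionsWithSplit):

    def make_indexer(offsets):
        def indexing(split_index, iterator):
            # offsets[i] = number of elements in all partitions before partition i
            for i, e in enumerate(iterator):
                yield (offsets[split_index] + i, e)
        return indexing

    # First pass: count the elements in each partition, then turn the counts
    # into running offsets
    counts = rdd.mapPartitionsWithIndex(
        lambda i, it: [(i, sum(1 for _ in it))]).collectAsMap()
    offsets, total = [], 0
    for i in range(rdd.getNumPartitions()):
        offsets.append(total)
        total += counts.get(i, 0)

    indexed = rdd.mapPartitionsWithIndex(make_indexer(offsets))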

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Mohit Singh
p of > that? > Appreciate your help > -Sathish > On Wed, Aug 6, 2014 at 6:22 PM, Mohit Singh wrote: >> My naive set up.. >> Adding >> os.environ['SPARK_HOME'] = "/path/to/spark" >> sys.path.append("/path/to/sp

Re: Regularization parameters

2014-08-06 Thread Mohit Singh
One possible straightforward explanation: your solution(s) might be stuck in local minima, and depending on your weight initialization you are getting different parameters. Maybe use the same initial weights for both runs... or I would probably test the execution with a synthetic dataset
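One way to control for initialization in MLlib's SGD-based trainers is to pass explicit initial weights, so two runs differ only in their regularization settings. A minimal sketch, assuming pyspark.mllib's LogisticRegressionWithSGD and an RDD of LabeledPoint records (points and num_features are hypothetical):

    import numpy as np
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    # Fix the starting point so repeated runs are comparable
    w0 = np.zeros(num_features)  # num_features: hypothetical dimensionality
    model_a = LogisticRegressionWithSGD.train(points, initialWeights=w0)
    model_b = LogisticRegressionWithSGD.train(points, initialWeights=w0)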

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Mohit Singh
My naive set up: adding os.environ['SPARK_HOME'] = "/path/to/spark" and sys.path.append("/path/to/spark/python") at the top of my script, then from pyspark import SparkContext and from pyspark import SparkConf. Execution works from within PyCharm... though my next step is to figure out autocompletion, and I bet the
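A self-contained version of that setup, as a sketch (the paths are placeholders; depending on the Spark version, the bundled py4j zip under python/lib may also need to go on sys.path):

    import os
    import sys

    os.environ['SPARK_HOME'] = "/path/to/spark"  # hypothetical install location
    sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python"))

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("ide-test")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())  # quick smoke test from the IDE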

Reading hdf5 formats with pyspark

2014-07-28 Thread Mohit Singh
Hi, We have set up Spark on an HPC system and are trying to put some data pipelines and algorithms in place. The input data is in hdf5 (these are very high resolution brain images) and it can be read via the h5py library in Python. So, my current approach (which seems to be working) is writing a
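A sketch of one way to parallelize the reading, assuming the HDF5 file sits on a shared filesystem visible to every worker (typical on HPC) and that the path, dataset name, and chunk size are all hypothetical:

    import h5py

    H5_PATH = "/shared/data/brain.h5"  # hypothetical shared path
    DATASET = "images"                 # hypothetical dataset name

    def read_slice(bounds):
        start, stop = bounds
        # Open the file inside the task so each worker reads only its own slice
        with h5py.File(H5_PATH, "r") as f:
            for row in f[DATASET][start:stop]:
                yield row

    with h5py.File(H5_PATH, "r") as f:
        n = f[DATASET].shape[0]
    chunk = 1000
    bounds = [(i, min(i + chunk, n)) for i in range(0, n, chunk)]
    rdd = sc.parallelize(bounds, len(bounds)).flatMap(read_slice)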

Spark "streaming"

2014-05-01 Thread Mohit Singh
Hi, I guess Spark uses "streaming" in the context of streaming live data, but what I mean is something more along the lines of Hadoop Streaming, where one can code in any programming language. Or is something along those lines on the cards? Thanks -- Mohit "When you want success as badly as you wan
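The closest analog in core Spark is RDD.pipe, which runs each partition's elements through an arbitrary external program, one element per line on stdin/stdout. A minimal sketch (grep is just an illustration; any executable available on the workers would do):

    lines = sc.parallelize(["spark", "hadoop", "streaming"])
    piped = lines.pipe("grep -i s").collect()  # keeps lines containing "s"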

filter operation in pyspark

2014-03-03 Thread Mohit Singh
Hi, I have a csv file (say "n" columns). I am trying to do a filter operation like: query = rdd.filter(lambda x: x[1] == "1234") query.take(20) Basically this should return the rows with that specific value? This manipulation is taking quite some time to execute.. (if I can compare.. maybe slo
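Each action re-reads and re-parses the file unless the RDD is cached, which is a common reason repeated filters feel slow. A sketch (the path and column index are illustrative):

    # Parse once, cache, then filter repeatedly against the in-memory copy
    rows = (sc.textFile("/path/to/data.csv")
              .map(lambda line: line.split(",")))
    rows.cache()

    query = rows.filter(lambda x: x[1] == "1234")
    print(query.take(20))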

Re: Beginners Hadoop question

2014-03-03 Thread Mohit Singh
Not sure whether I understand your question correctly or not. If you are trying to use Hadoop (as in the MapReduce programming model), then basically you would have to use the Hadoop APIs to solve your problem. But if you have data stored in HDFS, and you want to use Spark to process that data, then jus
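Reading HDFS data from Spark is a one-liner; a sketch in which the namenode host, port, and path are all hypothetical:

    # Spark reads HDFS paths directly via its Hadoop input support
    rdd = sc.textFile("hdfs://namenode:9000/user/data/input.txt")
    print(rdd.count())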

Lazyoutput format in spark

2014-02-28 Thread Mohit Singh
Hi, Is there an equivalent of LazyOutputFormat in Spark (pyspark)? http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/LazyOutputFormat.html Basically, something where I only save files which have some data in them, rather than saving all the files as
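I am not aware of a direct LazyOutputFormat hook in pyspark; one workaround sketch is to coalesce to the number of non-empty partitions before saving, which reduces (though does not strictly guarantee eliminating) empty output files. Note the counting pass recomputes the RDD unless it is cached:

    # Count records per partition, then shrink the partition count to match
    counts = rdd.mapPartitionsWithIndex(
        lambda i, it: [sum(1 for _ in it)]).collect()
    non_empty = sum(1 for c in counts if c > 0)
    rdd.coalesce(max(non_empty, 1)).saveAsTextFile("/path/to/output")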

Re: JVM error

2014-02-28 Thread Mohit Singh
ext(conf = conf) > On Fri, Feb 28, 2014 at 9:37 AM, Mohit Singh wrote: >> Hi Bryn, >> Thanks for the suggestion. >> I tried that.. >> conf = pyspark.SparkConf().set("spark.executor.memory", "20G") >> But.. got an error

Re: JVM error

2014-02-28 Thread Mohit Singh
arning("Using SPARK_MEM to set amount of memory to use per > executor process is " + > > "deprecated, instead use spark.executor.memory") > > } > > > > Thanks, > > Bryn > > > > > > On Wed, Feb 26, 2014 at 6:28 PM, Mohit Singh

Re: JVM error

2014-02-26 Thread Mohit Singh
.setSparkHome("/your/path/to/spark") > .set("spark.executor.memory", "20G") > .set("spark.logConf", "true") > sc = pyspark.SparkContext(conf = conf) > Hope that helps,
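A complete version of the quoted suggestion, as a sketch (the spark home path and the 20G figure come from the thread; the master URL and app name are hypothetical):

    import pyspark

    conf = (pyspark.SparkConf()
            .setMaster("spark://master-host:7077")  # hypothetical master URL
            .setAppName("memory-config-test")       # hypothetical app name
            .setSparkHome("/your/path/to/spark")
            .set("spark.executor.memory", "20g")
            .set("spark.logConf", "true"))
    sc = pyspark.SparkContext(conf=conf)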

JVM error

2014-02-26 Thread Mohit Singh
Hi, I am experimenting with pyspark lately... Every now and then I see this error being streamed to the pyspark shell, and most of the time the computation/operation completes, but sometimes it just gets stuck... My setup is an 8-node cluster with loads of RAM (256 GB) and space (TBs) per nod