Not able to run SparkPi locally

2015-05-23 Thread Sujit Pal
Hello all, This is probably me doing something obviously wrong; I would really appreciate some pointers on how to fix it. I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [https://spark.apache.org/downloads.html] and just untarred it on a local drive. I am on Mac OS X 10.9.5

Re: Not able to run SparkPi locally

2015-05-23 Thread Sujit Pal
make this permanent I put this in conf/spark-env.sh. -sujit On Sat, May 23, 2015 at 8:14 AM, Sujit Pal wrote: > Hello all, > > This is probably me doing something obviously wrong, would really > appreciate some pointers on how to fix this. > > I installed spark-1.3.1-bin-

Re: Access several s3 buckets, with credentials containing "/"

2015-06-06 Thread Sujit Pal
Hi Pierre, One way is to recreate your credentials until AWS generates one without a slash character in it. Another way I've been using is to pass these credentials outside the S3 file path by setting the following (where sc is the SparkContext). sc._jsc.hadoopConfiguration().set("fs.s3n.awsA
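
The snippet above is truncated; a minimal PySpark sketch of the same idea, with placeholder credentials and bucket names (the fs.s3n.* keys are the standard Hadoop s3n settings):

from pyspark import SparkContext

sc = SparkContext(appName="s3-credentials-example")

# Put the keys on the Hadoop configuration instead of embedding them in the
# s3n:// URL, so a "/" in the secret key is no longer a problem.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

rdd = sc.textFile("s3n://your-bucket/your-folder/")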

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Sujit Pal
Hi Rexx, In general (i.e., not Spark-specific), it's best to convert categorical data to one-hot encoding rather than integers - that way the algorithm doesn't use the ordering implicit in the integer representation. -sujit On Tue, Jun 16, 2015 at 1:17 PM, Rex X wrote: > Is it necessary to convert
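
A tiny framework-agnostic sketch of the difference, using made-up category values:

# Integer coding (e.g. red=0, green=1, blue=2) implies an ordering that does
# not exist; one-hot vectors avoid that.
values = ["red", "green", "blue"]
index = {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(value):
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec

print(one_hot("green"))  # [0, 1, 0]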

Re: How to concatenate two csv files into one RDD?

2015-06-26 Thread Sujit Pal
Hi Rex, If the CSV files are in the same folder and there are no other files, specifying the directory to sc.textFile() (or equivalent) will pull in all the files. If there are other files, you can pass in a pattern that would capture the two files you care about (if that's possible). If neither o
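
A small PySpark sketch of those options, assuming an existing SparkContext sc and hypothetical paths:

# Option 1: point sc.textFile at the directory; it picks up every file in it.
all_lines = sc.textFile("hdfs:///data/csv_dir/")

# Option 2: name just the files you want, either with a glob pattern or as a
# comma-separated list of paths.
two_files = sc.textFile("hdfs:///data/csv_dir/{a,b}.csv")

# Option 3: read them separately and union the results.
merged = sc.textFile("hdfs:///data/csv_dir/a.csv").union(
    sc.textFile("hdfs:///data/csv_dir/b.csv"))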

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Julian, I recently built a Python+Spark application to do search relevance analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on EC2 (so I don't use the PySpark shell; hopefully that's what you are looking for). Can't share the code, but the basic approach is covered in this
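
A bare-bones sketch of that setup: a standalone Python script that creates its own SparkContext and is launched with spark-submit (all names and paths are placeholders, not the author's code):

# my_job.py -- a self-contained PySpark job, no pyspark shell needed.
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("my-standalone-job")
    sc = SparkContext(conf=conf)
    counts = (sc.textFile("hdfs:///data/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///data/word_counts")
    sc.stop()

Launched with something like: spark-submit --master spark://your-master:7077 my_job.py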

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Jul 8, 2015 at 9:59 AM, Sujit Pal wrote: > > Hi Julian, > > > > I recently built a Python+Spark application to do search relevance > > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster > on > > EC2 (so I don't use the PySpark shell, ho

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
> SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM > FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", M2_HOME="D:\MAVEN\BIN", > MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\" > >

Re: PySpark without PySpark

2015-07-09 Thread Sujit Pal
rong here. Cannot seem to figure out, what > is it? > > Thank you for your help > > > Sincerely, > Ashish Dutt > > On Thu, Jul 9, 2015 at 11:53 AM, Sujit Pal wrote: > >> Hi Ashish, >> >> >> Nice post. >> Agreed, kudos to the author of th

Re: PySpark without PySpark

2015-07-09 Thread Sujit Pal
enshot of the error > message 7.png > > Hope you can help me out to fix this problem. > Thank you for your time. > > Sincerely, > Ashish Dutt > PhD Candidate > Department of Information Systems > University of Malaya, Lembah Pantai, > 50603 Kuala Lumpur, Mala

Re: PySpark without PySpark

2015-07-10 Thread Sujit Pal
ult. > Am i correct in this visualization ? > > Once again, thank you for your efforts. > > > Sincerely, > Ashish Dutt > PhD Candidate > Department of Information Systems > University of Malaya, Lembah Pantai, > 50603 Kuala Lumpur, Malaysia > > On Fri, Jul 10,

Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Sujit Pal
Hi Roberto, I have written PySpark code that reads from private S3 buckets; it should be similar for public S3 buckets as well. You need to set the AWS access and secret keys into the SparkContext, then you can access the S3 folders and files with their s3n:// paths. Something like this: sc = Spa
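
The code is cut off at "sc = Spa..."; a hedged reconstruction of the general pattern (not the author's exact code), here using the spark.hadoop.* prefix that Spark copies into the Hadoop configuration, with placeholder keys and bucket:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("emr-s3-example")
        .set("spark.hadoop.fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
        .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY"))
sc = SparkContext(conf=conf)

# With the keys in place, s3n:// paths behave like any other filesystem path.
lines = sc.textFile("s3n://some-bucket/some-folder/")
print(lines.count())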

Re: Spark on EMR with S3 example (Python)

2015-07-15 Thread Sujit Pal
ide the keys? > > > > Thank you, > > > > > > *From:* Sujit Pal [mailto:sujitatgt...@gmail.com] > *Sent:* Tuesday, July 14, 2015 3:14 PM > *To:* Pagliari, Roberto > *Cc:* user@spark.apache.org > *Subject:* Re: Spark on EMR with S3 example (Python) > > &g

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Sujit Pal
Hi Wush, One option may be to try a replicated join. Since your rdd1 is small, read it into a collection and broadcast it to the workers, then filter your larger rdd2 against the collection on the workers. -sujit On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain wrote: > Leftouterjoin and join ap
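
A rough PySpark sketch of such a replicated (map-side) join, assuming rdd_small and rdd_large are both RDDs of (key, value) pairs and sc is the SparkContext:

# Collect the small side to the driver and broadcast it as a lookup table.
small_map = dict(rdd_small.collect())
small_bc = sc.broadcast(small_map)

# Left outer join of the large side against the broadcast table: no shuffle,
# missing keys come back as None.
joined = rdd_large.map(lambda kv: (kv[0], (kv[1], small_bc.value.get(kv[0]))))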

Re: use S3-Compatible Storage with spark

2015-07-17 Thread Sujit Pal
Hi Schmirr, The part after the s3n:// is your bucket name and folder name, i.e. s3n://${bucket_name}/${folder_name}[/${subfolder_name}]*. Bucket names are unique across S3, so the resulting path is also unique. There is no concept of hostname in S3 URLs as far as I know. -sujit On Fri, Jul 17, 20

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread Sujit Pal
Hi Charles, I tried this with dummied-out functions which just sum transformations of a list of integers; maybe they could be replaced by algorithms in your case. The idea is to call them through a "god" function that takes an additional type parameter and delegates out to the appropriate function
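
A toy sketch of that dispatch pattern, with dummy summing functions standing in for real algorithms and the shared data broadcast to the workers (sc is an existing SparkContext):

def algo_sum(xs):
    return sum(xs)

def algo_sum_squares(xs):
    return sum(x * x for x in xs)

dispatch = {"sum": algo_sum, "sum_squares": algo_sum_squares}

def run_algo(name, xs):
    # The "god" function: picks the real algorithm based on a type parameter.
    return (name, dispatch[name](xs))

data_bc = sc.broadcast(list(range(10)))

# Parallelize the algorithm names; each task applies its algorithm to the
# same broadcast data.
results = (sc.parallelize(list(dispatch.keys()))
             .map(lambda name: run_algo(name, data_bc.value))
             .collect())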

Re: pyspark mapPartitions()

2016-05-14 Thread Sujit Pal
I built this recently using the accepted answer on this SO page: http://stackoverflow.com/questions/26741714/how-does-the-pyspark-mappartitions-function-work/26745371 -sujit On Sat, May 14, 2016 at 7:00 AM, Mathieu Longtin wrote: > From memory: > def processor(iterator): > for item in iterat
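
The quoted snippet is cut off; a complete minimal version of the same pattern (an iterator-in, iterator-out function handed to mapPartitions, assuming an existing SparkContext sc):

def processor(iterator):
    # Receives every element of one partition as an iterator and must
    # return (or yield) an iterator of results.
    for item in iterator:
        yield item * 2

rdd = sc.parallelize(range(10), 3)
print(rdd.mapPartitions(processor).collect())  # each value doubled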

Re: Save RandomForest Model from ML package

2015-10-22 Thread Sujit Pal
Hi Sebastian, You can save models to disk and load them back up. In the snippet below (copied out of a working Databricks notebook), I train a model, then save it to disk, then retrieve it back into model2 from disk. import org.apache.spark.mllib.tree.RandomForest > import org.apache.spark.mllib.
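
The Scala snippet is truncated; for reference, a hedged sketch of the equivalent MLlib calls from PySpark (assuming training_data is an RDD of LabeledPoint, sc is the SparkContext, and the model path is a placeholder):

from pyspark.mllib.tree import RandomForest, RandomForestModel

model = RandomForest.trainClassifier(training_data, numClasses=2,
                                     categoricalFeaturesInfo={}, numTrees=10)

# Save the trained model to disk, then load it back into a second variable.
model.save(sc, "hdfs:///models/my_rf_model")
model2 = RandomForestModel.load(sc, "hdfs:///models/my_rf_model")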

Re: How to close connection in mapPartitions?

2015-10-23 Thread Sujit Pal
Hi Bin, Very likely the RedisClientPool is being closed too quickly before map has a chance to get to it. One way to verify would be to comment out the .close line and see what happens. FWIW I saw a similar problem writing to Solr where I put a commit where you have a close, and noticed that the c
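
A generic sketch of the open/consume/close pattern being discussed, with a hypothetical SomeClient standing in for the actual RedisClientPool:

def write_partition(records):
    client = SomeClient()        # hypothetical; one connection per partition
    try:
        for record in records:   # the partition iterator is fully consumed here...
            client.write(record)
    finally:
        client.close()           # ...so closing afterwards is safe

rdd.foreachPartition(write_partition)

With mapPartitions the returned iterator is lazy, so a close() that runs before the results are actually consumed can shut the connection down too early; foreachPartition (or materializing the results first) sidesteps that.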

Re: How to get inverse Matrix / RDD or how to solve linear system of equations

2015-10-23 Thread Sujit Pal
Hi Zhiliang, For a system of equations AX = y, Linear Regression will give you a best-fit estimate for A (coefficient vector) for a matrix of feature variables X and corresponding target variable y for a subset of your data. OTOH, what you are looking for here is to solve for x a system of equatio
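
To make the distinction concrete, a tiny local (non-Spark) example using the usual A x = y convention, with A the known matrix and x the unknown vector:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0])

# Solving the system A x = y exactly (square, non-singular A):
x = np.linalg.solve(A, y)                         # array([0.8, 1.4])

# Linear regression answers a different question: find the coefficients that
# best fit many (features, target) rows in a least-squares sense.
coef, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)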

Re: Prevent possible out of memory when using read/union

2015-11-04 Thread Sujit Pal
Hi Alexander, You may want to try the wholeTextFiles() method of SparkContext. Using that you could just do something like this: sc.wholeTextFiles("hdfs://input_dir") > .saveAsSequenceFile("hdfs://output_dir") The wholeTextFiles returns an RDD of (filename, content) pairs. http://spark.apache.or
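
A slightly fuller PySpark sketch of the same idea, including reading the consolidated data back (paths are placeholders, sc is an existing SparkContext):

# wholeTextFiles reads many small files in one pass as (filename, content) pairs.
pairs = sc.wholeTextFiles("hdfs:///input_dir")
pairs.saveAsSequenceFile("hdfs:///output_dir")

# Later, read the consolidated sequence file back as the same pairs.
restored = sc.sequenceFile("hdfs:///output_dir")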

Please add us to the Powered by Spark page

2015-11-13 Thread Sujit Pal
Graphs, Content as a Service, Content and Event Analytics, Content/Event based Predictive Models and Big Data Processing. We use Scala and Python over Databricks Notebooks for most of our work. Thanks very much, Sujit Pal Technical Research Director Elsevier Labs sujit@elsevier.com

Re: Please add us to the Powered by Spark page

2015-11-23 Thread Sujit Pal
, Content and Event Analytics, Content/Event based Predictive Models and Big Data Processing. We use Scala and Python over Databricks Notebooks for most of our work. Thanks very much, Sujit On Fri, Nov 13, 2015 at 9:21 AM, Sujit Pal wrote: > Hello, > > We have been using Spark at Else

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Sujit Pal
6 AM, Sean Owen wrote: > Not sure who generally handles that, but I just made the edit. > > On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote: > > Sorry to be a nag, I realize folks with edit rights on the Powered by > Spark > > page are very busy people, but its been 1

Re: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Sujit Pal
Hi Ningjun, I haven't done this myself, but I saw your question, was curious about the answer, and found this article which you might find useful: http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/ According to this article, you can pass in your SQL statement in
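
A hedged PySpark sketch of the approach the article describes: passing the SQL as a subquery in the JDBC "dbtable" option (server, database, credentials, and table are all placeholders; the SQL Server JDBC driver JAR must be on the classpath):

query = "(SELECT id, name, amount FROM dbo.orders WHERE amount > 100) AS tmp"

df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", query)
      .option("user", "myuser")
      .option("password", "mypassword")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())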

How to increase parallelism of a Spark cluster?

2015-07-31 Thread Sujit Pal
Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers; each worker has 60 GB RAM and 4 CPUs. The external webservice is a standalone Solr server, and is accessed using code similar to that shown below. def getResult
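
The code in the message is truncated at "def getResult..."; a generic sketch of the overall pattern under discussion (the webservice call is a stub, doc_ids is a hypothetical list of work items, sc is the SparkContext):

def get_result(doc_id):
    # Stand-in for the real Solr/webservice call in the original message.
    return "response-for-%s" % doc_id

def call_service(ids):
    for doc_id in ids:
        yield (doc_id, get_result(doc_id))

doc_ids = ["doc-%d" % i for i in range(1000)]

# Use more partitions than total cores (4 workers x 4 CPUs = 16) so the
# scheduler always has tasks available for every core.
results = (sc.parallelize(doc_ids, 64)
             .mapPartitions(call_service)
             .collect())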

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
advance for any help you can provide. -sujit On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal wrote: > Hello, > > I am trying to run a Spark job that hits an external webservice to get > back some information. The cluster is 1 master + 4 workers, each worker has > 60GB RAM and 4 CPU

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
AM, Igor Berman wrote: > What kind of cluster? How many cores on each worker? Is there config for > http solr client? I remember standard httpclient has limit per route/host. > On Aug 2, 2015 8:17 PM, "Sujit Pal" wrote: > >> No one has any ideas? >> >>

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Sujit Pal
nd > - so partitions is essentially helping your work size rather than execution > parallelism). > > [Disclaimer: I am no authority on Spark, but wanted to throw my spin based > my own understanding]. > > Nothing official about it :) > > -abhishek- > > On Jul 31, 2015,

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Sujit Pal
Hi Hao, I think sc.broadcast will allow you to broadcast non-serializable objects. According to the scaladocs the Broadcast class itself is Serializable and it wraps your object, allowing you to get it from the Broadcast object using value(). Not 100% sure though since I haven't tried broadcastin

Re: Scala: How to match a java object????

2015-08-18 Thread Sujit Pal
Hi Saif, Would this work? import scala.collection.JavaConversions._ new java.math.BigDecimal(5) match { case x: java.math.BigDecimal => x.doubleValue } It gives me the following on the Scala console: res9: Double = 5.0 Assuming you had a stream of BigDecimals, you could just call map on it. myBigDecimals.

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
Hi Zhiliang, Would something like this work? val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0)) -sujit On Mon, Sep 21, 2015 at 7:58 AM, Zhiliang Zhu wrote: > Hi Romi, > > Thanks very much for your kind help comment~~ > > In fact there is some valid backgroud of the application, it is about R >

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
ver, do you know the corresponding spark Java API > achievement... > Is there any java API as scala sliding, and it seemed that I do not find > spark scala's doc about sliding ... > > Thank you very much~ > Zhiliang > > > > On Monday, September 21, 2015 11:

Re: Calling a method parallel

2015-09-23 Thread Sujit Pal
Hi Tapan, Perhaps this may work? It takes the range 0..100, creates an RDD out of it, then calls X(i) on each element. The X(i) should be executed on the workers in parallel. Scala: val results = sc.parallelize(0 until 100).map(idx => X(idx)) Python: results = sc.parallelize(range(100)).map(lambda

Re: How to subtract two RDDs with same size

2015-09-23 Thread Sujit Pal
Hi Zhiliang, How about doing something like this? val rdd3 = rdd1.zip(rdd2).map(p => p._1.zip(p._2).map(z => z._1 - z._2)) The first zip will join the two RDDs and produce an RDD of (Array[Float], Array[Float]) pairs. On each pair, we zip the two Array[Float] components together to form an A

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread Sujit Pal
Hi Janardhan, Maybe try removing the string "test" from this line in your build.sbt? IIRC, this restricts the models JAR to test scope, so it is only available to test code. "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models", -sujit On Sun, Sep 18, 2016 at 11:01 AM, janardhan shetty wrot

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-19 Thread Sujit Pal
ly(ScalaUDF.scala:87) > at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval( > ScalaUDF.scala:1060) > at org.apache.spark.sql.catalyst.expressions.Alias.eval( > namedExpressions.scala:142) > at org.apache.spark.sql.catalyst.expressions. > InterpretedProjection