Hello all,
This is probably me doing something obviously wrong; I would really
appreciate some pointers on how to fix this.
I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [
https://spark.apache.org/downloads.html] and just untarred it on a local
drive. I am on Mac OSX 10.9.5
To make this permanent, I put this in conf/spark-env.sh.
-sujit
On Sat, May 23, 2015 at 8:14 AM, Sujit Pal wrote:
> Hello all,
>
> This is probably me doing something obviously wrong, would really
> appreciate some pointers on how to fix this.
>
> I installed spark-1.3.1-bin-
Hi Pierre,
One way is to regenerate your credentials until AWS gives you a secret key
without a slash character in it. Another way I've been using is to pass these
credentials outside the S3 file path by setting the following (where sc is
the SparkContext).
sc._jsc.hadoopConfiguration().set("fs.s3n.awsA
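A fuller sketch of that approach (the key values and the bucket path below are
placeholders, not real credentials):
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY_SECRET_KEY")
lines = sc.textFile("s3n://my-bucket/some/folder/data.csv")   # no keys in the path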
Hi Rexx,
In general (i.e., not Spark specific), it's best to convert categorical data
to one-hot encoding rather than integers; that way the algorithm doesn't use
the ordering implicit in the integer representation.
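For illustration, here is a minimal PySpark sketch of one-hot encoding a
categorical value by hand (the category values and the RDD below are made up):
categories = ["red", "green", "blue"]              # hypothetical category values
cat_index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    vec = [0.0] * len(categories)
    vec[cat_index[value]] = 1.0
    return vec

encoded = sc.parallelize(["red", "blue", "green"]).map(one_hot)
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]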
-sujit
On Tue, Jun 16, 2015 at 1:17 PM, Rex X wrote:
> Is it necessary to convert
Hi Rex,
If the CSV files are in the same folder and there are no other files,
specifying the directory to sc.textFile() (or equivalent) will pull in all
the files. If there are other files, you can pass in a pattern that would
capture the two files you care about (if that's possible). If neither o
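For example (the directory and file names here are hypothetical),
sc.textFile() accepts a directory, a comma-separated list of paths, or a glob:
all_files = sc.textFile("hdfs:///data/csvdir")                                  # whole directory
two_files = sc.textFile("hdfs:///data/csvdir/a.csv,hdfs:///data/csvdir/b.csv")  # just two files
by_pattern = sc.textFile("hdfs:///data/csvdir/part-*.csv")                      # glob pattern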
Hi Julian,
I recently built a Python+Spark application to do search relevance
analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on
EC2 (so I don't use the PySpark shell, hopefully that's what you are looking
for). Can't share the code, but the basic approach is covered in this
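For illustration, a minimal sketch of the kind of PySpark job script that gets
handed to spark-submit (the app name and input path are hypothetical, not from
my actual code):
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("relevance-analytics")
    sc = SparkContext(conf=conf)
    num_lines = sc.textFile("s3n://my-bucket/query-logs/").count()
    print("lines: %d" % num_lines)
    sc.stop()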
Jul 8, 2015 at 9:59 AM, Sujit Pal wrote:
> > Hi Julian,
> >
> > I recently built a Python+Spark application to do search relevance
> > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster
> on
> > EC2 (so I don't use the PySpark shell, ho
> SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
> FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", M2_HOME="D:\MAVEN\BIN",
> MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\"
>
>
rong here. Cannot seem to figure out, what
> is it?
>
> Thank you for your help
>
>
> Sincerely,
> Ashish Dutt
>
> On Thu, Jul 9, 2015 at 11:53 AM, Sujit Pal wrote:
>
>> Hi Ashish,
>>
>> >> Nice post.
>> Agreed, kudos to the author of th
enshot of the error
> message 7.png
>
> Hope you can help me out to fix this problem.
> Thank you for your time.
>
> Sincerely,
> Ashish Dutt
> PhD Candidate
> Department of Information Systems
> University of Malaya, Lembah Pantai,
> 50603 Kuala Lumpur, Mala
ult.
> Am i correct in this visualization ?
>
> Once again, thank you for your efforts.
>
>
> Sincerely,
> Ashish Dutt
> PhD Candidate
> Department of Information Systems
> University of Malaya, Lembah Pantai,
> 50603 Kuala Lumpur, Malaysia
>
> On Fri, Jul 10,
Hi Roberto,
I have written PySpark code that reads from private S3 buckets, it should
be similar for public S3 buckets as well. You need to set the AWS access
and secret keys into the SparkContext, then you can access the S3 folders
and files with their s3n:// paths. Something like this:
sc = Spa
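A hedged sketch of that approach end to end (the bucket, folder, and key
values below are placeholders, not my actual code):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("s3-read-example")
sc = SparkContext(conf=conf)

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY_SECRET_KEY")

lines = sc.textFile("s3n://some-public-bucket/some/folder/part-*")
print(lines.take(5))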
ide the keys?
>
>
>
> Thank you,
>
>
>
>
>
> *From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
> *Sent:* Tuesday, July 14, 2015 3:14 PM
> *To:* Pagliari, Roberto
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on EMR with S3 example (Python)
>
>
Hi Wush,
One option may be to try a replicated join. Since your rdd1 is small, read
it into a collection and broadcast it to the workers, then filter your
larger rdd2 against the collection on the workers.
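A rough sketch of what I mean, assuming both RDDs are (key, value) pairs
(rdd1 and rdd2 stand in for your actual data):
small = dict(rdd1.collect())          # rdd1 is small, so collecting is safe
bc = sc.broadcast(small)              # ship the lookup table to every worker

# replicated ("map-side") join: filter and look up against the broadcast dict
joined = (rdd2.filter(lambda kv: kv[0] in bc.value)
              .map(lambda kv: (kv[0], (kv[1], bc.value[kv[0]]))))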
-sujit
On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain wrote:
> Leftouterjoin and join ap
Hi Schmirr,
The part after the s3n:// is your bucket name and folder name, i.e.
s3n://${bucket_name}/${folder_name}[/${subfolder_name}]*. Bucket names are
unique across S3, so the resulting path is also unique. There is no concept
of hostname in S3 URLs as far as I know.
-sujit
On Fri, Jul 17, 20
Hi Charles,
I tried this with dummied-out functions which just sum transformations of a
list of integers; maybe they could be replaced by the algorithms in your case.
The idea is to call them through a "god" function that takes an additional
type parameter and delegates out to the appropriate function
I built this recently using the accepted answer on this SO page:
http://stackoverflow.com/questions/26741714/how-does-the-pyspark-mappartitions-function-work/26745371
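A bare-bones sketch of that idea (the algorithm names and data below are
dummied out, not your actual algorithms):
def double_sum(xs):                 # stand-in algorithm 1
    return sum(x * 2 for x in xs)

def square_sum(xs):                 # stand-in algorithm 2
    return sum(x * x for x in xs)

def god_function(algo, xs):         # dispatch on the extra "type" parameter
    return {"double": double_sum, "square": square_sum}[algo](xs)

def process_partition(records):     # shape expected by mapPartitions
    for algo, xs in records:
        yield (algo, god_function(algo, xs))

data = sc.parallelize([("double", [1, 2, 3]), ("square", [4, 5, 6])])
results = data.mapPartitions(process_partition).collect()
# [('double', 12), ('square', 77)]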
-sujit
On Sat, May 14, 2016 at 7:00 AM, Mathieu Longtin
wrote:
> From memory:
> def processor(iterator):
> for item in iterat
Hi Sebastian,
You can save models to disk and load them back up. In the snippet below
(copied out of a working Databricks notebook), I train a model, then save
it to disk, then retrieve it back into model2 from disk.
import org.apache.spark.mllib.tree.RandomForest
> import org.apache.spark.mllib.
Hi Bin,
Very likely the RedisClientPool is being closed too quickly before map has
a chance to get to it. One way to verify would be to comment out the .close
line and see what happens. FWIW I saw a similar problem writing to Solr
where I put a commit where you have a close, and noticed that the c
Hi Zhiliang,
For a system of equations AX = y, Linear Regression will give you a
best-fit estimate for A (coefficient vector) for a matrix of feature
variables X and corresponding target variable y for a subset of your data.
OTOH, what you are looking for here is to solve for x a system of equatio
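A small NumPy illustration of the difference, with made-up numbers (keeping
the AX = y notation above):
import numpy as np

# regression: X (features) and y (targets) are known; estimate the coefficients A
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 1.9, 3.2])
A_hat = np.linalg.lstsq(X, y, rcond=None)[0]     # best-fit A

# solving a system: the matrix is known and the unknown is the vector itself
M = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
x = np.linalg.solve(M, b)                        # exact solution of M x = b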
Hi Alexander,
You may want to try the wholeTextFiles() method of SparkContext. Using that
you could just do something like this:
sc.wholeTextFiles("hdfs://input_dir")
> .saveAsSequenceFile("hdfs://output_dir")
The wholeTextFiles returns a RDD of ((filename, content)).
http://spark.apache.or
Graphs, Content as a
Service, Content and Event Analytics, Content/Event based Predictive Models
and Big Data Processing. We use Scala and Python over Databricks Notebooks
for most of our work.
Thanks very much,
Sujit Pal
Technical Research Director
Elsevier Labs
sujit@elsevier.com
, Content and Event Analytics, Content/Event based Predictive Models
and Big Data Processing. We use Scala and Python over Databricks Notebooks
for most of our work.
Thanks very much,
Sujit
On Fri, Nov 13, 2015 at 9:21 AM, Sujit Pal wrote:
> Hello,
>
> We have been using Spark at Else
6 AM, Sean Owen wrote:
> Not sure who generally handles that, but I just made the edit.
>
> On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> > Sorry to be a nag, I realize folks with edit rights on the Powered by
> Spark
> > page are very busy people, but its been 1
Hi Ningjun,
Haven't done this myself, but I saw your question, was curious about the
answer, and found this article which you might find useful:
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/
According to this article, you can pass in your SQL statement in
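Roughly what that looks like, as a hedged sketch (the JDBC URL, driver,
credentials, and query below are hypothetical, and the read.format("jdbc")
form assumes Spark 1.4+):
df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://dbhost:5432/mydb",
    dbtable="(select id, title from documents where lang = 'en') as docs",
    driver="org.postgresql.Driver",
    user="spark",
    password="secret").load()
df.show()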
Hello,
I am trying to run a Spark job that hits an external webservice to get back
some information. The cluster is 1 master + 4 workers, each worker has 60GB
RAM and 4 CPUs. The external webservice is a standalone Solr server, and is
accessed using code similar to that shown below.
def getResult
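As a hedged illustration of the general pattern (not my actual getResult; the
Solr host, core, and response handling below are hypothetical):
import json
import urllib2                               # Python 2 stdlib HTTP client

def get_results(queries):                    # runs once per partition
    for q in queries:
        url = "http://solr-host:8983/solr/mycore/select?q=%s&wt=json" % q
        resp = json.loads(urllib2.urlopen(url).read())
        yield (q, resp["response"]["numFound"])

queries = ["spark", "solr", "hadoop"]        # hypothetical query terms
results = sc.parallelize(queries, 4).mapPartitions(get_results).collect()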
advance for any help you can provide.
-sujit
On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal wrote:
> Hello,
>
> I am trying to run a Spark job that hits an external webservice to get
> back some information. The cluster is 1 master + 4 workers, each worker has
> 60GB RAM and 4 CPU
AM, Igor Berman wrote:
> What kind of cluster? How many cores on each worker? Is there config for
> http solr client? I remember standard httpclient has limit per route/host.
> On Aug 2, 2015 8:17 PM, "Sujit Pal" wrote:
>
>> No one has any ideas?
>>
>>
nd
> - so partitions is essentially helping your work size rather than execution
> parallelism).
>
> [Disclaimer: I am no authority on Spark, but wanted to throw my spin based
> my own understanding].
>
> Nothing official about it :)
>
> -abhishek-
>
> On Jul 31, 2015,
Hi Hao,
I think sc.broadcast will allow you to broadcast non-serializable objects.
According to the scaladocs the Broadcast class itself is Serializable and
it wraps your object, allowing you to get it from the Broadcast object
using value().
Not 100% sure though since I haven't tried broadcastin
Hi Saif,
Would this work?
import scala.collection.JavaConversions._
new java.math.BigDecimal(5) match { case x: java.math.BigDecimal =>
x.doubleValue }
It gives me the following on the Scala console:
res9: Double = 5.0
Assuming you had a stream of BigDecimals, you could just call map on it.
myBigDecimals.
Hi Zhiliang,
Would something like this work?
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0))
-sujit
On Mon, Sep 21, 2015 at 7:58 AM, Zhiliang Zhu
wrote:
> Hi Romi,
>
> Thanks very much for your kind help comment~~
>
> In fact there is some valid backgroud of the application, it is about R
>
ver, do you know the corresponding spark Java API
> achievement...
> Is there any java API as scala sliding, and it seemed that I do not find
> spark scala's doc about sliding ...
>
> Thank you very much~
> Zhiliang
>
>
>
> On Monday, September 21, 2015 11:
Hi Tapan,
Perhaps this may work? It takes a range of 0..100 and creates an RDD out of
them, then calls X(i) on each. The X(i) should be executed on the workers
in parallel.
Scala:
val results = sc.parallelize(0 until 100).map(idx => X(idx))
Python:
results = sc.parallelize(range(100)).map(lambda idx: X(idx))
Hi Zhiliang,
How about doing something like this?
val rdd3 = rdd1.zip(rdd2).map(p =>
p._1.zip(p._2).map(z => z._1 - z._2))
The first zip will join the two RDDs and produce an RDD of (Array[Float],
Array[Float]) pairs. On each pair, we zip the two Array[Float] components
together to form an A
Hi Janardhan,
Maybe try removing the string "test" from this line in your build.sbt?
IIRC, the "test" scope restricts the models JAR to the test classpath, so it
isn't available to your main code.
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier
"models",
-sujit
On Sun, Sep 18, 2016 at 11:01 AM, janardhan shetty
wrot
ly(ScalaUDF.scala:87)
> at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(
> ScalaUDF.scala:1060)
> at org.apache.spark.sql.catalyst.expressions.Alias.eval(
> namedExpressions.scala:142)
> at org.apache.spark.sql.catalyst.expressions.
> InterpretedProjection