I'm following the tutorial about Apache Spark on EC2. The output is the
following:
$ ./spark-ec2 -i ../spark.pem -k spark --copy launch spark-training
Setting up security groups...
Searching for existing cluster spark-training...
Latest Spark AMI: ami-19474270
Launching ins
Is there an example of how to load data from a public S3 bucket in Python? I
haven't found any.
Thank you,
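[A minimal sketch of one way to do this, assuming `sc` is an existing SparkContext (e.g. the pyspark shell) and that the cluster's Hadoop build provides the s3n:// filesystem, as it does on EMR and the EC2 scripts. The bucket name and path below are placeholders, not a real dataset.]
# Minimal sketch: read a text file from a public S3 bucket with PySpark.
# Assumes `sc` is an existing SparkContext; bucket and path are placeholders.

# Only needed if the bucket is not readable anonymously:
# sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
# sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

lines = sc.textFile("s3n://some-public-bucket/path/to/data.txt")
print(lines.take(5))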
Hi Sujit,
I just wanted to access public datasets on Amazon. Do I still need to provide
the keys?
Thank you,
From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Tuesday, July 14, 2015 3:14 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: Spark on EMR with S3 example (Python
Is there a way to set the cost value C when using linear SVM?
Can anybody point me to an example, if available, of grid search with Python?
Thank you,
I know grid search with cross validation is not supported. However, I was
wondering if there is something available for the time being.
Thanks,
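[A minimal sketch of a hand-rolled grid search, assuming MLlib's SVMWithSGD, where regParam is the regularization strength and plays roughly the inverse role of the cost value C in other SVM libraries. `training` is the RDD of LabeledPoint from the later snippet; the grid values are arbitrary placeholders.]
# Minimal sketch: hold out a validation split once, then sweep regParam by hand.
from pyspark.mllib.classification import SVMWithSGD

train_cv, test_cv = training.randomSplit(weights=[0.75, 0.25])

best_reg, best_err = None, float("inf")
for reg in [0.001, 0.01, 0.1, 1.0]:   # hypothetical grid
    model = SVMWithSGD.train(train_cv, iterations=100, regParam=reg)
    errors = test_cv.map(lambda p: (p.label, model.predict(p.features))) \
                    .filter(lambda lp: lp[0] != lp[1]).count()
    err = errors / float(test_cv.count())
    if err < best_err:
        best_reg, best_err = reg, err
print(best_reg, best_err)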
From: Punyashloka Biswal [mailto:punya.bis...@gmail.com]
Sent: Thursday, April 23, 2015 9:06 PM
To: Pagliari, Roberto; user@spark.apache.org
Subject
I have an RDD of LabeledPoints.
Is it possible to select a subset of it based on a list of indices?
For example with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with
elements 0,4,5,6 and 8.
values and preserve the original ones.
Thank you,
From: Sven Krasser [mailto:kras...@gmail.com]
Sent: Friday, April 24, 2015 5:56 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: indexing an RDD [Python]
The solution depends largely on your use case. I assume the index is in the
key
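[A minimal sketch along those lines, keeping the index in the key as suggested above. `labeled_points` is a placeholder name for the existing RDD of LabeledPoint.]
# Minimal sketch: keep only the elements whose position is in a given list of indices.
idx = [0, 4, 5, 6, 8]
wanted = set(idx)

# attach a position to each element, keeping the index in the key
indexed_points = labeled_points.zipWithIndex().map(lambda x: (x[1], x[0]))
subset = indexed_points.filter(lambda kv: kv[0] in wanted).values()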
Suppose I have something like the code below
for idx in xrange(0, 10):
    train_test_split = training.randomSplit(weights=[0.75, 0.25])
    train_cv = train_test_split[0]
    test_cv = train_test_split[1]
    # scale train_cv and test_cv by scaling train
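[If the intent is to scale both splits using statistics computed on the training split only, a minimal sketch with MLlib's StandardScaler (an assumption on my part about the desired scaling) could look like this; it reuses the train_cv/test_cv names from the snippet above.]
# Minimal sketch: fit the scaler on the training split only, then apply the same
# transformation to both splits. Assumes RDDs of LabeledPoint; withMean=True
# requires dense feature vectors.
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

scaler = StandardScaler(withMean=True, withStd=True).fit(train_cv.map(lambda p: p.features))

def rescale(rdd):
    # keep the labels and pair them back with the scaled feature vectors
    scaled = scaler.transform(rdd.map(lambda p: p.features))
    return rdd.map(lambda p: p.label).zip(scaled).map(lambda lv: LabeledPoint(lv[0], lv[1]))

train_scaled = rescale(train_cv)
test_scaled = rescale(test_cv)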
With the Python APIs, the available arguments I got (using inspect module) are
the following:
['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights',
'regParam', 'regType', 'intercept']
numClasses is not available. Can someone comment on this?
Thanks,
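[For reference, a minimal sketch of the kind of inspection described above. LogisticRegressionWithSGD is an assumption on my part; its train() arguments match the quoted list.]
# Minimal sketch: list the arguments of an MLlib train() method with the inspect module.
# LogisticRegressionWithSGD is assumed here; note it has no numClasses argument.
import inspect
from pyspark.mllib.classification import LogisticRegressionWithSGD

print(inspect.getargspec(LogisticRegressionWithSGD.train).args)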
I'm a newbie with Spark. After installing it on all the machines I want to use,
do I need to tell it about the Hadoop configuration, or will it be able to find it
by itself?
Thank you,
:08 PM
To: Pagliari, Roberto
Cc: u...@spark.incubator.apache.org
Subject: Re: Spark SQL configuration
You can add `HADOOP_CONF_DIR=your_hadoop_conf_path` to `conf/spark-env.sh` to enable it to:
1. connect to your YARN cluster
2. set `hdfs` as the default FileSystem, otherwise you have to write “hdfs
If I already have hive running on Hadoop, do I need to build Hive using
sbt/sbt -Phive assembly/assembly
command?
If the answer is no, how do I tell spark where hive home is?
Thanks,
Is there a repo or some kind of instruction about how to install sbt for centos?
Thanks,
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive
option to be able to interface with Hive)
I'm getting this
ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop
it first.
Am I doing something wrong? In my specific case, shark+hive is ru
2014 at 4:32 PM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive
option to be able to interface with Hive)
I’m getting this
ip_address: org.apache.spark.deploy.worker.Worker running as process .
I also didn't realize I was trying to bring up the secondary NameNode as a slave..
that might be an issue as well..
Thanks,
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Thursday, October 30, 2014 11:27 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start
I'm using this system:
Hadoop 1.0.4
Scala 2.9.3
Hive 0.9.0
with Spark 1.1.0. When importing pyspark, I'm getting this error:
>>> from pyspark.sql import *
Traceback (most recent call last):
File "", line 1, in ?
File "//spark-1.1.0/python/pyspark/__init__.py", line 63, in ?
from pyspark.
I'm not on the cluster now so I cannot check. What is the minimum requirement
for Python?
Thanks,
From: Davies Liu [dav...@databricks.com]
Sent: Wednesday, November 05, 2014 7:41 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subjec
I'm getting this error when importing hive context
>>> from pyspark.sql import HiveContext
Traceback (most recent call last):
File "", line 1, in
File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in
from pyspark.context import SparkContext
File "/path/spark-1.1.0/python/pys
I'm running the latest version of Spark with Hadoop 1.x, Scala 2.9.3 and
Hive 0.9.0.
When using Python 2.7:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
I'm getting 'sc not defined'
On the other hand, I can see 'sc' from pyspark CLI.
Is there a way to fix it?
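[One common cause: `sc` is created automatically only in the pyspark shell, not when the code runs as a plain script. A minimal sketch of creating the context yourself; the app name is a placeholder.]
# Minimal sketch: outside the pyspark shell no SparkContext exists yet,
# so build one before constructing the HiveContext.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-context-example")
sqlContext = HiveContext(sc)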
I'm executing this example from the documentation (in single node mode)
# sc is an existing SparkContext.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
# Queries can be expressed in HiveQL.
results = sqlContext.sql("FROM src SELECT key, value").collect()