Hi Folks!
I am trying to implement a Spark job to calculate the similarity between the
products in my database, using only their names and descriptions.
I would like to use TF-IDF to represent the text data and cosine similarity to
compute all pairwise similarities.
My goal is, after the job completes, to get all similarities a
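The message is cut off above, but the math it describes can be sketched in plain Python (outside Spark) to make the intended computation concrete. This is a minimal sketch of TF-IDF weighting plus cosine similarity; the sample `products` list and function names are illustrative, not from the original job:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute TF-IDF vectors (dicts of term -> weight) for a list of
    tokenized documents, using raw term frequency and log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

products = [
    "red running shoes".split(),
    "blue running shoes".split(),
    "stainless steel knife".split(),
]
vecs = tf_idf_vectors(products)
```

In the Spark version the same two steps would run over an RDD of tokenized name + description strings; the all-pairs step is the expensive part and usually needs blocking or approximate methods at scale.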
Hi Folks!
I'm running a Spark job on a cluster with 9 slaves and 1 master (250 GB RAM,
32 cores, and 1 TB of storage each).
This job generates 1.2 TB of data in an RDD with 1200 partitions.
When I call saveAsTextFile("hdfs://..."), Spark creates 1200 files named
"part-000*" in the HDFS output folder. H
Hi Folks!
I'm running a Python Spark job on a cluster with 1 master and 10 slaves
(64 GB RAM and 32 cores per machine).
This job reads a 1.2 TB file with 1,128,201,847 lines from HDFS and
calls the KMeans method as follows:
# SLAVE CODE - Reading features from HDFS
def get_features_from
2014-11-18 16:18 GMT-02:00 Sean Owen:
> My guess is you're asking for all cores of all machines but the driver
> needs at least one core, so one executor is unable to find a machine to fit
> on.
> On Nov 18, 2014 7:04 PM, "Alan Prando" wrote:
>
>> Hi Folks!
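Sean's point is that if the application requests every core on every machine, nothing is left for the driver, so one executor cannot be placed. A hedged configuration sketch of capping the request in standalone mode (the numbers assume the 10-slave, 32-core cluster from the question; they are not from the thread):

```python
from pyspark import SparkConf, SparkContext

# 10 machines x 32 cores = 320 cores total. Request fewer than all of
# them so the driver keeps a core and every executor can be scheduled.
conf = (SparkConf()
        .setAppName("kmeans-job")
        .set("spark.cores.max", "310")        # standalone total-core cap
        .set("spark.executor.memory", "48g")) # illustrative value

# On the cluster:
# sc = SparkContext(conf=conf)
```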
Hi Folks!
I'm running Spark on a YARN cluster installed with Cloudera Manager Express.
The cluster has 1 master and 3 slaves, each machine with 32 cores and 64 GB of
RAM.
My Spark job works fine; however, it seems that only 2 of the 3 slaves
are doing work (htop shows 2 slaves at 100% on all 32 cores, a
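When only some NodeManagers receive work, the usual suspects are the requested executor count and per-executor resources. A hedged sketch of explicitly asking YARN for one executor per slave (the numbers assume the 3-slave, 32-core, 64 GB machines above and must fit within the YARN container limits Cloudera Manager configured):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")              # YARN mode in Spark 1.x
        .set("spark.executor.instances", "3")  # one executor per slave
        .set("spark.executor.cores", "32")
        .set("spark.executor.memory", "48g"))  # leave room for OS/overhead

# On the cluster:
# sc = SparkContext(conf=conf)
```

If an executor's memory request (plus overhead) exceeds `yarn.nodemanager.resource.memory-mb`, YARN simply places fewer executors than asked, which would also explain an idle slave.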
Hi all,
I'm trying to read an HBase table using an example from GitHub (
https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py);
however, I have two qualifiers in one column family.
Ex.:
ROW                COLUMN+CELL
row1               column=f1:1, timestamp=1401883411986, value=valu
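In the linked example, each row comes back from `newAPIHadoopRDD` as a single string even when a column family has several qualifiers; later revisions of that example's converter join the cells with newlines so they can be split back out with `flatMapValues`. A minimal sketch of just the splitting step, using a hardcoded stand-in for one converter result (the JSON-per-cell format and its keys are assumptions based on the updated example, not verified against your converter):

```python
import json

# Stand-in for one (row key, value) pair as a newline-joined string:
# one JSON document per cell, covering both qualifiers of family f1.
row_key = "row1"
raw_value = "\n".join([
    json.dumps({"columnFamily": "f1", "qualifier": "1", "value": "value1"}),
    json.dumps({"columnFamily": "f1", "qualifier": "2", "value": "value2"}),
])

# Equivalent of:
#   rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)
cells = [json.loads(s) for s in raw_value.split("\n")]
qualifiers = [c["qualifier"] for c in cells]
```

With the stock 2014-era converter that returns only one value per row, the converter itself has to be extended; the splitting trick only works once all cells survive the conversion.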