Naive Bayes

2014-08-19 Thread Phuoc Do
I'm trying the Naive Bayes classifier for the Higgs Boson challenge on Kaggle: http://www.kaggle.com/c/higgs-boson Here's the source code I'm working on: https://github.com/dnprock/SparkHiggBoson/blob/master/src/main/scala/KaggleHiggBosonLabel.scala Training data looks like this: 10,138.47,51.655,9

Performance problem on collect

2014-08-19 Thread Emmanuel Castanier
Hi all, I'm a total newbie with Spark, so my question may be a dumb one. I tried Spark to compute values; on that side everything works perfectly (and it's fast :) ). At the end of the process, I have an RDD with Key(String)/Values(Array of String), from which I want to get only one entry like this: myRdd

Re: Cannot run program "Rscript" using SparkR

2014-08-19 Thread Shivaram Venkataraman
Hi Stuti Could you check if Rscript is installed on all of the worker machines in the Spark cluster ? You can ssh into the machines and check if Rscript can be found in $PATH. Thanks Shivaram On Mon, Aug 18, 2014 at 10:05 PM, Stuti Awasthi wrote: > Hi All, > > > > I am using R 3.1 and Spark

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
What is the ratio of examples labeled `s` to those labeled `b`? Also, Naive Bayes doesn't work on negative feature values. It assumes term frequencies as the input. We should throw an exception on negative feature values. -Xiangrui On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do wrote: > I'm trying Na

Re: Performance problem on collect

2014-08-19 Thread Sean Owen
You can use the function lookup() to accomplish this too; it may be a bit faster. It will never be efficient like a database lookup since this is implemented by scanning through all of the data. There is no index or anything. On Tue, Aug 19, 2014 at 8:43 AM, Emmanuel Castanier wrote: > Hi all, >
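A minimal sketch of the lookup() approach described above, assuming a pair RDD of String keys to Array[String] values (the key and values here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

val sc = new SparkContext(new SparkConf().setAppName("lookup-sketch"))
val myRdd = sc.parallelize(Seq("key1" -> Array("a", "b"), "key2" -> Array("c")))

// lookup() scans every partition for the key -- there is no index
val valuesForKey1: Seq[Array[String]] = myRdd.lookup("key1")
```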

Re: Problem in running a job on more than one workers

2014-08-19 Thread Rasika Pohankar
Hello, On the web UI of the master, even though there are two workers shown, there is only one executor. There is an executor for machine1 but no executor for machine2. Hence, if only machine1 is added as a worker the program runs, but if only machine2 is added, it fails with the same error 'Master r

Re: Performance problem on collect

2014-08-19 Thread Emmanuel Castanier
Thanks for your answer. In my case, that’s sad cause we have only 60 entries in the final RDD, I was thinking it will be fast to get the needed one. Le 19 août 2014 à 09:58, Sean Owen a écrit : > You can use the function lookup() to accomplish this too; it may be a > bit faster. > > It will n

Re: Processing multiple files in parallel

2014-08-19 Thread Sean Owen
sc.textFile already returns just one RDD for all of your files. The sc.union is unnecessary, although I don't know if it's adding any overhead. The data is certainly processed in parallel and how it is parallelized depends on where the data is -- how many InputSplits Hadoop produces for them. If y

Re: Performance problem on collect

2014-08-19 Thread Sean Owen
In that case, why not collectAsMap() and have the whole result as a simple Map in memory? then lookups are trivial. RDDs aren't distributed maps. On Tue, Aug 19, 2014 at 9:17 AM, Emmanuel Castanier wrote: > Thanks for your answer. > In my case, that’s sad cause we have only 60 entries in the fina
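A minimal sketch of the collectAsMap() suggestion, reasonable here only because the final RDD has ~60 entries (names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("collectAsMap-sketch"))
val myRdd = sc.parallelize(Seq("key1" -> Array("a", "b"), "key2" -> Array("c")))

// Pull the whole (small) result to the driver once; lookups are then plain Map access.
val asMap: scala.collection.Map[String, Array[String]] = myRdd.collectAsMap()
val valuesForKey1 = asMap.get("key1")
```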

RE: Cannot run program "Rscript" using SparkR

2014-08-19 Thread Stuti Awasthi
Thanks Shivaram, this was the issue. Now I have installed Rscript on all the nodes in the Spark cluster and it works both from the script as well as the R prompt. Thanks Stuti Awasthi From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Tuesday, August 19, 2014 1:17 PM To: Stuti Awast

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Patrick McGloin
Hi Amit, I think the type of the data contained in your RDD needs to be a known case class and not abstract for createSchemaRDD. This makes sense when you think it needs to know about the fields in the object to create the schema. I had the same issue when I used an abstract base class for a col
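A hedged sketch of what Patrick describes, using the Spark 1.0/1.1 SQL API and an illustrative case class:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A concrete case class (not an abstract base class), so Spark SQL can
// reflect on its fields to build the schema.
case class Record(id: Int, name: String)

val sc = new SparkContext(new SparkConf().setAppName("schema-rdd-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[Record] -> SchemaRDD conversion

val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
records.registerAsTable("records")  // registerTempTable in later 1.x releases
sqlContext.sql("SELECT name FROM records WHERE id = 1").collect()
```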

Partitioning under Spark 1.0.x

2014-08-19 Thread losmi83
Hi guys, I want to create two RDD[(K, V)] objects and then collocate partitions with the same K on one node. When the same partitioner for two RDDs is used, partitions with the same K end up being on different nodes. Here is a small example that illustrates this: // Let's say I have 10 nodes

spark error when distinct on more than one column

2014-08-19 Thread wan...@testbird.com
sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group by app_id  Error Log 14/08/19 17:58:26 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 158.6 KB, free 294.7 MB) Exception in thread "main" java.lang.RuntimeException: [1.36] failure: ``

Executor Memory, Task hangs

2014-08-19 Thread Laird, Benjamin
Hi all, I'm doing some testing on a small dataset (HadoopRDD, 2GB, ~10M records), with a cluster of 3 nodes. Simple calculations like count take approximately 5s when using the default value of executor.memory (512MB). When I scale this up to 2GB, several tasks take 1m or more (while most still

Re: Executor Memory, Task hangs

2014-08-19 Thread Akhil Das
Looks like 1 worker is doing the job. Can you repartition the RDD? Also what is the number of cores that you allocated? Things like this, you can easily identify by looking at the workers webUI (default worker:8081) Thanks Best Regards On Tue, Aug 19, 2014 at 6:35 PM, Laird, Benjamin < benjamin.

Re: Executor Memory, Task hangs

2014-08-19 Thread Sean Owen
Given a fixed amount of memory allocated to your workers, more memory per executor means fewer executors can execute in parallel. This means it takes longer to finish all of the tasks. Set high enough, and your executors can find no worker with enough memory and so they all are stuck waiting for re

Re: Executor Memory, Task hangs

2014-08-19 Thread Laird, Benjamin
Thanks Akhil and Sean. All three workers are doing the work and tasks stall simultaneously on all three. I think Sean hit on my issue. I've been under the impression that each application has one executor process per worker machine (not per core per machine). Is that incorrect? If an executor i

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-19 Thread Peng Cheng
Unfortunately, after some research I found it's just a side effect of how closures containing vars work in Scala: http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined The closure keeps referring to the broadcasted var's wrapper as a pointer,

How to configure SPARK_EXECUTOR_URI to access files from maprfs

2014-08-19 Thread Lee Strawther (lstrawth)
'/usr/local/mesos-0.18.1/maprfs:///mapr/CLUSTER1/MAIN/tmp/spark-1.0.1-bin-mapr3.tgz' to '/tmp/mesos/slaves/20140815-101817-3334820618-5050-32618-0/frameworks/20140819-101702-3334820618-5050-16778-0001/executors/20140815-101817-3334820618-5050-32618-0/runs/e56fffbe-942d-4b15-a798-a0040

Re: Viewing web UI after fact

2014-08-19 Thread Grzegorz Białek
Hi, Is there any way to view the history of application statistics in the master UI after restarting the master server? I have all logs in /tmp/spark-events/ but when I start the history server on this directory it says "No Completed Applications Found". Maybe I could copy these logs to the dir used by the master server but

Re: Writing to RabbitMQ

2014-08-19 Thread jschindler
Thanks for the quick and clear response! I now have a better understanding of what is going on regarding the driver and worker nodes which will help me greatly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-to-RabbitMQ-tp11283p12386.html Sent from

spark on disk executions

2014-08-19 Thread Oleg Ruchovets
Hi, We have ~1TB of data to process, but our cluster doesn't have sufficient memory for such a data set (we have a 5-10 machine cluster). Is it possible to process 1TB of data using ON DISK options with Spark? If yes, where can I read about the configuration for ON DISK executions? Thanks Oleg

Re: spark on disk executions

2014-08-19 Thread Sean Owen
Spark does not require that data sets fit in memory to begin with. Yes, there's nothing inherently problematic about processing 1TB data with a lot less than 1TB of cluster memory. You probably want to read: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence On Tue, Aug
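A minimal sketch of the persistence option the linked guide describes, with an illustrative HDFS path:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("on-disk-sketch"))
val records = sc.textFile("hdfs:///data/big-input")   // illustrative path

// Spark streams through the input regardless of cluster memory; persistence only
// matters for intermediate results that are reused. DISK_ONLY (or MEMORY_AND_DISK)
// keeps those on disk instead of requiring them to fit in RAM.
val parsed = records.map(_.split(",")).persist(StorageLevel.DISK_ONLY)
println(parsed.count())
```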

Updating shared data structure between executors

2014-08-19 Thread Tim Smith
Hi, I am writing some Scala code to normalize a stream of logs using an input configuration file (multiple regex patterns). To avoid re-starting the job, I can read in a new config file using fileStream and then turn the config file to a map. But I am unsure about how to update a shared map (since

Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
Hello, I have a relatively simple python program that works just fine in local mode (--master local) but produces a strange error when I try to run it via Yarn (--deploy-mode client --master yarn) or just execute the code through pyspark. Here's the code: sc = SparkContext(appName="foo") input = sc.te

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Davies Liu
Could you post the completed stacktrace? On Tue, Aug 19, 2014 at 10:47 AM, Aaron wrote: > Hello, I have a relatively simple python program that works just find in > local most (--master local) but produces a strange error when I try to run > it via Yarn ( --deploy-mode client --master yarn) or ju

Re: Failed jobs show up as succeeded in YARN?

2014-08-19 Thread Arun Ahuja
We see this all the time as well. I don't believe there is much of a relationship between the Spark job status and what Yarn shows as the status. On Mon, Aug 11, 2014 at 3:17 PM, Shay Rojansky wrote: > Spark 1.0.2, Python, Cloudera 5.1 (Hadoop 2.3.0) > > It seems that Python jobs I'm sendin

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
Sure thing, this is the stacktrace from pyspark. It's repeated a few times, but I think this is the unique stuff. Traceback (most recent call last): File "", line 1, in File "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py", line 583, in collect bytesInJa

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-19 Thread Yana Kadiyska
Sean, would this work -- rdd.mapPartitions { partition => Iterator(partition) }.foreach( // Some setup code here // save partition to DB // Some cleanup code here ) I tried a pretty simple example ... I can see that the setup and cleanup are executed on the executor node, once per part

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Davies Liu
On Tue, Aug 19, 2014 at 11:20 AM, Aaron wrote: > Sure thing, this is the stacktrace from pyspark. It's repeated a few times, > but I think this is the unique stuff. > > Traceback (most recent call last): > File "", line 1, in > File > "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spa

EC2 instances missing SSD drives randomly?

2014-08-19 Thread Andras Barjak
Hi, Using the spark 1.0.1 ec2 script I launched 35 m3.2xlarge instances. (I was using Singapore region.) Some of the instances came without the ephemeral internal (non-EBS) SSD devices that are supposed to be connected to them. Some of them have these drives but not all, and there is no sign fro

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-19 Thread Sean Owen
I think you're looking for foreachPartition(). You've kinda hacked it out of mapPartitions(). Your case has a simple solution, yes. After saving to the DB, you know you can close the connection, since you know the use of the connection has definitely just finished. But it's not a simpler solution f
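A hedged sketch of the foreachPartition setup/cleanup pattern discussed here; the JDBC URL and save logic are placeholders:

```scala
import java.sql.DriverManager

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("foreachPartition-sketch"))
val records = sc.parallelize(1 to 100)

records.foreachPartition { partition =>
  // setup: one connection per partition, created on the executor
  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/test")  // illustrative URL
  try {
    partition.foreach { record =>
      // save the record via conn here
    }
  } finally {
    // cleanup: the partition is fully processed, so the connection can be closed
    conn.close()
  }
}
```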

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
These three lines of python code cause the error for me: sc = SparkContext(appName="foo") input = sc.textFile("hdfs://[valid hdfs path]") mappedToLines = input.map(lambda myline: myline.split(",")) The file I'm loading is a simple CSV. -- View this message in context: http://apach

noob: how to extract different members of a VertexRDD

2014-08-19 Thread spr
I'm a Scala / Spark / GraphX newbie, so may be missing something obvious. I have a set of edges that I read into a graph. For an iterative community-detection algorithm, I want to assign each vertex to a community with the name of the vertex. Intuitively it seems like I should be able to pull

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Davies Liu
This script runs very well without your CSV file. Could you download your CSV file to local disk, and narrow down to the lines which trigger this issue? On Tue, Aug 19, 2014 at 12:02 PM, Aaron wrote: > These three lines of python code cause the error for me: > > sc = SparkContext(appName="foo") > in

Re: spark - reading hdfs files every 5 minutes

2014-08-19 Thread salemi
Thank you, but how do you convert the stream to a Parquet file? Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-spark-reading-hfds-files-every-5-minutes-tp12359p12401.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: spark - Identifying and skipping processed data in hdfs

2014-08-19 Thread salemi
I'd like to read those files as they get written, transform the content, and write it out as a Parquet file -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-Identifying-and-skipping-processed-data-in-hdfs-tp12347p12402.html Sent from the Apache Spark User

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
(+user) On Tue, Aug 19, 2014 at 12:05 PM, spr wrote: > I want to assign each vertex to a community with the name of the vertex. As I understand it, you want to set the vertex attributes of a graph to the corresponding vertex ids. You can do this using Graph#mapVertices [1] as follows: val

Re: EC2 instances missing SSD drives randomly?

2014-08-19 Thread Jey Kottalam
I think you have to explicitly list the ephemeral disks in the device map when launching the EC2 instance. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html On Tue, Aug 19, 2014 at 11:54 AM, Andras Barjak wrote: > Hi, > > Using the spark 1.0.1 ec2 script I lau

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread spr
ankurdave wrote > val g = ... > val newG = g.mapVertices((id, attr) => id) > // newG.vertices has type VertexRDD[VertexId], or RDD[(VertexId, > VertexId)] Yes, that worked perfectly. Thanks much. One follow-up question. If I just wanted to get those values into a vanilla variable (n

pyspark/yarn and inconsistent number of executors

2014-08-19 Thread Calvin
I've set up a YARN (Hadoop 2.4.1) cluster with Spark 1.0.1 and I've been seeing some inconsistencies with out of memory errors (java.lang.OutOfMemoryError: unable to create new native thread) when increasing the number of executors for a simple job (wordcount). The general format of my submission

RDD Grouping

2014-08-19 Thread TJ Klein
Hi, is there a way I can group items in an RDD together so that I can process them using parallelize/map? Let's say I have data items with keys 1...1000, e.g. loading RDD = sc.newAPIHadoopFile(...).cache() Now, I would like them to be processed in chunks of e.g. tens: chunk1=[0..9],ch

Re: RDD Grouping

2014-08-19 Thread Sean Owen
groupBy seems to be exactly what you want. val data = sc.parallelize(1 to 200) data.groupBy(_ % 10).values.map(...) This would let you process 10 Iterable[Int] in parallel, each of which is 20 ints in this example. It may not make sense to do this in practice, as you'd be shuffling a lot of data

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
At 2014-08-19 12:47:16 -0700, spr wrote: > One follow-up question. If I just wanted to get those values into a vanilla > variable (not a VertexRDD or Graph or ...) so I could easily look at them in > the REPL, what would I do? Are the aggregate data structures inside the > VertexRDD/Graph/... Ar
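One minimal way to pull those values into a plain local variable for inspection in the REPL (a sketch following on from the mapVertices example above, reasonable only for a small graph):

```scala
// newG.vertices is still an RDD of (VertexId, attribute) pairs, so collect()
// returns an ordinary local Array you can look at directly in the REPL.
val localVertices: Array[(Long, Long)] = newG.vertices.collect()
localVertices.take(5).foreach(println)
```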

Re: Naive Bayes

2014-08-19 Thread Phuoc Do
Hi Xiangrui, Training data: 42945 "s" out of 124659. Test data: 42722 "s" out of 125341. The ratio is very much the same. I tried Decision Tree. It outputs decimals between 0 and 1. I don't quite understand it yet. Would feature scaling make it work for Naive Bayes? Phuoc Do On Tue, Aug 19, 2014 at 12

Only master is really busy at KMeans training

2014-08-19 Thread durin
When trying to use KMeans.train with some large data and 5 worker nodes, it would fail due to BlockManagers shutting down because of a timeout. I was able to prevent that by adding spark.storage.blockManagerSlaveTimeoutMs 300 to the spark-defaults.conf. However, with 1 Million feature vectors, the

saveAsTextFile hangs with hdfs

2014-08-19 Thread David
I have a simple spark job that seems to hang when saving to hdfs. When looking at the spark web ui, the job reached 97 of 100 tasks completed. I need some help determining why the job appears to hang. The job hangs on the "saveAsTextFile()" call. https://www.dropbox.com/s/fdp7ck91hhm9w68/Scree

Re: OpenCV + Spark : Where to put System.loadLibrary ?

2014-08-19 Thread kmatzen
Reviving this thread hoping I might be able to get an exact snippet for the correct way to do this in Scala. I had a solution for OpenCV that I thought was correct, but half the time the library was not loaded by the time it was needed. Keep in mind that I am completely new at Scala, so you're going

Re: Spark RuntimeException due to Unsupported datatype NullType

2014-08-19 Thread Yin Huai
Hi Rafeeq, I think the following part triggered the bug https://issues.apache.org/jira/browse/SPARK-2908. [{*"href":null*,"rel":"me"}] It has been fixed. Can you try spark master and see if the error get resolved? Thanks, Yin On Mon, Aug 11, 2014 at 3:53 AM, rafeeq s wrote: > Hi, > > *Spar

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira tracking this issue. On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo wrote: > Thanks, Zhan for the follow up. > > But, do you know how I am supposed to set that table name on the jobConf? I > don't have access to that object from

Re: spark error when distinct on more than one column

2014-08-19 Thread Yin Huai
Hi, The SQLParser used by SQLContext is pretty limited. Instead, can you try HiveContext? Thanks, Yin On Tue, Aug 19, 2014 at 7:57 AM, wan...@testbird.com wrote: > > sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group > by app_id > > > *Error Log* > > 14/08/19 17:58:26 IN
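A minimal sketch of switching to HiveContext as suggested; in Spark 1.0.x the HiveQL entry point is hql (renamed to sql in later releases):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hivecontext-sketch"))
val hiveContext = new HiveContext(sc)

// HiveQL's parser understands multi-column COUNT(DISTINCT ...),
// which the basic SQLContext parser rejects.
val result = hiveContext.hql(
  "SELECT app_id, COUNT(DISTINCT app_id, macaddr) cut FROM object GROUP BY app_id")
result.collect().foreach(println)
```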

spark streaming - how to prevent that empty dstream to get written out to hdfs

2014-08-19 Thread salemi
Hi All, I have the following code, and if the dstream is empty, Spark Streaming writes empty files to hdfs. How can I prevent it? val ssc = new StreamingContext(sparkConf, Minutes(1)) val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap) dStream.saveAsNewAPIHadoop

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Cesar Arevalo
Thanks! Yeah, it may be related to that. I'll check out that pull request that was sent and hopefully that fixes the issue. I'll let you know, after fighting with this issue yesterday I had decided to just leave it on the side and return to it after, so it may take me a while to get back to you. -

Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
Update: it hangs even when not writing to hdfs. I changed the code to avoid saveAsTextFile() and instead do a foreachPartition and log the results. This time it hangs at 96/100 tasks. I changed the saveAsTextFile to: stringIntegerJavaPairRDD.foreachPartition(p -> {

Re: saveAsTextFile hangs with hdfs

2014-08-19 Thread evadnoob
Not sure if this is helpful or not, but in one executor "stderr" log, I found this: 14/08/19 20:17:04 INFO CacheManager: Partition rdd_5_14 not found, computing it 14/08/19 20:17:04 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/08/1

spark-submit with Yarn

2014-08-19 Thread Arun Ahuja
Is there more documentation on using spark-submit with Yarn? Trying to launch a simple job does not seem to work. My run command is as follows: /opt/cloudera/parcels/CDH/bin/spark-submit \ --master yarn \ --deploy-mode client \ --executor-memory 10g \ --driver-memory 10g \ --

Re: spark-submit with Yarn

2014-08-19 Thread Marcelo Vanzin
On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja wrote: > /opt/cloudera/parcels/CDH/bin/spark-submit \ > --master yarn \ > --deploy-mode client \ This should be enough. > But when I view the job 4040 page, SparkUI, there is a single executor (just > the driver node) and I see the following in

Re: spark-submit with Yarn

2014-08-19 Thread Arun Ahuja
Yes, the application is overwriting it - I need to pass it as an argument to the application, otherwise it will be set to local. Thanks for the quick reply! Also, yes now the appTrackingUrl is set properly as well, before it just said unassigned. Thanks! Arun On Tue, Aug 19, 2014 at 5:47 PM, Marce

Re: RDD Grouping

2014-08-19 Thread TJ Klein
Thanks a lot. Yes, this mapPartitions seems a better way of dealing with this problem as for groupBy() I need to collect() data before applying parallelize(), which is expensive. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Grouping-tp12407p12424.html

Multiple column families vs Multiple tables

2014-08-19 Thread Wei Liu
We are doing schema design for our application, One thing we are not so clear about is multiple column families (more than 3, probably 5 - 8) vs multiple tables. In our use case, we will have the same number of rows in all these column families, but some column families may be modified more often t

Re: sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-19 Thread chutium
It is definitely a bug; sqlContext.parquetFile should take both a dir and a single file as parameter. This if-check for isDir makes no sense after this commit: https://github.com/apache/spark/pull/1370/files#r14967550 I opened a ticket for this issue: https://issues.apache.org/jira/browse/SPARK-3138

Re: spark-submit with Yarn

2014-08-19 Thread Andrew Or
>>> The "--master" should override any other ways of setting the Spark master. Ah yes, actually you can set "spark.master" directly in your application through SparkConf. Thanks Marcelo. 2014-08-19 14:47 GMT-07:00 Marcelo Vanzin : > On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja wrote: > > /opt/c
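A small sketch of setting the master from the application itself via SparkConf, which is what was overriding the spark-submit flag here (the master string is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Values set explicitly on SparkConf in application code take precedence,
// so a hard-coded master (e.g. "local") silently wins over --master yarn.
val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("yarn-client")   // or leave unset and let spark-submit supply --master
val sc = new SparkContext(conf)
```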

Re: Processing multiple files in parallel

2014-08-19 Thread SK
Without the sc.union, my program crashes with the following error: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndInde

Re: Task's "Scheduler Delay" in web ui

2014-08-19 Thread Chris Fregly
"Scheduling Delay" is the time required to assign a task to an available resource. if you're seeing large scheduler delays, this likely means that other jobs/tasks are using up all of the resources. here's some more info on how to setup Fair Scheduling versus the default FIFO Scheduler: https://

First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Haoyuan Li
Hi folks, We've posted the first Tachyon meetup, which will be on August 25th and is hosted by Yahoo! (Limited Space): http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you there! Best, Haoyuan -- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/

Re: First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Christopher Nguyen
Fantastic! Sent while mobile. Pls excuse typos etc. On Aug 19, 2014 4:09 PM, "Haoyuan Li" wrote: > Hi folks, > > We've posted the first Tachyon meetup, which will be on August 25th and is > hosted by Yahoo! (Limited Space): > http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you ther

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Chris Fregly
this would be awesome. did a jira get created for this? I searched, but didn't find one. thanks! -chris On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani wrote: > Thanks a lot Xiangrui. This will help. > > > On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng wrote: > >> Hi Rahul, >> >> We plan to

Decision tree: categorical variables

2014-08-19 Thread Sameer Tilak
Hi All, Is there any example of MLlib decision tree handling categorical variables? My dataset includes few categorical variables (20 out of 100 features) so was interested in knowing how I can use the current version of decision tree implementation to handle this situation? I looked at the Labe

High-Level Implementation Documentation

2014-08-19 Thread Kenny Ballou
Hey all, Other than reading the source (not a bad idea in and of itself; something I will get to soon) I was hoping to find some high-level implementation documentation. Can anyone point me to such a document(s)? Thank you in advance. -Kenny -- :SIG:!0x1066BA71A5F56C58!: -

Re: spark-submit with HA YARN

2014-08-19 Thread Sandy Ryza
Hi Matt, I checked in the YARN code and I don't see any references to yarn.resourcemanager.address. Have you made sure that your YARN client configuration on the node you're launching from contains the right configs? -Sandy On Mon, Aug 18, 2014 at 4:07 PM, Matt Narrell wrote: > Hello, > > I

Re: slower worker node in the cluster

2014-08-19 Thread Chris Fregly
Perhaps creating Fair Scheduler Pools might help? There's no way to pin certain nodes to a pool, but you can specify minShares (CPUs). Not sure if that would help, but worth looking into. On Tue, Jul 8, 2014 at 7:37 PM, haopu wrote: > In a standalone cluster, is there way to specify the sta

Re: Decision tree: categorical variables

2014-08-19 Thread Xiangrui Meng
The categorical features must be encoded into indices starting from 0: 0, 1, ..., numCategories - 1. Then you can provide the categoricalFeatureInfo map to specify which columns contain categorical features and the number of categories in each. Joseph is updating the user guide. But if you want to
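A hedged sketch of the encoding described here, using the MLlib 1.1-era trainClassifier API; the feature indices and category counts are made up for illustration, and sc is an existing SparkContext:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Suppose feature 0 has 3 categories and feature 4 has 5 categories,
// already encoded as 0, 1, ..., numCategories - 1 in the feature vectors.
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 0.5, 1.3, 0.0, 4.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.5, 2.3, 1.0, 1.0))))

// feature index -> number of categories
val categoricalFeaturesInfo = Map(0 -> 3, 4 -> 5)

// numClasses = 2, impurity = "gini", maxDepth = 5, maxBins = 32
// (maxBins must be at least the largest number of categories)
val model = DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo, "gini", 5, 32)
```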

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Xiangrui Meng
No. Please create one but it won't be able to catch the v1.1 train. -Xiangrui On Tue, Aug 19, 2014 at 4:22 PM, Chris Fregly wrote: > this would be awesome. did a jira get created for this? I searched, but > didn't find one. > > thanks! > > -chris > > > On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhoj

Re: Multiple column families vs Multiple tables

2014-08-19 Thread chutium
ö_ö You should send this message to the hbase user list, not the spark user list... but I can give you some personal advice about this: keep column families as few as possible! At least, using some prefix of the column qualifier could also be an idea, but read performance may be worse for your use case like "

Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng
There are only 5 worker nodes, so please try to reduce the number of partitions to the number of available CPU cores. 1000 partitions are too many, because the driver needs to collect the task result from each partition. -Xiangrui On Tue, Aug 19, 2014 at 1:41 PM, durin wrote: > When trying to us
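A minimal sketch of reducing the partition count before training, assuming the feature vectors have already been parsed (the numbers are illustrative):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainWithFewerPartitions(vectors: RDD[Vector]) = {
  // e.g. 5 workers x 8 cores = 40 partitions instead of 1000
  val coalesced = vectors.coalesce(40).cache()
  // k = 100, maxIterations = 20
  KMeans.train(coalesced, 100, 20)
}
```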

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
The ratio should be okay. Could you try to pre-process the data and map -999.0 to 0 before calling NaiveBayes? Btw, I added a check to ensure nonnegative features values: https://github.com/apache/spark/pull/2038 -Xiangrui On Tue, Aug 19, 2014 at 1:39 PM, Phuoc Do wrote: > Hi Xiangrui, > > Train
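A hedged sketch of the -999.0 to 0 preprocessing suggested here, assuming the CSV has already been parsed into LabeledPoints (the variable name training is illustrative):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def replaceMissing(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
  data.map { lp =>
    // the challenge uses -999.0 as a missing-value marker; map it to 0.0 so that
    // all feature values are nonnegative, as NaiveBayes requires
    val cleaned = lp.features.toArray.map(v => if (v == -999.0) 0.0 else v)
    LabeledPoint(lp.label, Vectors.dense(cleaned))
  }

val model = NaiveBayes.train(replaceMissing(training))
```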

Re: Multiple column families vs Multiple tables

2014-08-19 Thread Ted Yu
bq. does not do well with anything above two or three column families Current hbase releases, such as 0.98.x, would do better than the above. 5 column families should be accommodated. Cheers On Tue, Aug 19, 2014 at 3:06 PM, Wei Liu wrote: > We are doing schema design for our application, One

Re: Multiple column families vs Multiple tables

2014-08-19 Thread Wei Liu
Chutium, thanks for your advices. I will check out your links. I sent the email to the wrong email address! Sorry for the spam. Wei On Tue, Aug 19, 2014 at 4:49 PM, chutium wrote: > ö_ö you should send this message to hbase user list, not spark user > list... > > but i can give you some pers

Re: Is hive UDF are supported in HiveContext

2014-08-19 Thread chutium
There is no collect_list in Hive 0.12. Try this after this ticket is done: https://issues.apache.org/jira/browse/SPARK-2706 I am also looking forward to this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-hive-UDF-are-supported-in-HiveContext-tp12310p12

Spark SQL: Caching nested structures extremely slow

2014-08-19 Thread Evan Chan
Hey guys, I'm using Spark 1.0.2 in AWS with 8 x c3.xlarge machines. I am working with a subset of the GDELT dataset (57 columns, > 250 million rows, but my subset is only 4 million) and trying to query it with Spark SQL. Since a CSV importer isn't available, my first thought was to use nested c

Re: OpenCV + Spark : Where to put System.loadLibrary ?

2014-08-19 Thread Tobias Pfeiffer
Hi, please see this post: http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p11009.html Where it says "some setup code here", you could add your code to load the library. Note, however, that this is not once per node, but once per partition, so it might b
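A hedged Scala sketch of a once-per-JVM variant of that pattern: wrapping System.loadLibrary in an object means the load runs the first time any task on that executor touches it, rather than on the driver (the library name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A Scala object's initializer runs at most once per JVM, i.e. once per executor.
object NativeLoader {
  lazy val loaded: Boolean = {
    System.loadLibrary("opencv_java")   // illustrative library name
    true
  }
}

val sc = new SparkContext(new SparkConf().setAppName("native-lib-sketch"))
val images = sc.parallelize(Seq("img1.png", "img2.png"))

val processed = images.mapPartitions { iter =>
  NativeLoader.loaded        // force the load before any native call in this partition
  iter.map { path =>
    // call into the JNI/OpenCV code here
    path
  }
}
processed.count()
```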

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Tobias Pfeiffer
Hi, On Tue, Aug 19, 2014 at 7:01 PM, Patrick McGloin wrote: > > I think the type of the data contained in your RDD needs to be a known > case class and not abstract for createSchemaRDD. This makes sense when > you think it needs to know about the fields in the object to create the > schema. > E

Re: Naive Bayes

2014-08-19 Thread Phuoc Do
I replaced -999.0 with 0. Predictions still have the same label. Maybe negative features really mess it up. On Tue, Aug 19, 2014 at 4:51 PM, Xiangrui Meng wrote: > The ratio should be okay. Could you try to pre-process the data and > map -999.0 to 0 before calling NaiveBayes? Btw, I added a check

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Evan Chan
That might not be enough. Reflection is used to determine what the fields are, thus your class might actually need to have members corresponding to the fields in the table. I heard that a more generic method of inputting stuff is coming. On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer wrote: >

What about implementing various hypothesis test for Logistic Regression in MLlib

2014-08-19 Thread guxiaobo1982
Hi, From the documentation I think only the model fitting part is implemented; what about the various hypothesis tests and performance indexes used to evaluate the model fit? Regards, Xiaobo Gu

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Amit Kumar
Hi Evan, Patrick and Tobias, So, it worked for what I needed it to do. I followed Yana's suggestion of using a parameterized type of [T <: Product:ClassTag:TypeTag]. More concretely, I was trying to make the query process a bit more fluent - some pseudocode but with correct types: val table:SparkTa

Re: OutOfMemory Error

2014-08-19 Thread Ghousia
Hi, Any further info on this? Do you think it would be useful if we had an in-memory buffer implemented that stores the content of the new RDD? In case the buffer reaches a configured threshold, the contents of the buffer are spilled to the local disk. This saves us from the OutOfMemory error. Appreci

JVM heap and native allocation questions

2014-08-19 Thread kmatzen
I'm trying to use Spark to process some data using some native functions I've integrated using JNI, and I pass around a lot of memory I've allocated inside these functions. I'm not very familiar with the JVM, so I have a couple of questions. (1) Performance seemed terrible until I LD_PRELOAD'ed l

Re: What about implementing various hypothesis test for Logistic Regression in MLlib

2014-08-19 Thread Xiangrui Meng
We implemented chi-squared tests in v1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166 and we will add more after v1.1. Feedback on which tests should come first would be greatly appreciated. -Xiangrui On Tue, Aug 19, 2014 at 9:
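A small sketch of the chi-squared test API referenced above (org.apache.spark.mllib.stat.Statistics in Spark 1.1); the numbers are illustrative:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.stat.Statistics

// Goodness-of-fit test of observed frequencies against a uniform expected distribution.
val observed = Vectors.dense(10.0, 20.0, 30.0)
println(Statistics.chiSqTest(observed))   // statistic, degrees of freedom, p-value

// Pearson independence test on a contingency matrix (values in column-major order).
val contingency = Matrices.dense(2, 2, Array(10.0, 20.0, 30.0, 40.0))
println(Statistics.chiSqTest(contingency))
```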

Got NotSerializableException when access broadcast variable

2014-08-19 Thread 田毅
Hi everyone! I got an exception when I run my script with spark-shell: I added SPARK_JAVA_OPTS="-Dsun.io.serialization.extendedDebugInfo=true" in spark-env.sh to show the following stack: org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.

Re: Performance problem on collect

2014-08-19 Thread Emmanuel Castanier
It did the job. Thanks. :) Le 19 août 2014 à 10:20, Sean Owen a écrit : > In that case, why not collectAsMap() and have the whole result as a > simple Map in memory? then lookups are trivial. RDDs aren't > distributed maps. > > On Tue, Aug 19, 2014 at 9:17 AM, Emmanuel Castanier > wrote: >> Th