Hi,
I tried to develop a RandomForest model for my data in PySpark as follows:
from pyspark.mllib.tree import RandomForest  # import needed for the call below

rf_model = RandomForest.trainClassifier(train_idf, 2, {},
numTrees=15, seed=144)
print "RF: Num trees = %d, Num nodes = %d\n" % (rf_model.numTrees(),
rf_model.totalNumNodes())
pred_label = test_id
Hi,
I need to access data on S3 from another account and I have been given the
IAM role information to access that S3 bucket. From what I understand, AWS
allows us to attach a role to a resource at the time it is created. However,
I don't see an option for specifying the role using the spark_ec2.p
Hi,
I need to parse a json input file where the nested objects take on a
different structure based on the typeId field, as follows:
{ "d":
{ "uid" : "12345"
"contents": [{"info": {"eventId": "event1"}, "typeId": 19}]
}
}
{ "d":
{ "uid" : "56780"
Hi,
histogram() returns an object that is a pair of Arrays. There appears to be
no saveAsTextFile() for this paired object.
Currently I am using the following to save the output to a file:
val hist = a.histogram(10)
val arr1 = sc.parallelize(hist._1).saveAsTextFile("file1")
val arr2 = sc.parall
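One way to save both arrays together (a minimal sketch; "histogram_out" is a placeholder path) is to pair each bucket's bounds with its count and save a single RDD:
val hist = a.histogram(10)                                        // (bucket boundaries, counts)
val buckets = hist._1.sliding(2).map(b => (b(0), b(1))).toArray   // (lower, upper) per bucket
val lines = buckets.zip(hist._2).map { case ((lo, hi), count) => s"$lo,$hi,$count" }
sc.parallelize(lines, 1).saveAsTextFile("histogram_out")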
Hi,
I am trying to get the final cluster centers after running the KMeans
algorithm in MLlib in order to characterize the clusters. But the
KMeansModel does not have any public method to retrieve this info. There
appears to be only a private method called clusterCentersWithNorm. I guess
I could c
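For reference, a minimal sketch assuming a Spark version where the centers are exposed as the public clusterCenters field on KMeansModel (k and the iteration count are placeholders):
import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(data, 3, 20)   // data: RDD[Vector], an assumption
model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"cluster $i center: $center")
}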
Hi,
I have a dataset in csv format and I am trying to standardize the features
before using k-means clustering. The data does not have any labels but has
the following format:
s1, f12,f13,...
s2, f21,f22,...
where s is a string id, and f is a floating point feature value.
To perform feature stan
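A minimal sketch, assuming Spark 1.1+ where mllib.feature.StandardScaler is available; the file name is a placeholder and the string id is kept alongside the scaled vector:
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

val parsed = sc.textFile("features.csv").map { line =>
  val parts = line.split(",").map(_.trim)
  (parts.head, Vectors.dense(parts.tail.map(_.toDouble)))   // (id, feature vector)
}
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsed.values)
val standardized = parsed.mapValues(v => scaler.transform(v))   // ready for KMeans on standardized.values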
Hi,
I am using the filter() method to separate the RDDs based on a predicate, as
follows:
val rdd1 = data.filter (t => { t._2 >0.0 && t._2 <= 1.0}) // t._2 is a
Double
val rdd2 = data.filter (t => { t._2 >1.0 && t._2 <= 4.0})
val rdd3 = data.filter (t => { t._2 >0.0 && t._2 <= 4.0}) // this sho
Hi,
I am trying to port some code that was working in Spark 1.2.0 on the latest
version, Spark 1.3.0. This code involves a left outer join between two
SchemaRDDs which I am now trying to change to a left outer join between 2
DataFrames. I followed the example for left outer join of DataFrame at
h
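For what it's worth, a minimal sketch of the 1.3.0 DataFrame syntax (df1, df2 and the column name "id" are placeholders for your own DataFrames and join key):
val joined = df1.join(df2, df1("id") === df2("id"), "left_outer")
joined.show()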
The following statement appears in the Scala API example at
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
people.filter("age" > 30).
I tried this example and it gave a compilation error. I think this needs to
be changed to people.filter(people("age") >
I tried to use "sbt/sbt assembly" to build spark-1.0.0. I get the a lot of
Server access error: Connection refused
errors when it tries to download from repo.eclipse.org and
repository,jboss.org. I tried to navigate to these links manually and some
of these links are obsolete (Error 404). Conside
After doing a groupBy operation, I have the following result:
val res =
("ID1",ArrayBuffer((145804601,"ID1","japan")))
("ID3",ArrayBuffer((145865080,"ID3","canada"),
(145899640,"ID3","china")))
("ID2",ArrayBuffer((145752760,"ID2","usa"),
(145934200,"ID2","usa")))
Now I need
Great, thanks!
My output is a set of tuples and when I output it using saveAsTextFile, my
file looks as follows:
(field1_tup1, field2_tup1, field3_tup1,...)
(field1_tup2, field2_tup2, field3_tup2,...)
In Spark, is there some way I can simply have it output in CSV format as
follows (i.e. without the parentheses)?
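One way to do this (a minimal sketch; results and the output path are placeholders) is to format each tuple as a comma-separated line before saving:
val csvLines = results.map(t => t.productIterator.mkString(","))
csvLines.saveAsTextFile("output_csv")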
I have the following piece of code that parses a json file and extracts the
age and TypeID
// assuming the json4s jackson backend supplies parse() below
import org.json4s._
import org.json4s.jackson.JsonMethods._

val p = sc.textFile(log_file)
.map(line => { parse(line) })
.map(json =>
{ val v1 = json \ "person" \ "age"
val v2 = js
Hi,
When we have multiple runs of a program writing to the same output file, the
execution fails if the output directory already exists from a previous run.
Is there some way we can have it overwrite the existing directory, so that
we don't have to manually delete it after each run?
Thanks for you
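One workaround sketch, assuming the output goes to a Hadoop-compatible filesystem (rdd and "output_dir" are placeholders): delete the directory up front so reruns do not fail.
import org.apache.hadoop.fs.{FileSystem, Path}

val outPath = new Path("output_dir")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(outPath)) fs.delete(outPath, true)   // recursive delete
rdd.saveAsTextFile(outPath.toString)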
Hi,
I want to join two RDDs on specific fields.
The first RDD is a set of tuples of the form: (ID, ACTION, TIMESTAMP,
LOCATION)
The second RDD is a set of tuples of the form: (ID, TIMESTAMP).
rdd2 is a subset of rdd1. ID is a string. I want to join the two so that I
can get the location correspo
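One way to do this (a minimal sketch assuming the tuple layouts above): key both RDDs by the (ID, TIMESTAMP) pair and join.
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

val keyed1 = rdd1.map { case (id, action, ts, loc) => ((id, ts), loc) }
val keyed2 = rdd2.map { case (id, ts) => ((id, ts), ()) }
val locations = keyed1.join(keyed2).map { case ((id, ts), (loc, _)) => (id, ts, loc) }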
This issue is resolved.
I used groupBy to create the keys for both RDDs. Then I did the join.
I think, though, it would be useful if in the future Spark could allow us to
specify the fields on which to join, even when the keys are different.
Scalding allows this feature.
Hi,
I have looked through some of the test examples and also the brief
documentation on unit testing at
http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but
still don't have a good understanding of writing unit tests using the Spark
framework. Previously, I have written uni
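A minimal sketch of one common pattern, assuming ScalaTest is on the test classpath (the suite contents are a self-contained placeholder rather than your own pipeline): create a local SparkContext, run the computation, and compare collected results.
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite {
  test("word count produces expected totals") {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
    try {
      val counts = sc.parallelize(Seq("a b", "a"))
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts === Map("a" -> 2, "b" -> 1))
    } finally {
      sc.stop()   // always release the context so later suites can create their own
    }
  }
}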
Hi,
I have a List[ (String, Int, Int) ] that I would like to convert to an RDD.
I tried to use sc.parallelize and sc.makeRDD, but in each case the original
order of items in the List gets modified. Is there a simple way to convert a
List to RDD without using SparkContext?
thanks
Thanks. But that did not work.
Hi,
My unit test is failing (the output is not matching the expected output). I
would like to print out the value of the output. But
rdd.foreach(r=>println(r)) does not work from the unit test. How can I print
or write out the output to a file/screen?
thanks.
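foreach(println) runs on the executors, so its output never reaches the test's console. A minimal sketch: bring the data back to the driver first.
rdd.collect().foreach(println)   // or rdd.take(20).foreach(println) for a sample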
Hi,
I have a Spark method that returns RDD[String], which I am converting to a
set and then comparing it to the expected output as shown in the following
code.
1. val expected_res = Set("ID1", "ID2", "ID3") // expected output
2. val result:RDD[String] = getData(input) //method returns RDD[Str
In Line 1, I have expected_res as a set of strings with quotes. So I thought
it would include the quotes during comparison.
Anyway I modified expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and
that seems to work.
thanks.
Hi,
I have 3 unit tests (independent of each other) in the /src/test/scala
folder. When I run each of them individually using: sbt "test-only ",
all 3 pass. But when I run them all using "sbt test", they
fail with the warning below. I am wondering if the binding exception results
I am using Spark 1.0.0. I am able to successfully run "sbt package".
However, when I run "sbt test" or "sbt test-only ",
I get the following error:
[error] error while loading , zip file is empty
scala.reflect.internal.MissingRequirementError: object scala.runtime in
compiler mirror not found.
T
Hi,
I tried to develop some code to use Logistic Regression, following the code
in BinaryClassification.scala in examples/mllib. My code compiles, but at
runtime complains that scopt/OptionParser class cannot be found. I have the
following import statement in my code:
import scopt.OptionParser
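For reference, a sketch of the sbt dependency (the version shown is an assumption based on what the Spark 1.x examples used); note that scopt also has to end up on the runtime classpath, e.g. by packaging the application with sbt assembly:
libraryDependencies += "com.github.scopt" %% "scopt" % "3.2.0"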
Hi,
I have a csv data file, which I have organized in the following format to
be read as a LabeledPoint (following the example in
mllib/data/sample_tree_data.csv):
1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2
The first column is the binary label (1 or 0) and the remainin
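A minimal parsing sketch for this layout ("data.csv" is a placeholder path):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val points = sc.textFile("data.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))   // label first, then features
}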
Hi,
I have a small dataset (120 training points, 30 test points) that I am
trying to classify into binary classes (1 or 0). The dataset has 4 numerical
features and 1 binary label (1 or 0).
I used LogisticRegression and SVM in MLLib and I got 100% accuracy in both
cases. But when I used Decision
Hi,
I tried out the streaming program on the Spark training web page. I created
a Twitter app as per the instructions (pointing to http://www.twitter.com).
When I run the program, my credentials get printed out correctly but
thereafter, my program just keeps waiting. It does not print out the hash
I don't get any exceptions or error messages.
I tried it both with and without VPN and had the same outcome. But I can
try again without VPN later today and report back.
thanks.
I don't have a proxy server.
thanks.
Hi,
I need to perform binary classification on an image dataset. Each image is a
data point described by a Json object. The feature set for each image is a
set of feature vectors, each feature vector corresponding to a distinct
object in the image. For example, if an image has 5 objects, its featu
Hi,
I have a json file where the definition of each object spans multiple lines.
An example of one object definition appears below.
{
"name": "16287e9cdf",
"width": 500,
"height": 325,
"width": 1024,
"height": 665,
"obj": [
{
"x": 395.08,
"y": 82.09,
Hi,
I am using Spark 1.0.1. I am using the following piece of code to parse a
json file. It is based on the code snippet in the SparkSQL programming
guide. However, it fails at runtime with an error stating:
java.lang.NoSuchMethodError:
org.apache.spark.sql.SQLContext.jsonRDD(Lorg/apache/spark/rdd/R
I am running this in standalone mode on a single machine. I built the spark
jar from scratch (sbt assembly) and then included that in my application
(the same process I have done for earlier versions).
thanks.
The problem is resolved. Thanks.
Hi,
I have a json file where the object definition in each line includes an
array component "obj" that contains 0 or more elements as shown by the
example below.
{"name": "16287e9cdf", "obj": [{"min": 50,"max": 59 }, {"min": 20, "max":
29}]},
{"name": "17087e9cdf", "obj": [{"min": 30,"max": 3
To add to my previous post, the error at runtime is the following:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception
failure in TID 0 on host localhost: org.json4s.package$MappingException:
Expect
Hi,
The mllib.clustering.kmeans implementation supports a random or parallel
initialization mode to pick the initial centers. Is there a way to specify
the initial centers explicitly? It would be useful to have a setCenters()
method where we can explicitly specify the initial centers. (For e.g. R
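MLlib of this era does not expose a setter for the initial centers, but one workaround sketch (assuming the public KMeansModel constructor) is to build a model directly from hand-picked centers and use it for assignment or as a comparison baseline:
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

val centers = Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0))   // placeholder centers
val seededModel = new KMeansModel(centers)
val assignments = data.map(p => seededModel.predict(p))                 // data: RDD[Vector], an assumption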
Yes, the output is continuous. So I used a threshold to get binary labels.
If prediction < threshold, then class is 0 else 1. I use this binary label
to then compute the accuracy. Even with this binary transformation, the
accuracy with decision tree model is low compared to LR or SVM (for the
spec
Hi,
In order to evaluate the ML classification accuracy, I am zipping up the
prediction and test labels as follows and then comparing the pairs in
predictionAndLabel:
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
However, I am fi
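For reference, a minimal accuracy sketch over the zipped pairs:
val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count.toDouble / test.count()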
I am using 1.0.1 and I am running locally (I am not providing any master
URL). But the zip() does not produce the correct count as I mentioned above.
So not sure if the issue has been fixed in 1.0.1. However, instead of using
zip, I am now using the code that Sean has mentioned and am getting the
c
I have also used the LabeledPoint or libSVM format (for sparse data) for
DecisionTree. When I had categorical labels (not features), I mapped the
categories to numerical data as part of the data transformation step (i.e.
before creating the LabeledPoint).
Hi,
I want to extract the individual elements of a feature vector that is part
of a LabeledPoint. I tried the following:
data.features._1
data.features(1)
data.features.map(_.1)
data is a LabeledPoint with a feature vector containing 3 features. All of
the above resulted in compilation errors
I am using 1.0.1. It does not matter to me whether it is the first or second
element. I would like to know how to extract the i-th element in the feature
vector (not the label).
data.features(i) gives the following error:
method apply in trait Vector cannot be accessed in
org.apache.spark.mllib.l
Hi,
I upgraded to 1.0.1 from 1.0 a couple of weeks ago and have been able to use
some of the features advertised in 1.0.1. However, I get some compilation
errors in some cases, and based on user responses these errors have been
addressed in the 1.0.1 version, so I should not be getting these er
Hi,
So I again ran "sbt clean" followed by all of the steps listed above to
rebuild the jars after cleaning. My compilation error still persists.
Specifically, I am trying to extract an element from the feature vector that
is part of a LabeledPoint as follows:
data.features(i)
This gives the fo
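One workaround sketch: Vector.toArray is public even where apply is not, so the i-th element can be read as:
val ithFeature = data.features.toArray(i)   // data: LabeledPoint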
1) How is the minPartitions parameter in NaiveBayes example used? What is
the default value?
2) Why is the numFeatures specified as a parameter? Can this not be
obtained from the data? This parameter is not specified for the other MLlib
algorithms.
thanks
Hi,
I tried different regularization parameter values with Logistic Regression
for binary classification of my dataset and would like to understand the
following results:
regType = L2, regParam = 0.0 , I am getting AUC = 0.80 and accuracy of 80%
regType = L1, regParam = 0.0 , I am getting AUC =
I followed the example in
examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala.
In this file, Params is defined as follows:
case class Params (
input: String = null,
minPartitions: Int = 0,
numFeatures: Int = -1,
lambda: Double = 1.0)
In the main functio
Hi,
I am following the code in
examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
For setting the parameters and parsing the command line options, I am just
reusing that code. Params is defined as follows.
case class Params(
input: String = null,
numIt
Spark 1.0.1
thanks
Ok, thanks for clarifying. So it looks like numFeatures is only relevant for the
libSVM format. I am using LabeledPoint, so if the data is not sparse, perhaps
numFeatures is not required. I thought that the Params class defines all
the parameters passed to the ML algorithm. But it looks like it also
include
What is the definition of regParam and what is the range of values it is
allowed to take?
thanks
Hi,
According to the MLlib guide, there seems to be support for different loss
functions. But I could not find a command-line parameter to choose the loss
function; I only found regType to choose the regularization. Does MLlib
support a parameter to choose the loss function?
thanks
I was using sbt package when I got this error. Then I switched to using sbt
assembly and that solved the issue. To run "sbt assembly", you need to have
a file called plugins.sbt in the project's "project/" directory, and it
has the following line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
T
Hi,
Thanks for the reference to the LBFGS optimizer.
I tried to use the LBFGS optimizer, but I am not able to pass it as an
input to the LogisticRegression model for binary classification. After
studying the code in mllib/classification/LogisticRegression.scala, it
appears that the only impleme
Hi,
I am using Spark 1.0.1. But I am still not able to see the stats for
completed apps on port 4040 - only for running apps. Is this feature
supported or is there a way to log this info to some file? I am interested
in stats about the total # of executors, total runtime, and total memory
used by
I set "spark.eventLog.enabled" to true in
$SPARK_HOME/conf/spark-defaults.conf and also configured the logging to a
file as well as console in log4j.properties. But I am not able to get the
log of the statistics in a file. On the console there are a lot of log
messages along with the stats - so ha
More specifically, as indicated by Patrick above, in 1.0+, apps will have
persistent state so that the UI can be reloaded. Is there a way to enable
this feature in 1.0.1?
thanks
Hi,
Ok, I was specifying --master local. I changed that to --master
spark://:7077 and am now able to see the completed
applications. It provides summary stats about runtime and memory usage,
which is sufficient for me at this time.
However it doesn't seem to archive the info in the "application
Hi,
I have a piece of code in which the result of a groupByKey operation is as
follows:
(2013-04, ArrayBuffer(s1, s2, s3, s1, s2, s4))
The first element is a String value representing a date and the ArrayBuffer
consists of (non-unique) strings. I want to extract the unique elements of
the Array
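One way to do this (a minimal sketch; grouped stands for the RDD shown above) is to deduplicate each group's values:
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

val uniquePerKey = grouped.mapValues(_.toSeq.distinct)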
Hi,
I have a piece of code that reads all the (csv) files in a folder. For each
file, it parses each line, extracts the first 2 elements from each row of
the file, groups the tuple by the key and finally outputs the number of
unique values for each key.
val conf = new SparkConf().setAppNam
Without the sc.union, my program crashes with the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Master removed our application: FAILED at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndInde
Hi,
I am able to access the Application details web page from the master UI page
when I run Spark in standalone mode on my local machine. However, I am not
able to access it when I run Spark on our private cluster. The Spark master
runs on one of the nodes in the cluster. I am able to access the
I have already tried setting the history server and accessing it on
:18080 as per the link. But the page does not list any completed
applications. As I mentioned in my previous mail, I am running Spark in
standalone mode on the cluster (as well as on my local machine). According
to the link, it ap
Hi,
I have the following piece of code that I am running on a cluster with 10
nodes with 2GB memory per node. The tasks seem to complete, but at the point
where it is generating output (saveAsTextFile), the program freezes after
some time and reports an out of memory error (error transcript attach
I was able to recently solve this problem for standalone mode. For this mode,
I did not use a history server. Instead, I set spark.eventLog.dir (in
conf/spark-defaults.conf) to a directory in hdfs (basically this directory
should be in a place that is writable by the master and accessible globally
Hi,
Thanks for the response. I tried to use countByKey. But I am not able to
write the output to console or to a file. Neither collect() nor
saveAsTextFile() work for the Map object that is generated after
countByKey().
val x = sc.textFile(baseFile).map { line =>
val field
Hi,
How do I convert a Map object to an RDD so that I can use the
saveAsTextFile() operation to output the Map object?
thanks
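A minimal sketch ("counts_out" is a placeholder path, and x stands for the pair RDD built above): countByKey returns a plain Map on the driver, so parallelize it back into an RDD before saving.
val counts = x.countByKey()   // Map on the driver
sc.parallelize(counts.toSeq).saveAsTextFile("counts_out")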
Hi,
I am using a cluster where each node has 16GB (this is the executor memory).
After I complete an MLlib job, the executor tab shows the following:
Memory: 142.6 KB Used (95.5 GB Total)
and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB
(this is different for differe
Hi,
Thanks for the responses. I understand that the second values in the Memory
Used column for the executors add up to 95.5 GB and the first values add up
to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is
17.3 KB? Is that the memory used for shuffling operations? For non
Hi,
I am having the same problem reported by Michael. I am trying to open 30
files. ulimit -n shows the limit is 1024. So I am not sure why the program
is failing with "Too many open files" error. The total size of all the 30
files is 230 GB.
I am running the job on a cluster with 10 nodes, eac
Hi,
I evaluated the runtime performance of some of the MLlib classification
algorithms on a local machine and a cluster with 10 nodes. I used standalone
mode and Spark 1.0.1 in both cases. Here are the results for the total
runtime:
Local Cluster
Logi
Num iterations: For LR and SVM, I am using the default value of 100. For all
the other parameters I am also using the default values. I am pretty much
reusing the code from BinaryClassification.scala. For Decision Tree, I don't
see any parameter for the number of iterations in the example code, so I did
The dataset is quite small : 5.6 KB. It has 200 rows and 3 features, and 1
column of labels. From this dataset, I split 80% for training set and 20%
for test set. The features are integer counts and labels are binary (1/0).
thanks
Hi,
I am running Spark 1.0.2 on a cluster in Mesos mode. I am not able to access
the Spark master Web UI at port 8080 but am able to access it at port 5050.
Is 5050 the standard port?
Also, in the standalone mode, there is a link to the Application detail
UI directly from the master UI. I do
Hi,
I am using Spark 1.0.2 on a mesos cluster. After I run my job, when I try to
look at the detailed application stats using a history server@18080, the
stats don't show up for some of the jobs even though the job completed
successfully and the event logs are written to the log folder. The log f
Hi,
I am using the Spark 1.1.0 version that was released yesterday. I recompiled
my program to use the latest version using "sbt assembly" after modifying
.sbt to use the 1.1.0 version. The compilation goes through and the
jar is built. When I run the jar using spark-submit, I get an error: "Canno
This issue is resolved. Looks like in the new spark-submit, the jar path has
to be at the end of the options. Earlier I could specify this path in any
order on the command line.
thanks
Hi,
The default log files for the Mllib examples use a rather long naming
convention that includes special characters like parentheses and commas. For
example, one of my log files is named
"binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032".
When I click on the program on the hi
Hi,
I have a program similar to the BinaryClassifier example that I am running
using my data (which is fairly small). I run this for 100 iterations. I
observed the following performance:
Standalone mode cluster with 10 nodes (with Spark 1.0.2): 5 minutes
Standalone mode cluster with 10 nodes (wi
I have created JIRA ticket 3610 for the issue. thanks.
Hi,
I am using the latest release Spark 1.1.0. I am trying to build the
streaming examples (under examples/streaming) as a standalone project with
the following streaming.sbt file. When I run sbt assembly, I get an error
stating that object algebird is not a member of package com.twitter. I
tried
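For reference, a sketch of the missing sbt dependency (the version is an assumption based on what the Spark examples of that era used):
libraryDependencies += "com.twitter" %% "algebird-core" % "0.1.11"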
Hi,
I tried running the HdfsWordCount program in the streaming examples in Spark
1.1.0. I provided a directory in the distributed filesystem as input. This
directory has one text file. However, the only thing that the program keeps
printing is the time - but not the word count. I have not used the
This issue is resolved. The file needs to be created after the program has
started to execute.
thanks
Hi,
I tried out the HdfsWordCount program in the Streaming module on a cluster.
Based on the output, I find that it counts only a few of the words. How can
I have it count all the words in the text? I have only one text file in the
directory.
thanks
I execute it as follows:
$SPARK_HOME/bin/spark-submit --master --class
org.apache.spark.examples.streaming.HdfsWordCount
target/scala-2.10/spark_stream_examples-assembly-1.0.jar
After I start the job, I add a new test file in hdfsdir. It is a large text
file which I will not be able to c
Hi,
I am using Spark 1.1.0 on a cluster. My job takes as input 30 files in a
directory (I am using sc.textFile("dir/*")) to read in the files. I am
getting the following warning:
WARN TaskSetManager: Lost task 99.0 in stage 1.0 (TID 99,
mesos12-dev.sccps.net): java.io.FileNotFoundException:
/t
Hi,
I am trying to compute the number of unique users from a year's worth of
data. So there are about 300 files and each file is quite large (~GB). I
first tried this without a loop by reading all the files in the directory
using the glob pattern: sc.textFile("dir/*"). But the tasks were stalli
Hi,
I am using spark v 1.1.0. The default value of spark.cleaner.ttl is infinite
as per the online docs. Since a lot of shuffle files are generated in
/tmp/spark-local* and the disk is running out of space, we tested with a
smaller value of ttl. However, even when the job has completed and the timer
Hi,
Currently the history server provides application details for only the
successfully completed jobs (where the APPLICATION_COMPLETE file is
generated). However, (long-running) jobs that we terminate manually or
failed jobs where the APPLICATION_COMPLETE file may not be generated, don't show
up on th
Hi,
I am trying to extract the number of distinct users from a file using Spark
SQL, but I am getting the following error:
ERROR Executor: Exception in task 1.0 in stage 8.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 1
I am following the code in examples/sql/RDDRelation.scala. My co
Thanks for the help. Yes, I did not realize that the first header line has a
different separator.
By the way, is there a way to drop the first line that contains the header?
Something along the following lines:
sc.textFile(inp_file)
.drop(1) // or tail() to drop the header line
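RDDs have no drop(), but one common workaround sketch is to drop the first line of the first partition only, which is where the header lands for a single input file:
val noHeader = sc.textFile(inp_file).mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}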
- We set ulimit to 50. But I still get the same "too many open files"
warning.
- I tried setting consolidateFiles to True, but that did not help either.
I am using a Mesos cluster. Does Mesos have any limit on the number of
open files?
thanks
Hi,
I am using Spark 1.1.0. Is there a way to get the complete tweets
corresponding to a handle (for e.g. @Delta)? I tried using the following
example that extracts just the hashtags and replaced the "#" with "@" as
follows. I need the complete tweet and not just the tags.
// val hashTag
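A minimal sketch (the handle and the stream setup are assumptions): keep the full tweet text and filter on it, rather than splitting into hashtags. Note that the streaming receiver only delivers tweets that arrive while the stream is running, not past tweets.
import org.apache.spark.streaming.twitter.TwitterUtils

val tweets = TwitterUtils.createStream(ssc, None)   // ssc: StreamingContext, an assumption
val deltaTweets = tweets.map(_.getText).filter(_.contains("@Delta"))
deltaTweets.print()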
Thanks. I made the change and ran the code. But I don't get any tweets for my
handle, although I do see the tweets when I search for it on twitter. Does
Spark allow us to get tweets from the past (say the last 100 tweets, or
tweets that appeared in the last 10 minutes)?
thanks
Hi,
I am trying to implement simple sentiment analysis of Twitter streams in
Spark/Scala. I am getting an exception and it appears when I combine
SparkContext with StreamingContext in the same program. When I read the
positive and negative words using only SparkContext.textFile (without
creating
You are right. Creating the StreamingContext from the SparkContext instead of
SparkConf helped. Thanks for the help.
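For reference, a minimal sketch of that fix (the batch interval is a placeholder):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))   // reuse the existing SparkContext instead of a new SparkConf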