jsonFile function in SQLContext does not work

2014-06-25 Thread durin
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23). I'm trying to execute the following code:

  import org.apache.spark.SparkContext._
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val table = sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
(Thread.java:662) Driver stacktrace: Is the only possible reason that some of these 4.3 million JSON objects are not valid JSON, or could there be another explanation? And if that is the reason, is there some way to tell the function to just skip faulty lines? Thanks, Durin

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Yin and Aaron, thanks for your help, this was indeed the problem. I've counted 1233 blank lines using grep, and the code snippet below works with those. From what you said, I guess that skipping faulty lines will be possible in later versions? Kind regards, Simon
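
A minimal sketch of the kind of workaround discussed in this thread, filtering out blank lines before parsing; it assumes the jsonRDD method from the same snapshot build, and the HDFS path is the one from the original question:

  // Skip blank lines that would otherwise break the per-line JSON parsing.
  val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
  val nonEmpty = raw.filter(_.trim.nonEmpty)
  val table = sqlContext.jsonRDD(nonEmpty)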

LIMIT with offset in SQL queries

2014-07-02 Thread durin
Hi, in many SQL-DBMS like MySQL, you can set an offset for the LIMIT clause, s.t. LIMIT 5, 10 will return 10 rows, starting from row 5. As far as I can see, this is not possible in Spark-SQL. The best solution I have to imitate that (using Scala) is converting the RDD into an Array via collect
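
One hedged way to imitate an offset without collecting the whole RDD to the driver is zipWithIndex, assuming the row order of the underlying RDD is acceptable (the table name below is hypothetical):

  // Emulate "LIMIT 5, 10": skip the first 5 rows, then take the next 10.
  val rows = sqlContext.sql("SELECT name FROM people")
  val window = rows.zipWithIndex
    .filter { case (_, i) => i >= 5 && i < 15 }
    .map { case (row, _) => row }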

KMeans for large training data

2014-07-11 Thread durin
Hi, I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic clustering with Strings. My code works great when I use a five-figure number of training elements. However, with, for example, 2 million elements, it gets extremely slow. A single stage may take up to 30 minutes.

Re: KMeans for large training data

2014-07-11 Thread durin
Hi Sean, thanks for your reply. How would you get more partitions? I ran broadcastVector.value.repartition(5), but broadcastVector.value.partitions.size is still 1 and no change to the behavior is visible. Also, I noticed this: First of all, there is a gap of almost two minutes between the third
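
One thing worth noting about the snippet above: repartition returns a new RDD rather than modifying the receiver in place, which would explain why partitions.size stays at 1. A sketch, with trainingData standing in for the actual RDD of vectors:

  // repartition() is not in-place; the result must be captured and reused.
  val repartitioned = trainingData.repartition(80) // illustrative: a few partitions per core
  repartitioned.cache()
  println(repartitioned.partitions.size) // now 80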

Re: KMeans for large training data

2014-07-12 Thread durin
Thanks, setting the number of partitions to the number of executors helped a lot and training with 20k entries got a lot faster. However, when I tried training with 1M entries, after about 45 minutes of calculations, I get this: It's stuck at this point. The CPU load for the master is at 100% (

Re: KMeans for large training data

2014-07-12 Thread durin
Your latest response doesn't show up here yet; I only got the mail. I'll still answer here in the hope that it appears later: Which memory setting do you mean? I can go up with spark.executor.memory a bit; it's currently set to 12G. But that's already way more than the whole SchemaRDD of Vectors th

import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
I'm using Spark > 1.0.0 (a three-week-old build of the latest master). Along the lines of this tutorial, I want to read some tweets from Twitter. When trying to execute it in the Spark shell, I get an error. The tutorial

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
Thanks. Can I see that a class is not available in the shell somewhere in the API docs, or do I have to find out by trial and error?

KMeans: expensiveness of large vectors

2014-07-24 Thread durin
As a source, I have a text file with n rows that each contain m comma-separated integers. Each row is then converted into a feature vector with m features. I've noticed that, given the same total file size and number of features, a larger number of columns is much more expensive for training a
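
For concreteness, a sketch of the setup described above, using MLlib's dense vectors (the path is hypothetical; one row of m comma-separated integers becomes one m-dimensional feature vector):

  import org.apache.spark.mllib.linalg.Vectors
  val data = sc.textFile("hdfs://host:9100/user/myuser/features.txt")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))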

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, thanks for the explanation. 1. You said we have to broadcast m * k centers (with m = number of rows). I thought there were only k centers at any time, which would then have size n * k and would need to be broadcast. Is that a typo or did I understand something wrong? And the collecti

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes in a reasonable time. I guess using torrent helps a lot in this case. Best regards, Simon

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here, that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
As suggested here, I want to create a minimal project using sbt to be able to use org.apache.spark.streaming.twitter in the shell. My Spark version is the latest master branch

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
Hi Tathagata, I didn't mean to say this was an error. According to the other thread I linked, right now there shouldn't be any conflicts, so I wanted to use streaming in the shell for easy testing. I thought I had to create my own project in which I'd add streaming as a dependency, but if I can a

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
I've added the following to my spark-env.sh: I can now execute without an error in the shell. However, I will get an error when doing this: What am I missing? Do I have to import another jar?

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
Hi Tathagata, I was using the "raw" tag in the web editor. Seems like this doesn't make it into the mail. Here's the message again, this time without those tags: I've added the following to my spark-env.sh:

  SPARK_CLASSPATH="/disk.b/spark-master-2014-07-28/external/twitter/target/spark-streamin

Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
I am using the latest Spark master and, additionally, I am loading these jars:
- spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar
- twitter4j-core-4.0.2.jar
- twitter4j-stream-4.0.2.jar
My simple test program that I execute in the shell looks as follows:

  import org.apache.spark.streaming._
  impo
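
Since the program itself is cut off above, here is a hedged reconstruction of a minimal test along those lines, using the TwitterUtils API from the external twitter module of that era (OAuth credentials assumed to be configured via twitter4j system properties):

  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.twitter._

  val ssc = new StreamingContext(sc, Seconds(10))
  val tweets = TwitterUtils.createStream(ssc, None) // None: read auth from system properties
  tweets.map(_.getText).print()
  ssc.start()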

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
Using 3.0.3 (downloaded from http://mvnrepository.com/artifact/org.twitter4j) changes the error to

  Exception in thread "Thread-55" java.lang.NoClassDefFoundError: twitter4j/StatusListener
  at org.apache.spark.streaming.twitter.TwitterInputDStream.getReceiver(TwitterInputDStream.scala:55)

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
In the WebUI "Environment" tab, the section "Classpath Entries" lists the following ones as part of the System Classpath:
- /foo/hadoop-2.0.0-cdh4.5.0/etc/hadoop
- /foo/spark-master-2014-07-28/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-cdh4.5.0.jar
- /foo/spark-master-2014-07-28/co

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Hi Tathagata, I got rid of parts of this problem today. It was embarrassingly simple: somehow, my synchronisation script had failed and one of the two jars wasn't sent out to all slave nodes. The Java error disappears; however, I still don't receive any tweets. This might well be firewall related

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Update: I can get it to work by disabling iptables temporarily. However, I cannot figure out on which port I have to accept traffic. 4040 and any of the Master or Worker ports mentioned in the previous post don't work. Can it be one of the randomly assigned ones in the 30k to 60k range? Those ap
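
If pinning ports is the goal, some of the otherwise random ports can be fixed via configuration so that iptables rules can match them. A sketch, under the assumption that these Spark 1.x properties are present in this build (the port number is illustrative):

  // Fix the driver port instead of letting it be assigned randomly;
  // spark.driver.port is documented on the Spark configuration page.
  val conf = new org.apache.spark.SparkConf()
    .set("spark.driver.port", "51000")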

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
among the nodes in your cluster. -Andrew 2014-08-06 10:23 GMT-07:00 durin <[hidden email]>: Update:

Re: KMeans Input Format

2014-08-07 Thread durin
Not all memory can be used for Java heap space, so maybe it does run out. Could you try repartitioning the data? To my knowledge, you shouldn't be thrown out as long as a single partition fits into memory, even if the whole dataset does not. To do that, exchange val train = parsedData.cache() with
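
A sketch of the suggested exchange, with the partition count as an illustrative value:

  // Before: val train = parsedData.cache()
  // After: repartition first, so each partition stays small enough to fit in memory.
  val train = parsedData.repartition(100).cache()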

Executors for Spark shell take much longer to be ready

2014-08-08 Thread durin
I recently moved my Spark installation from one Linux user to another, i.e. changed the folder and ownership of the files. That was everything; no other settings were changed and no different machines were used. However, it now suddenly takes three minutes to have all executors in the Spark shell ready

Re: saveAsTextFile

2014-08-10 Thread durin
This should work:

  jobs.saveAsTextFile("file:////home/hysom/testing")

Note the 4 slashes; it's really 3 slashes + absolute path. This should be mentioned in the docs though; I only remember that from having seen it somewhere else. The output folder, here "testing", will be created and must theref

Using very large files for KMeans training -- cluster centers size?

2014-08-11 Thread durin
I'm trying to apply KMeans training to some text data, which consists of lines that each contain something between 3 and 20 words. For that purpose, all unique words are saved in a dictionary. This dictionary can become very large as no hashing etc. is done, but it should spill to disk in case it d

Re: Spark webUI - application details page

2014-08-14 Thread durin
If I don't understand you wrong, setting up event logging in SPARK_JAVA_OPTS should achieve what you want. I'm logging to HDFS, but according to the config page a folder should be possible as well. Example with all other settings removed:
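
The example itself is truncated in this archive; as a hedged stand-in, the same settings expressed via SparkConf, using property names from the Spark 1.x configuration page (the HDFS path is illustrative):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs://host:9100/user/spark-events") // a local folder works too, per the config page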

Only master is really busy at KMeans training

2014-08-19 Thread durin
When trying to use KMeans.train with some large data and 5 worker nodes, it would fail due to BlockManagers shutting down because of a timeout. I was able to prevent that by adding spark.storage.blockManagerSlaveTimeoutMs 300 to spark-defaults.conf. However, with 1 million feature vectors, the

Re: Only master is really busy at KMeans training

2014-08-25 Thread durin
With a lower number of partitions, I keep losing executors during collect at KMeans.scala:283. The error message is "ExecutorLostFailure (executor lost)". The program recovers by automatically repartitioning the whole dataset (126G), which takes very long and seems to only delay the inevitable

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
Right now, I have issues even at a far earlier point. I'm fetching data from a registered table via

  var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000")
    .map(_.head.toString)
    .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) // persisted because it's used again later

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-08-27 Thread durin
Hi, I'm using a cluster with 5 nodes that each use 8 cores and 10GB of RAM. Basically, I'm creating a dictionary from text, i.e. giving each word that occurs more than n times in all texts a unique identifier. The essential part of the code looks like this: var texts = ctx.sql("SELECT text FROM
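
A minimal sketch of the dictionary step described above (the threshold is illustrative, and texts is the RDD[String] from the snippet): count word occurrences, keep the frequent ones, and zip with a unique id.

  val n = 10 // illustrative threshold
  val dictionary = texts
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count > n }
    .keys
    .zipWithIndex // (word, uniqueId)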

Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread durin
Doing a quick Google search, it appears to me that there are a number of people who have implemented algorithms for solving systems of (sparse) linear equations on Hadoop MapReduce. However, I can find no such thing for Spark. Does anyone have information on whether there are attempts at creating such an
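
Absent a ready-made library, one could sketch a simple distributed iteration by hand. Below is a hypothetical Jacobi solver for A x = b, not an existing Spark implementation: the rows of A live in an RDD and x is re-broadcast each iteration. It assumes A is diagonally dominant so that Jacobi converges, and uses a toy 3x3 system for illustration.

  // Toy diagonally dominant system; in practice the rows would come from HDFS.
  val a = Array(
    Array(4.0, 1.0, 0.0),
    Array(1.0, 4.0, 1.0),
    Array(0.0, 1.0, 4.0))
  val b = Array(1.0, 2.0, 3.0)
  val rows = sc.parallelize(a.zipWithIndex.map { case (r, i) => (i, r) })

  var x = Array.fill(b.length)(0.0)
  for (_ <- 1 to 50) {
    val bx = sc.broadcast(x)
    // x_i <- (b_i - sum_{j != i} A_ij * x_j) / A_ii
    x = rows.map { case (i, row) =>
      val s = row.zipWithIndex.collect { case (v, j) if j != i => v * bx.value(j) }.sum
      (i, (b(i) - s) / row(i))
    }.collect().sortBy(_._1).map(_._2)
  }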

Re: Solving Systems of Linear Equations Using Spark?

2014-09-22 Thread durin
Hey Deb, sorry for the late answer, I've been travelling and won't have much time until a few days from now. To be precise, it's not me who has to solve the problem, but a person I know well whom I'd like to help with a possibly faster method. I'll try to state the facts as well as I know them,