Re: PySpark 1.2 Hadoop version mismatch

2015-02-12 Thread Sean Owen
No, "mr1" should not be the issue here, and I think that would break other things. The OP is not using mr1. client 4 / server 7 means roughly "client is Hadoop 1.x, server is Hadoop 2.0.x". Normally, I'd say I think you are packaging Hadoop code in your app by brining in Spark and its deps. Your a

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
The map will start with a capacity of 64, but will grow to accommodate new data. Are you using the groupBy operator in Spark or are you using Spark SQL's group by? This usually happens if you are grouping or aggregating in a way that doesn't sufficiently condense the data created from each input pa
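To illustrate the point about condensing (a sketch with made-up data, not from the thread): reduceByKey combines values map-side before the shuffle, while groupByKey ships every record through the shuffle and buffers it per key.

val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))

// Combines map-side: only ~100 records per partition reach the shuffle.
val condensed = pairs.reduceByKey(_ + _)

// Does not condense: all one million records are shuffled and buffered per key,
// which is what keeps growing the map.
val grouped = pairs.groupByKey()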

Re: Can't access remote Hive table from spark

2015-02-12 Thread Zhan Zhang
When you log in, you have root access. Then you can do “su hdfs” or any other account. Then you can create hdfs directory and change permission, etc. Thanks Zhan Zhang On Feb 11, 2015, at 11:28 PM, guxiaobo1982 <guxiaobo1...@qq.com> wrote: Hi Zhan, Yes, I found there is a hdfs accoun

Re: obtain cluster assignment in K-means

2015-02-12 Thread Robin East
KMeans.train actually returns a KMeansModel so you can use predict() method of the model e.g. clusters.predict(pointToPredict) or clusters.predict(pointsToPredict) first is a single Vector, 2nd is RDD[Vector] Robin On 12 Feb 2015, at 06:37, Shi Yu wrote: > Hi there, > > I am new to spark.
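For reference, a minimal sketch (with made-up points) of the pattern Robin describes:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0), Vectors.dense(0.1, 0.2)))

// KMeans.train returns a KMeansModel
val clusters = KMeans.train(points, 2, 20)

val singleAssignment = clusters.predict(Vectors.dense(0.2, 0.1))  // Int cluster id
val allAssignments   = clusters.predict(points)                   // RDD[Int]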

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread fightf...@163.com
Hi, Patrick, Really glad to get your reply. Yes, we are doing group by operations for our work. We know that this is common for growTable when processing large data sets. The question actually is: do we have any way to modify the initialCapacity ourselves, specifically for our

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
Interesting to hear that it works for you. Are you using Yarn 2.2 as well? No strange log message during startup, and can't see any other log messages since no executor gets launched. Does not seem to work in yarn-client mode either, failing with the exception below. Exception in thread "main" or

Re: How to log using log4j to local file system inside a Spark application that runs on YARN?

2015-02-12 Thread Emre Sevinc
Marcelo and Burak, Thank you very much for your explanations. Now I'm able to see my logs. On Wed, Feb 11, 2015 at 7:52 PM, Marcelo Vanzin wrote: > For Yarn, you need to upload your log4j.properties separately from > your app's jar, because of some internal issues that are too boring to > expla

Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread lihu
I am trying to run Spark SQL queries from multiple threads. Some sample code looks like this: val sqlContext = new SQLContext(sc) val rdd_query = sc.parallelize(data, part) rdd_query.registerTempTable("MyTable") sqlContext.cacheTable("MyTable") val serverPool = Executors.newFixedThreadPool(3) val
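A self-contained sketch of that pattern (Spark 1.2-style API, with an assumed case class for the schema); note the SparkContext must stay alive until the pool has finished:

import java.util.concurrent.Executors
import org.apache.spark.sql.SQLContext

case class Record(id: Int, name: String)           // assumed schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                   // implicit RDD[Record] -> SchemaRDD

val rdd = sc.parallelize(1 to 1000).map(i => Record(i, s"name$i"))
rdd.registerTempTable("MyTable")
sqlContext.cacheTable("MyTable")

val serverPool = Executors.newFixedThreadPool(3)
(1 to 3).foreach { _ =>
  serverPool.submit(new Runnable {
    def run(): Unit =
      println(sqlContext.sql("SELECT count(*) FROM MyTable").collect().toSeq)
  })
}
serverPool.shutdown()   // only call sc.stop() after the pool has drained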

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-12 Thread didmar
Ok, I would suggest adding SPARK_DRIVER_MEMORY in spark-env.sh, with a larger amount of memory than the default 512m -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-with-ramdom-forest-and-small-training-dataset-tp21598p21618.html Sent from

Spark Streaming distributed batch locking

2015-02-12 Thread Legg John
Hi, After doing lots of reading and building a POC for our use case, we are still unsure as to whether Spark Streaming can handle it: * We have an inbound stream of sensor data for millions of devices (which have unique identifiers). * We need to perform aggregation of this stream on a pe

Re: OutofMemoryError: Java heap space

2015-02-12 Thread Yifan LI
Thanks, Kelvin :) The error seems to disappear after I decreased both spark.storage.memoryFraction and spark.shuffle.memoryFraction to 0.2, and increased the driver memory a bit. Best, Yifan LI > On 10 Feb 2015, at 18:58, Kelvin Chu <2dot7kel...@gmail.com> wrote: > > Since the stacktrace

Re: Spark Streaming distributed batch locking

2015-02-12 Thread Arush Kharbanda
* We have an inbound stream of sensor data for millions of devices (which have unique identifiers). Spark Streaming can handle events in the ballpark of 100-500K records/sec/node - *so you need to decide on a cluster accordingly. And it's scalable.* * We need to perform aggregation of this stream o

Re: Streaming scheduling delay

2015-02-12 Thread Gerard Maas
Hi Tim, From this: "There are 5 kafka receivers and each incoming stream is split into 40 partitions" I suspect that you're creating too many tasks for Spark to process on time. Could you try some of the 'knobs' I describe here to see if that would help? http://www.virdata.com/tuning-spark/ -

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-12 Thread Rok Roskar
Hi again, I narrowed down the issue a bit more -- it seems to have to do with the Kryo serializer. When I use it, then this results in a Null Pointer: rdd = sc.parallelize(range(10)) d = {} from random import random for i in range(10) : d[i] = random() rdd.map(lambda x: d[x]).collect()

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-12 Thread poiuytrez
Very interesting. It works. When I set SPARK_DRIVER_MEMORY=83971m in spark-env.sh or spark-default.conf it works. However, when I set the --driver-memory option with spark-submit, the memory is not allocated to the spark master. (the web ui shows the correct value of spark.driver.memory (83971m)

Re: OutOfMemoryError with ramdom forest and small training dataset

2015-02-12 Thread Sean Owen
Looking at the script, I'm not sure whether --driver-memory is supposed to work in standalone client mode. It's "too late" to set the driver's memory if the driver is what's already running. It specially handles the case where the value is the environment config though. Not sure, this might be on p

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Aniket Bhatnagar
This is tricky to debug. Check the logs of the node and resource manager of YARN to see if you can trace the error. In the past I have had to look closely at the arguments getting passed to the YARN container (they get logged before attempting to launch containers). If I still didn't get a clue, I had to check the scri

apply function to all the elements of a rowMatrix

2015-02-12 Thread Donbeo
Hi, I need to apply a function to all the elements of a rowMatrix. How can I do that? Here there is a more detailed question http://stackoverflow.com/questions/28438908/spark-mllib-apply-function-to-all-the-elements-of-a-rowmatrix Thanks a lot! -- View this message in context: http://apach

Invoking updateStateByKey twice on the same RDD

2015-02-12 Thread harsha
Can I invoke UpdateStateByKey twice on the same RDD. My requirement is as follows. 1. Get the event stream from Kafka 2. UpdateStateByKey to aggregate and filter events based on timestamp 3. Do some processing and save results to Cassandra DB 4. UpdateStateByKey to remove keys based on logout even

Test

2015-02-12 Thread Dima Zhiyanov
Sent from my iPhone - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Use of nscala-time within spark-shell

2015-02-12 Thread Hammam
Hi All, Thanks in advance for your help. I have a timestamp which I need to convert to a datetime using Scala. A folder contains the three needed jar files: "joda-convert-1.5.jar joda-time-2.4.jar nscala-time_2.11-1.8.0.jar" Using scala REPL and adding the jars: scala -classpath "*.jar" I can use ns
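For reference, once those jars are on the classpath (e.g. via scala -classpath or spark-shell --jars), a conversion sketch might look like:

import com.github.nscala_time.time.Imports._

val ts = 1423785600000L        // epoch milliseconds
val dt = new DateTime(ts)      // joda DateTime, re-exported by nscala-time
println(dt.getHourOfDay)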

Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Todd Nist
I have a question with regards to accessing SchemaRDD’s and Spark SQL temp tables via the thrift server. It appears that a SchemaRDD when created is only available in the local namespace / context and are unavailable to external services accessing Spark through thrift server via ODBC; is this corr

Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi All, using Pyspark, I create two RDDs (one with about 2M records (~200MB), the other with about 8M records (~2GB)) of the format (key, value). I've done a partitionBy(num_partitions) on both RDDs and verified that both RDDs have the same number of partitions and that equal keys reside on

saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Hi. I am stuck on how to save a file to HDFS from Spark. I have written MyOutputFormat extends FileOutputFormat, then in Spark I call this: rddres.saveAsHadoopFile[MyOutputFormat]("hdfs://localhost/output") or rddres.saveAsHadoopFile("hdfs://localhost/output", classOf[String], classOf[MyObj

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Ted Yu
You can use JavaPairRDD which has: override def wrapRDD(rdd: RDD[(K, V)]): JavaPairRDD[K, V] = JavaPairRDD.fromRDD(rdd) Cheers On Thu, Feb 12, 2015 at 7:36 AM, Vladimir Protsenko wrote: > Hi. I am stuck with how to save file to hdfs from spark. > > I have written MyOutputFormat extends FileO

failing GraphX application ('GC overhead limit exceeded', 'Lost executor', 'Connection refused', etc.)

2015-02-12 Thread Matthew Cornell
Hi Folks, I'm running a five-step path-following algorithm on a movie graph with 120K vertices and 400K edges. The graph has vertices for actors, directors, movies, users, and user ratings, and my Scala code is walking the path "rating > movie > rating > user > rating". There are 75K rating no

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Hi Karlson, I think your assumptions are correct -- that join alone shouldn't require any shuffling. But it's possible you are getting tripped up by lazy evaluation of RDDs. After you do your partitionBy, are you sure those RDDs are actually materialized & cached somewhere? e.g., if you just did

Re: [hive context] Unable to query array once saved as parquet

2015-02-12 Thread Ayoub
Hi, as I was trying to find a workaround until this bug is fixed, I discovered another bug posted here: https://issues.apache.org/jira/browse/SPARK-5775 For those who might have had the same issue, one could use the "LOAD" sql command in a hive context to load the parquet file into the table as

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Imran Rashid
You need to import the implicit conversions to PairRDDFunctions with import org.apache.spark.SparkContext._ (note that this requirement will go away in 1.3: https://issues.apache.org/jira/browse/SPARK-4397) On Thu, Feb 12, 2015 at 9:36 AM, Vladimir Protsenko wrote: > Hi. I am stuck with how to
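In other words (a sketch using the built-in TextOutputFormat rather than the poster's MyOutputFormat):

import org.apache.spark.SparkContext._                 // brings PairRDDFunctions into scope
import org.apache.hadoop.mapred.TextOutputFormat

val rddres = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))

// Compiles only once the implicit conversion above is in scope.
rddres.saveAsHadoopFile[TextOutputFormat[String, String]]("hdfs://localhost/output")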

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi Imran, thanks for your quick reply. Actually I am doing this: rddA = rddA.partitionBy(n).cache() rddB = rddB.partitionBy(n).cache() followed by rddA.count() rddB.count() then joinedRDD = rddA.join(rddB) I thought that the count() would force the evaluation, so any subsequ

Re: Shuffle on joining two RDDs

2015-02-12 Thread Sean Owen
Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote: > Hi Karlson, > > I think your assumptions are correct -- that join alone shouldn't require > any shuffling. But its possible you are getting tripped up by lazy > evaluation of RDD

Re: obtain cluster assignment in K-means

2015-02-12 Thread Shi Yu
Thanks Robin, got it. On Thu, Feb 12, 2015 at 2:21 AM, Robin East wrote: > KMeans.train actually returns a KMeansModel so you can use predict() > method of the model > > e.g. clusters.predict(pointToPredict) > or > > clusters.predict(pointsToPredict) > > first is a single Vector, 2nd is RDD[Vect

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs. On 2015-02-12 17:12, Sean Owen wrote: Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote: Hi Karlson, I think your assumptions are correct

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
ah, sorry, I am not too familiar w/ pyspark -- I missed that part. It could be that pyspark doesn't properly support narrow dependencies, or maybe you need to be more explicit about the partitioner. I am looking into the pyspark api but you might have some better guesses here than I thought.

8080 port password protection

2015-02-12 Thread MASTER_ZION (Jairo Linux)
Hi everyone, I'm creating a development machine in AWS and I would like to protect port 8080 using a password. Is it possible? Best Regards *Jairo Moreno*

spark mllib error when predict on linear regression model

2015-02-12 Thread Donbeo
Hi, I have a model and I am trying to predict regPoints. Here is the code that I have used. A more detailed question is available at http://stackoverflow.com/questions/28482476/spark-mllib-predict-error-with-map scala> model res26: org.apache.spark.mllib.regression.LinearRegressionModel = (weig
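For context, the usual way to predict over an RDD of feature vectors (a tiny sketch, not the poster's data):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val train = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0)),
  LabeledPoint(2.0, Vectors.dense(2.0)),
  LabeledPoint(3.0, Vectors.dense(3.0))))

val model = LinearRegressionWithSGD.train(train, 100)   // returns a LinearRegressionModel

// Predict on an RDD[Vector] rather than mapping model.predict over rows yourself.
val preds = model.predict(train.map(_.features))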

Re: 8080 port password protection

2015-02-12 Thread Arush Kharbanda
You could add password protection using a servlet filter on the server. Though this doesn't look like the right group for the question, it can be done for Spark as well, for the Spark UI. On Thu, Feb 12, 2015 at 10:19 PM, MASTER_ZION (Jairo Linux) < master.z...@gmail.com> wrote: > Hi everyone, > > Im creating a development

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
I wonder if the issue is that these lines just need to add preservesPartitioning = true ? https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38 I am getting the feeling this is an issue w/ pyspark On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid wrote: > ah, sorry I am not too

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
I ran against 2.6, not 2.2. For that yarn-client run, do you have the application master log? On Thu, Feb 12, 2015 at 6:11 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > This is tricky to debug. Check logs of node and resource manager of YARN > to see if you can trace the error. In

Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the "spark.kryo.registrator" property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet wrote: > I'm trying to register a custom class that extends Kryo's Serializer > interface. I can't tell exactly what Class the registerKryo
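A minimal sketch of that approach (the Point class and its serializer are illustrative, not from the thread):

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class Point(x: Double, y: Double)

class PointSerializer extends Serializer[Point] {
  override def write(kryo: Kryo, output: Output, p: Point): Unit = {
    output.writeDouble(p.x); output.writeDouble(p.y)
  }
  override def read(kryo: Kryo, input: Input, cls: Class[Point]): Point =
    Point(input.readDouble(), input.readDouble())
}

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[Point], new PointSerializer)
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")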

Re: Shuffle on joining two RDDs

2015-02-12 Thread Davies Liu
The feature works as expected in Scala/Java, but not implemented in Python. On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote: > I wonder if the issue is that these lines just need to add > preservesPartitioning = true > ? > > https://github.com/apache/spark/blob/master/python/pyspark/join.py#L
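For reference, the Scala pattern where the join stays narrow (a sketch; both sides must share the same partitioner settings and be materialized first):

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)

val rddA = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
val rddB = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()
rddA.count(); rddB.count()          // force materialization

// Same partitioner on both sides => narrow dependency, no extra shuffle stage.
val joined = rddA.join(rddB)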

Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
I thought more on it...can we provide access to the IndexedRDD through thriftserver API and let the mapPartitions query the API ? I am not sure if ThriftServer is as performant as opening up an API using other akka based frameworks (like play or spray)... Any pointers will be really helpful... Ne

Master dies after program finishes normally

2015-02-12 Thread Manas Kar
Hi, I have a Hidden Markov Model running with 200MB data. Once the program finishes (i.e. all stages/jobs are done) the program hangs for 20 minutes or so before killing the master. In the spark master the following log appears. 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fat

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
The thing is that my time is perfectly valid... On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das wrote: > Its with the timezone actually, you can either use an NTP to maintain > accurate system clock or you can adjust your system time to match with the > AWS one. You can do it as: > > telnet s3.amazo

Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
What is your cluster configuration? Did you try looking at the Web UI? There are many tips here http://spark.apache.org/docs/1.2.0/tuning.html Did you try these? On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar wrote: > Hi, > I have a Hidden Markov Model running with 200MB data. > Once the progra

Re: Master dies after program finishes normally

2015-02-12 Thread Manas Kar
Hi Arush, Mine is a CDH5.3 with Spark 1.2. The only change to my spark programs are -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000. ..Manas On Thu, Feb 12, 2015 at 2:05 PM, Arush Kharbanda wrote: > What is your cluster configuration? Did you try looking at the Web UI? > There are

Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
How many nodes do you have in your cluster, how many cores, what is the size of the memory? On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar wrote: > Hi Arush, > Mine is a CDH5.3 with Spark 1.2. > The only change to my spark programs are > -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000.

Re: Master dies after program finishes normally

2015-02-12 Thread Manas Kar
I have 5 workers, each with executor-memory 8GB. My driver memory is 8 GB as well. They are all 8-core machines. To answer Imran's question, my configuration is thus: executor_total_max_heapsize = 18GB. This problem happens at the end of my program. I don't have to run a lot of jobs to see th

Concurrent batch processing

2015-02-12 Thread Matus Faro
Hi, Please correct me if I'm wrong, in Spark Streaming, next batch will not start processing until the previous batch has completed. Is there any way to be able to start processing the next batch if the previous batch is taking longer to process than the batch interval? The problem I am facing is

Re: Concurrent batch processing

2015-02-12 Thread Arush Kharbanda
It could depend on the nature of your application, but Spark Streaming uses Spark internally, so concurrency should be there. What is your use case? Are you sure that your configuration is good? On Fri, Feb 13, 2015 at 1:17 AM, Matus Faro wrote: > Hi, > > Please correct me if I'm wrong, in

correct way to broadcast a variable

2015-02-12 Thread freedafeng
Suppose I have an object to broadcast and then use in a mapper function, something like the following (Python code): obj2share = sc.broadcast("Some object here") someRdd.map(createMapper(obj2share)).collect() The createMapper function will create a mapper function using the shared object's value. Anothe
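The same pattern in Scala, for comparison (a sketch; the broadcast handle is captured by the closure and .value is read on the executors):

val obj2share = sc.broadcast(Map("factor" -> 10))
val someRdd   = sc.parallelize(1 to 5)

val result = someRdd.map(x => x * obj2share.value("factor")).collect()
// result: Array(10, 20, 30, 40, 50)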

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Deborah Siegel
Hi Abe, I'm new to Spark as well, so someone else could answer better. A few thoughts which may or may not be the right line of thinking.. 1) Spark properties can be set on the SparkConf, and with flags in spark-submit, but settings on SparkConf take precedence. I think your jars flag for spark-su

Re: Concurrent batch processing

2015-02-12 Thread Matus Faro
I've been experimenting with my configuration for couple of days and gained quite a bit of power through small optimizations, but it may very well be something I'm doing crazy that is causing this problem. To give a little bit of a background, I am in the early stages of a project that consumes a

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Sandy Ryza
What version of Java are you using? Core NLP dropped support for Java 7 in its 3.5.0 release. Also, the correct command line option is --jars, not --addJars. On Thu, Feb 12, 2015 at 12:03 PM, Deborah Siegel wrote: > Hi Abe, > I'm new to Spark as well, so someone else could answer better. A few

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
Check that your timezone is correct as well, an incorrect timezone can make it look like your time is correct when it is skewed. cheers On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim wrote: > The thing is that my time is perfectly valid... > > On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das > wrote: >

Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Dima Zhiyanov
Hello Has Spark implemented computing statistics for Parquet files? Or is there any other way I can enable broadcast joins between parquet file RDDs in Spark Sql? Thanks Dima -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-Sp

Re: Master dies after program finishes normally

2015-02-12 Thread Imran Rashid
The important thing here is the master's memory, that's where you're getting the GC overhead limit. The master is updating its UI to include your finished app when your app finishes, which would cause a spike in memory usage. I wouldn't expect the master to need a ton of memory just to serve the

RE: PySpark 1.2 Hadoop version mismatch

2015-02-12 Thread Michael Nazario
I looked at the environment which I ran the spark-submit command in, and it looks like there is nothing that could be messing with the classpath. Just to be sure, I checked the web UI which says the classpath contains: - The two jars I added: /path/to/avro-mapred-1.7.4-hadoop2.jar and lib/spark-

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
No, not submitting from windows, from a debian distribution. Had a quick look at the rm logs, and it seems some containers are allocated but then released again for some reason. Not easy to make sense of the logs, but here is a snippet from the logs (from a test in our small test cluster) if you'd

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
It seems unlikely to me that it would be a 2.2 issue, though not entirely impossible. Are you able to find any of the container logs? Is the NodeManager launching containers and reporting some exit code? -Sandy On Thu, Feb 12, 2015 at 1:21 PM, Anders Arpteg wrote: > No, not submitting from wi

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
So I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(items => { myFunction(items) }) }) and I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(myFunction) }) I tried making myFunction a method, a function val, and even moving it into a singleton obj

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
Looks like my clock is in sync: -bash-4.1$ date && curl -v s3.amazonaws.com Thu Feb 12 21:40:18 UTC 2015 * About to connect() to s3.amazonaws.com port 80 (#0) * Trying 54.231.12.24... connected * Connected to s3.amazonaws.com (54.231.12.24) port 80 (#0) > GET / HTTP/1.1 > User-Agent: curl/7.19.7

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Felix C
You would probably write to hdfs or check out https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html You might be able to retrofit it to you use case. --- Original Message --- From: "Su She" Sent: February 11, 2015 10:55 PM To: "Felix C" Cc: "Kelvin Chu" <2dot7

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
The nm logs only seems to contain similar to the following. Nothing else in the same time range. Any help? 2015-02-12 20:47:31,245 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1422406067005_

Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Hi, I wonder if there is a way to do fast top N product recommendations for all users in training using mllib's ALS algorithm. I am currently calling public Rating [] recommendProducts(int user,

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Silvio Fiorito
One method I’ve used is to publish each batch to a message bus or queue with a custom UI listening on the other end, displaying the results in d3.js or some other app. As far as I’m aware there isn’t a tool that will directly take a DStream. Spark Notebook seems to have some support for updatin

Re: Concurrent batch processing

2015-02-12 Thread Tathagata Das
So you have come across spark.streaming.concurrentJobs already :) Yeah, that is an undocumented feature that does allow multiple output operations to submitted in parallel. However, this is not made public for the exact reasons that you realized - the semantics in case of stateful operations is not
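For completeness, the setting mentioned above is just a conf flag (use with care, given the caveat about stateful operations):

val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.concurrentJobs", "2")   // allow 2 output operations in flight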

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Sean Owen
Not now, but see https://issues.apache.org/jira/browse/SPARK-3066 As an aside, it's quite expensive to make recommendations for all users. IMHO this is not something to do, if you can avoid it architecturally. For example, consider precomputing recommendations only for users whose probability of n

Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Michael Armbrust
In Spark 1.3, parquet tables that are created through the datasources API will automatically calculate the sizeInBytes, which is used to broadcast. On Thu, Feb 12, 2015 at 12:46 PM, Dima Zhiyanov wrote: > Hello > > Has Spark implemented computing statistics for Parquet files? Or is there > any o
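A related knob, for reference: the broadcast decision compares table statistics against spark.sql.autoBroadcastJoinThreshold (in bytes), which can be raised in an existing SQLContext/HiveContext when statistics are available:

// e.g. allow tables up to ~50 MB to be broadcast in joins
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=52428800")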

Re: Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Michael Armbrust
You can start a JDBC server with an existing context. See my answer here: http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html On Thu, Feb 12, 2015 at 7:24 AM, Todd Nist wrote: > I have a question with regards to accessing SchemaRDD’s and Spark

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Thanks, Sean! Glad to know it will be in the future release. On Thu, Feb 12, 2015 at 2:45 PM, Sean Owen wrote: > Not now, but see https://issues.apache.org/jira/browse/SPARK-3066 > > As an aside, it's quite expensive to make recommendations for all > users. IMHO this is not something to do, if y

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread Michael Armbrust
It looks to me like perhaps your SparkContext has shut down due to too many failures. I'd look in the logs of your executors for more information. On Thu, Feb 12, 2015 at 2:34 AM, lihu wrote: > I try to use the multi-thread to use the Spark SQL query. > some sample code just like this: > > val

Predicting Class Probability with Gradient Boosting/Random Forest

2015-02-12 Thread nilesh
We are using Gradient Boosting/Random Forests that I have found provide the best results for our recommendations. My issue is that I need the probability of the 0/1 label, and not the predicted label. In the spark scala api, I see that the predict method also has an option to provide the probabilit
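One common workaround sketch (not an official API) is to derive a pseudo-probability from the fraction of trees in a RandomForestModel voting for class 1:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Assumes a trained binary-classification RandomForestModel.
def probabilityOfOne(model: RandomForestModel, features: Vector): Double =
  model.trees.count(_.predict(features) == 1.0).toDouble / model.trees.length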

spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
Hi, I am using Spark 1.2.0 with Hadoop 2.2. I have two csv files, each with 8 fields. I know that the first field in both files is an ID. I want to find all the IDs that exist in the first file but NOT in the 2nd file. I came up with the following code in spark-shell. case class origAsLeft

Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi, I have some implicit rating data, such as purchasing data. I read the paper about the implicit training algorithm used in Spark, and it mentioned that for user-product pairs which do not have implicit rating data, such as no purchase, we need to provide the value 0. This is different fro

Re: Question about mllib als's implicit training

2015-02-12 Thread Sean Owen
Where there is no user-item interaction, you provide no interaction, not an interaction with strength 0. Otherwise your input is fully dense. On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing wrote: > Hi, > > I have some implicit rating data, such as the purchasing data. I read the > paper about th

RE: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Ganelin, Ilya
Hi all - I've spent a while playing with this. Two significant sources of speed up that I've achieved are 1) Manually multiplying the feature vectors and caching either the user or product vector 2) By doing so, if one of the RDDs is a global it becomes possible to parallelize this step by run
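A rough sketch of idea (1) (assumes a trained MatrixFactorizationModel named model and that the product factors fit in memory):

val products   = model.productFeatures.collect()    // Array[(Int, Array[Double])]
val bcProducts = sc.broadcast(products)

val topN = model.userFeatures.mapValues { userVec =>
  bcProducts.value
    .map { case (productId, prodVec) =>
      (productId, userVec.zip(prodVec).map { case (u, p) => u * p }.sum) }  // dot product
    .sortBy(-_._2)
    .take(10)
}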

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
HI Sean, I am reading the paper of implicit training. Collaborative Filtering for Implicit Feedback Datasets It mentioned "To this end, let us introduce a set of binary variables p_ui, which indicates the preference of user u to item i. Th

Re: Question about mllib als's implicit training

2015-02-12 Thread Sean Owen
This all describes how the implementation operates, logically. The matrix P is never formed, for sure, certainly not by the caller. The implementation actually extends to handle negative values in R too but it's all taken care of by the implementation. On Thu, Feb 12, 2015 at 11:29 PM, Crystal Xi

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Many thanks! On Thu, Feb 12, 2015 at 3:31 PM, Sean Owen wrote: > This all describes how the implementation operates, logically. The > matrix P is never formed, for sure, certainly not by the caller. > > The implementation actually extends to handle negative values in R too > but it's all taken c

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
The more I'm thinking about this- I may try this instead: val myChunkedRDD: RDD[List[Event]] = inputRDD.mapPartitions(_ .grouped(300).toList) I wonder if this would work. I'll try it when I get back to work tomorrow. Yuyhao, I tried your approach too but it seems to be somehow moving all the da
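A sketch of the grouped-iterator idea with plain integers (mapPartitions expects an Iterator back, and Iterator.grouped already returns one):

val inputRDD = sc.parallelize(1 to 1000, 4)

// Each partition's iterator is chunked into Seqs of at most 300 elements.
val chunked: org.apache.spark.rdd.RDD[Seq[Int]] =
  inputRDD.mapPartitions(_.grouped(300))

chunked.map(_.size).collect()   // chunk sizes, e.g. Array(250, 250, 250, 250)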

RE: spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
OK. I think I have to use "None" instead of null, then it works. Still switching from Java. I can also just use the field name as I assumed. Great experience. From: java8...@hotmail.com To: user@spark.apache.org Subject: spark left outer join with java.lang.UnsupportedOperationException: empty
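The overall pattern, as a sketch with a trimmed-down case class (two fields instead of the original eight):

case class Rec(id: String, rest: String)           // stand-in for the 8-field record

val left  = sc.parallelize(Seq(Rec("1", "a"), Rec("2", "b"), Rec("3", "c")))
val right = sc.parallelize(Seq(Rec("2", "x")))

// IDs present in the first file but not in the second.
val onlyInLeft = left.map(r => (r.id, r))
  .leftOuterJoin(right.map(r => (r.id, r)))
  .collect { case (id, (_, None)) => id }          // None, not null, marks "no match"

onlyInLeft.collect()   // Array("1", "3")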

Re: Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Todd Nist
Thanks Michael. I will give it a try. On Thu, Feb 12, 2015 at 6:00 PM, Michael Armbrust wrote: > You can start a JDBC server with an existing context. See my answer here: > http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html > > On Thu, Feb

Re: exception with json4s render

2015-02-12 Thread Mohnish Kodnani
Any ideas on how to figure out what is going on when using json4s 3.2.11. I have a need to use 3.2.11 and just to see if things work I had downgraded to 3.2.10 and things started working. On Wed, Feb 11, 2015 at 11:45 AM, Charles Feduke wrote: > I was having a similar problem to this trying to

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread lihu
Thanks very much, you're right. I called the sc.stop() before the execute pool shutdown. On Fri, Feb 13, 2015 at 7:04 AM, Michael Armbrust wrote: > It looks to me like perhaps your SparkContext has shut down due to too > many failures. I'd look in the logs of your executors for more informatio

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
Hi Gerard, Great write-up and really good guidance in there. I have to be honest, I don't know why but setting # of partitions for each dStream to a low number (5-10) just causes the app to choke/crash. Setting it to 20 gets the app going but with not so great delays. Bump it up to 30 and I start

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Kevin (Sangwoo) Kim
Apache Zeppelin also has a scheduler and then you can reload your chart periodically, Check it out: http://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html On Fri Feb 13 2015 at 7:29:00 AM Silvio Fiorito < silvio.fior...@granturing.com> wrote: > One method I’ve used is to publish ea

Re: Extract hour from Timestamp in Spark SQL

2015-02-12 Thread Michael Armbrust
This looks like your executors aren't running a version of spark with hive support compiled in. On Feb 12, 2015 7:31 PM, "Wush Wu" wrote: > Dear Michael, > > After use the org.apache.spark.sql.hive.HiveContext, the Exception: > "java.util.NoSuchElementException: key not found: hour" is gone du

Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Manoj Samel
d...@spark.apache.org mentioned on http://spark.apache.org/community.html seems to be bouncing. Is there another one ?

Re: Spark streaming job throwing ClassNotFound exception when recovering from checkpointing

2015-02-12 Thread Tathagata Das
Could you come up with a minimal example through which I can reproduce the problem? On Tue, Feb 10, 2015 at 12:30 PM, conor wrote: > I am getting the following error when I kill the spark driver and restart > the job: > > 15/02/10 17:31:05 INFO CheckpointReader: Attempting to load checkpoint fro

Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
Hey Tim, Let me get the key points. 1. If you are not writing back to Kafka, the delay is stable? That is, instead of "foreachRDD { // write to kafka }" if you do "dstream.count", then the delay is stable. Right? 2. If so, then Kafka is the bottleneck. Is the number of partitions, that you spoke

Re: HiveContext in SparkSQL - concurrency issues

2015-02-12 Thread Harika
Hi, I've been reading about Spark SQL and people suggest that using HiveContext is better. So can anyone please suggest a solution to the above problem. This is stopping me from moving forward with HiveContext. Thanks Harika -- View this message in context: http://apache-spark-user-list.10015

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-12 Thread Tathagata Das
Can you give me the whole logs? TD On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg wrote: > OK that worked and getting close here ... the job ran successfully for a > bit and I got output for the first couple buckets before getting a > "java.lang.Exception: Could not compute split, block input-0-14

Re: streaming joining multiple streams

2015-02-12 Thread Tathagata Das
Sorry for the late response. With the amount of data you are planning to join, any system would take time. However, between Hive's MapReduce joins, Spark's basic shuffle, and Spark SQL's join, the latter wins hands down. Furthermore, with the APIs of Spark and Spark Streaming, you will have to do s

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set in which I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current implementation

Re: Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Ted Yu
dev@spark is active. e.g. see: http://search-hadoop.com/m/JW1q5zQ1Xw/renaming+SchemaRDD+-%253E+DataFrame&subj=renaming+SchemaRDD+gt+DataFrame Cheers On Thu, Feb 12, 2015 at 8:09 PM, Manoj Samel wrote: > d...@spark.apache.org > mentio

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
1) Yes, if I disable writing out to kafka and replace it with some very lightweight action like rdd.take(1), the app is stable. 2) The partitions I spoke of in the previous mail are the number of partitions I create from each dStream. But yes, since I do processing and writing out, per partition, e

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
Hi Corey, I would not recommend using the CatalystScan for this. Its lower level, and not stable across releases. You should be able to do what you want with PrunedFilteredScan
