Re: Bug in Accumulators...

2014-11-07 Thread Shixiong Zhu
Could you provide the complete code that reproduces the bug? Here is my test code: import org.apache.spark._ import org.apache.spark.SparkContext._ object SimpleApp { def main(args: Array[String]) { val conf = new SparkConf().setAppName("SimpleApp") val sc = new SparkContext(conf

Re: Bug in Accumulators...

2014-11-07 Thread Aaron Davidson
This may be due in part to Scala allocating an anonymous inner class in order to execute the for loop. I would expect that if you change it to a while loop, like var i = 0; while (i < 10) { sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x); i += 1 }, then the problem may go away. I am not sup
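
For reference, a minimal self-contained sketch of the while-loop variant Aaron describes, assuming a plain Int accumulator created with sc.accumulator(0):

    import org.apache.spark.{SparkConf, SparkContext}

    object AccumWhileLoop {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AccumWhileLoop"))
        val accum = sc.accumulator(0)    // Int accumulator starting at 0
        var i = 0
        while (i < 10) {                 // a while loop avoids the anonymous class a for comprehension allocates
          sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
          i += 1
        }
        println(accum.value)             // expected: 10 * (1 + 2 + 3 + 4) = 100
        sc.stop()
      }
    }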

RE: CheckPoint Issue with JsonRDD

2014-11-07 Thread Jahagirdar, Madhu
Michael, any idea on this? From: Jahagirdar, Madhu Sent: Thursday, November 06, 2014 2:36 PM To: mich...@databricks.com; user Subject: CheckPoint Issue with JsonRDD When we enable checkpoint and use JsonRDD we get the following error: Is this a bug? Except

RE: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Jahagirdar, Madhu
Any idea on this? From: Jahagirdar, Madhu Sent: Thursday, November 06, 2014 12:28 PM To: Michael Armbrust Cc: u...@spark.incubator.apache.org Subject: RE: Dynamically InferSchema From Hive and Create parquet file When I create Hive table with Parquet format

sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
I am trying to group by on a calculated field. Is it supported in Spark SQL? I am running it on a nested JSON structure. Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim c group by YEAR(c.Patient.DOB) Spark Version: spark-1.2.0-SNAPSHOT with Hive and Hadoop 2.4. Error

about write mongodb in mapPartitions

2014-11-07 Thread qinwei
Hi, everyone. I have come across a problem with writing data to MongoDB in mapPartitions; my code is as below: val sourceRDD = sc.textFile("hdfs://host:port/sourcePath") // some transformations val rdd = sourceRDD.map(mapFunc).filter(filterFunc) va

Re: multiple spark context in same driver program

2014-11-07 Thread Tobias Pfeiffer
Hi, On Fri, Nov 7, 2014 at 4:58 PM, Akhil Das wrote: > > That doc was created during the initial days (Spark 0.8.0), you can of > course create multiple sparkContexts in the same driver program now. > You sure about that? According to http://apache-spark-user-list.1001560.n3.nabble.com/Is-spark-

Re: about write mongodb in mapPartitions

2014-11-07 Thread Akhil Das
Why not saveAsNewAPIHadoopFile? //Define your mongoDB confs val config = new Configuration() config.set("mongo.output.uri", "mongodb:// 127.0.0.1:27017/sigmoid.output") //Write everything to mongo rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any], classOf[Any], classOf[com.m
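
A fuller sketch of this approach, assuming the mongo-hadoop connector and the MongoDB Java driver are on the classpath and that rdd is an RDD[String] of package names as in qinwei's snippet (the URI, database and collection names are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.bson.BasicBSONObject
    import com.mongodb.hadoop.MongoOutputFormat

    // Define your MongoDB output configuration
    val config = new Configuration()
    config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

    // MongoOutputFormat expects (key, value) pairs; a null key leaves _id generation to MongoDB
    val mongoRDD: RDD[(Object, Object)] = rdd.map { pkg =>
      val doc = new BasicBSONObject()
      doc.put("pkg", pkg)
      (null, doc)
    }

    mongoRDD.saveAsNewAPIHadoopFile(
      "file:///this/path/is/ignored",      // the path is not actually used for the Mongo output
      classOf[Object],
      classOf[Object],
      classOf[MongoOutputFormat[Object, Object]],
      config)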

Re: about write mongodb in mapPartitions

2014-11-07 Thread Tobias Pfeiffer
Hi, On Fri, Nov 7, 2014 at 6:23 PM, qinwei wrote: > > args.map(arg => { > coll.insert(new BasicDBObject("pkg", arg)) > arg > }) > > mongoClient.close() > args > As the results of args.map are never used anywhere, I t
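
The gist is that map on an Iterator is lazy, so the inserts never execute before mongoClient.close() is called. One hedged way to restructure the mapPartitions body, keeping qinwei's names (connection details are placeholders):

    import com.mongodb.{BasicDBObject, MongoClient}

    val resultRDD = rdd.mapPartitions { args =>
      val mongoClient = new MongoClient("127.0.0.1", 27017)       // one client per partition
      val coll = mongoClient.getDB("db").getCollection("coll")
      // materialize the partition so the inserts actually run before the client is closed
      val written = args.map { arg =>
        coll.insert(new BasicDBObject("pkg", arg))
        arg
      }.toList
      mongoClient.close()
      written.iterator
    }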

Re: multiple spark context in same driver program

2014-11-07 Thread Akhil Das
My bad, I just fired up a spark-shell and created a new sparkContext and it was working fine. I basically did a parallelize and collect with both sparkContexts. Thanks Best Regards On Fri, Nov 7, 2014 at 3:17 PM, Tobias Pfeiffer wrote: > Hi, > > On Fri, Nov 7, 2014 at 4:58 PM, Akhil Das > wrot

Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List, Has anybody had experience integrating C/C++ code into Spark jobs? I have done some work on this topic using JNA. I wrote a FlatMapFunction that processes all partition entries using a C++ library. This approach works well, but there are some tradeoffs: * Shipping the native dylib
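
For reference, a minimal sketch of the JNA pattern described above: a hypothetical native library "mylib" exposing int process(int), loaded lazily once per executor JVM and invoked inside mapPartitions (the library name and signature are made up for illustration, and the dylib still has to be present on every worker):

    import com.sun.jna.{Library, Native}
    import org.apache.spark.rdd.RDD

    // JNA interface mirroring the hypothetical C function: int process(int x);
    trait MyLib extends Library {
      def process(x: Int): Int
    }

    object MyLib {
      // loaded at most once per JVM, i.e. once per executor
      lazy val instance: MyLib =
        Native.loadLibrary("mylib", classOf[MyLib]).asInstanceOf[MyLib]
    }

    def processWithNative(input: RDD[Int]): RDD[Int] =
      input.mapPartitions { iter =>
        val lib = MyLib.instance     // resolved on the executor, not shipped from the driver
        iter.map(lib.process)
      }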

Re: sql - group by on UDF not working

2014-11-07 Thread Shixiong Zhu
Such a query is not supported right now. I can easily reproduce it. Created a JIRA here: https://issues.apache.org/jira/browse/SPARK-4296 Best Regards, Shixiong Zhu 2014-11-07 16:44 GMT+08:00 Tridib Samanta : > I am trying to group by on a calculated field. Is it supported on spark > sql? I am running

Re: LZO support in Spark 1.0.0 - nothing seems to work

2014-11-07 Thread Sree Harsha
@rogthefrog Were you able to figure out how to fix this issue? I tried all the possible combinations but no luck yet. Thanks, Harsha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LZO-support-in-Spark-1-0-0-nothing-seems-to-work-tp14494p18349.html

MESOS slaves shut down due to "'health check timed out"

2014-11-07 Thread Yangcheng Huang
Hi guys Do you know how to handle the following case - = From MESOS log file = Slave asked to shut down by master@:5050 because 'health check timed out' I1107 17:33:20.860988 27573 slave.cpp:1337] Asked to shut down framework === Any configurations to

Re: Store DStreams into Hive using Hive Streaming

2014-11-07 Thread Luiz Geovani Vier
Hi Ted and Silvio, thanks for your responses. Hive has a new API for streaming ( https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest) that takes care of compaction and doesn't require any downtime for the table. The data is immediately available and Hive will combine files in ba

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
you're right, serialization works. what is your suggestion on saving a "distributed" model? so part of the model is in one cluster, and some other parts of the model are in other clusters. during runtime, these sub-models run independently in their own clusters (load, train, save). and at some

error when importing HiveContext

2014-11-07 Thread Pagliari, Roberto
I'm getting this error when importing hive context >>> from pyspark.sql import HiveContext Traceback (most recent call last): File "", line 1, in File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in from pyspark.context import SparkContext File "/path/spark-1.1.0/python/pys

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
thanks reza. i'm not familiar with the "block matrix multiplication", but is it a good fit for "very large dimension, but extremely sparse" matrix? if not, what is your recommendation on implementing matrix multiplication in spark on "very large dimension, but extremely sparse" matrix? On Thu

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Currently I see the word2vec model is collected onto the master, so the model itself is not distributed. I guess the question is why do you need a distributed model? Is the vocab size so large that it's necessary? For model serving in general, unless the model is truly massive (i.e. cannot fit

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
There are a few examples where this is the case. Let's take ALS, where the result is a MatrixFactorizationModel, which is assumed to be big - the model consists of two matrices, one (users x k) and one (k x products). These are represented as RDDs. You can save these RDDs out to disk by doing some
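
A hedged sketch of that save/load cycle, assuming ratings is an RDD[Rating] and the HDFS paths are placeholders (whether the model object itself can be rebuilt from the two RDDs depends on the constructor's visibility in your Spark version):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val model = ALS.train(ratings, 10, 10)   // rank = 10, 10 iterations

    // each factor matrix is an RDD[(Int, Array[Double])]
    model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
    model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

    // later, possibly in a different job
    val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
    val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")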

where is the org.apache.spark.util package?

2014-11-07 Thread ll
i'm trying to compile some of the spark code directly from the source (https://github.com/apache/spark). it complains about the missing package org.apache.spark.util. it doesn't look like this package is part of the source code on github. where can i find this package? -- View this message i

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
For ALS if you want real time recs (and usually this is order 10s to a few 100s ms response), then Spark is not the way to go - a serving layer like Oryx, or prediction.io is what you want. (At graphflow we've built our own). You hold the factor matrices in memory and do the dot product in
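
A toy sketch of that in-memory scoring, assuming the factor matrices have already been collected off the model (e.g. via model.productFeatures.collect()) so no Spark is involved at serving time:

    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    // score every item for one user vector and keep the n best
    def topN(userVec: Array[Double],
             itemFactors: Array[(Int, Array[Double])],
             n: Int): Array[(Int, Double)] =
      itemFactors
        .map { case (itemId, itemVec) => (itemId, dot(userVec, itemVec)) }
        .sortBy(-_._2)
        .take(n)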

Re: where is the org.apache.spark.util package?

2014-11-07 Thread ll
i found the util package under the spark core package, but now i get this error: "Symbol Utils is inaccessible from this place". what does this error mean? the org.apache.spark.util package and org.apache.spark.util.Utils are there now. thanks. -- View this message in context: http://apache-spark-user-li

Re: deploying a model built in mllib

2014-11-07 Thread chirag lakhani
Thanks for letting me know about this, it looks pretty interesting. From reading the documentation it seems that the server must be built on a Spark cluster, is that correct? Is it possible to deploy it in on a Java server? That is how we are currently running our web app. On Tue, Nov 4, 2014

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
hi nick.. sorry about the confusion. originally i had a question specifically about word2vec, but my follow up question on distributed model is a more general question about saving different types of models. on distributed model, i was hoping to implement a model parallelism, so that different wo

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
yep, but that's only if they are already represented as RDDs. which is much more convenient for saving and loading. my question is for the use case that they are not represented as RDDs yet. then, do you think it makes sense to convert them into RDDs, just for the convenience of saving and load

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
thanks nick. i'll take a look at oryx and prediction.io. re: private val model in word2vec ;) yes, i couldn't wait so i just changed it in the word2vec source code. but i'm running into some compilation issues now. hopefully i can fix it soon, to get this thing going. On Fri, Nov 7, 2014 a

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Sure - in theory this sounds great. But in practice it's much faster and a whole lot simpler to just serve the model from a single instance in memory. Optionally you can multithread within that (as Oryx 1 does). There are very few real-world use cases where the model is so large that it HAS to b

Re: AVRO specific records

2014-11-07 Thread Simone Franzini
Ok, that turned out to be a dependency issue with Hadoop1 vs. Hadoop2 that I have not fully solved yet. I am able to run with Hadoop1 and AVRO in standalone mode but not with Hadoop2 (even after trying to fix the dependencies). Anyway, I am now trying to write to AVRO, using a very similar snippet

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Reza Zadeh
If you have a very large and very sparse matrix represented as (i, j, value) entries, then you can try the algorithms mentioned in the post brought up earlier. Reza On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh wrote: > thanks reza.
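
For concreteness, a naive join-based sketch of multiplying two matrices stored as (i, j, value) entries; this is not one of the optimized block approaches, and it is only reasonable when both matrices are extremely sparse:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // A holds (i, k, a_ik), B holds (k, j, b_kj); C(i, j) = sum over k of a_ik * b_kj
    def multiply(a: RDD[(Int, Int, Double)],
                 b: RDD[(Int, Int, Double)]): RDD[(Int, Int, Double)] = {
      val aByK = a.map { case (i, k, v) => (k, (i, v)) }
      val bByK = b.map { case (k, j, w) => (k, (j, w)) }
      aByK.join(bByK)                                         // pair up entries that share k
        .map { case (_, ((i, v), (j, w))) => ((i, j), v * w) }
        .reduceByKey(_ + _)                                   // sum the partial products per (i, j)
        .map { case ((i, j), sum) => (i, j, sum) }
    }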

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Simon Chan
Just want to elaborate more on Duy's suggestion on using PredictionIO. PredictionIO will store the model automatically if you return it in the training function. An example using CF: def train(data: PreparedData): PersistentMatrixFactorizationModel = { val m = ALS.train(data.ratings, ap.rank

spark-submit inside script... need some bash help

2014-11-07 Thread Koert Kuipers
i need to run spark-submit inside a script with options that are built up programmatically. oh and i need to use exec to keep the same pid (so it can run as a service and be killed). this is what i tried: == #!/bin/bash -e SPARK_SUBMIT=/usr/loca

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Michael Armbrust
Perhaps if you can describe what you are trying to accomplish at a high level, it'll be easier to help. On Fri, Nov 7, 2014 at 12:28 AM, Jahagirdar, Madhu < madhu.jahagir...@philips.com> wrote: > Any idea on this? > > From: Jahagirdar, Madhu > Sent: Thursday,

partitioning to speed up queries

2014-11-07 Thread Gordon Benjamin
Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table that I've also cached in memory on a per day basis FROM customers_cached INSERT OVERWRI

Multiple Applications(Spark Contexts) Concurrently Fail With Broadcast Error

2014-11-07 Thread ryaminal
We are unable to run more than one application at a time using Spark 1.0.0 on CDH5. We submit two applications using two different SparkContexts on the same Spark Master. The Spark Master was started using the following command and parameters and is running in standalone mode: > /usr/java/jdk1.7.0

Still struggling with building documentation

2014-11-07 Thread Alessandro Baretta
I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive one: mvn verify. So now I have scaladocs for each package, but not for the whole spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed the whole build

jsonRdd and MapType

2014-11-07 Thread boclair
I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)). I'd like some of the json fields to be in a MapType rather than a sub StructType, as the keys will be very sparse. For example: > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRD

Re: Still struggling with building documentation

2014-11-07 Thread Nicholas Chammas
I believe the web docs need to be built separately according to the instructions here. Did you give those a shot? It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the m

Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Tathagata Das
I am not aware of any obvious existing pattern that does exactly this. Generally, this sort of computation (subsetting, denormalization) sounds generic, but such terms actually carry very specific requirements, which makes it hard to point to a design pattern without more requirement info. If you want

Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian, Right now, MapType is not supported in the StructType provided to jsonRDD/jsonFile. We will add the support. I have created https://issues.apache.org/jira/browse/SPARK-4302 to track this issue. Thanks, Yin On Fri, Nov 7, 2014 at 3:41 PM, boclair wrote: > I'm loading json into spa

Re: deploying a model built in mllib

2014-11-07 Thread Donald Szeto
Hi Chirag, Could you please provide more information on your Java server environment? Regards, Donald On Fri, Nov 7, 2014 at 9:57 AM, chirag lakhani wrote: > Thanks for letting me know about this, it looks pretty interesting. From > reading the documentation it seems that the server must be

Re: Parallelize on spark context

2014-11-07 Thread _soumya_
Naveen, Don't be worried - you're not the only one to be bitten by this. A little inspection of the Javadoc told me you have this other option: JavaRDD distData = sc.parallelize(data, 100); -- Now the RDD is split into 100 partitions. -- View this message in context: http://apache-spark-us

spark streaming: stderr does not roll

2014-11-07 Thread Nguyen, Duc
We are running spark streaming jobs (version 1.1.0). After a sufficient amount of time, the stderr file grows until the disk is full at 100% and crashes the cluster. I've read this https://github.com/apache/spark/pull/895 and also read this http://spark.apache.org/docs/latest/configuration.html#

Integrating Spark with other applications

2014-11-07 Thread gtinside
Hi , I have been working on Spark SQL and want to expose this functionality to other applications. Idea is to let other applications to send sql to be executed on spark cluster and get the result back. I looked at spark job server (https://github.com/ooyala/spark-jobserver) but it provides a RESTf

Re: Integrating Spark with other applications

2014-11-07 Thread Thomas Risberg
Hi, I'm a committer on that spring-hadoop project and I'm also interested in integrating Spark with other Java applications. I would love to see some guidance from the Spark community for the best way to accomplish this. We have plans to add features to work with Spark Apps in similar ways we now

Re: error when importing HiveContext

2014-11-07 Thread Davies Liu
bin/pyspark will set up the PYTHONPATH for py4j for you; otherwise you need to set it up yourself: export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip On Fri, Nov 7, 2014 at 8:15 AM, Pagliari, Roberto wrote: > I’m getting this error when importing hive context > > > from pyspark.sql impo

spark context not defined

2014-11-07 Thread Pagliari, Roberto
I'm running the latest version of spark with Hadoop 1.x and scala 2.9.3 and hive 0.9.0. When using python 2.7 from pyspark.sql import HiveContext sqlContext = HiveContext(sc) I'm getting 'sc not defined' On the other hand, I can see 'sc' from pyspark CLI. Is there a way to fix it?

Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Anand Iyer
Hi TD, This is a common pattern that is emerging today: Kafka --> SS --> Kafka. Spark Streaming comes with a built-in consumer to read from Kafka. It would be great to have an easy way for users to write back to Kafka without having to code a custom producer using the Kafka Producer APIs. Are
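
Until such a helper exists, a hedged sketch of the custom-producer route with the Kafka 0.8 producer API, creating one producer per partition inside foreachRDD (broker list, topic and the String payloads are placeholders):

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
    import org.apache.spark.streaming.dstream.DStream

    def writeToKafka(stream: DStream[String], brokers: String, topic: String): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val props = new Properties()
          props.put("metadata.broker.list", brokers)
          props.put("serializer.class", "kafka.serializer.StringEncoder")
          val producer = new Producer[String, String](new ProducerConfig(props))  // one producer per partition
          records.foreach(msg => producer.send(new KeyedMessage[String, String](topic, msg)))
          producer.close()
        }
      }
    }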

MatrixFactorizationModel serialization

2014-11-07 Thread Dariusz Kobylarz
I am trying to persist MatrixFactorizationModel (Collaborative Filtering example) and use it in another script to evaluate/apply it. This is the exception I get when I try to use a deserialized model instance: Exception in thread "main" java.lang.NullPointerException at org.apache.spark.rdd

Re: MatrixFactorizationModel serialization

2014-11-07 Thread Sean Owen
Serializable like a Java object? no, it's an RDD. A factored matrix model is huge, unlike most models, and is not a local object. You can of course persist the RDDs to storage manually and read them back. On Fri, Nov 7, 2014 at 11:33 PM, Dariusz Kobylarz wrote: > I am trying to persist MatrixFact

SparkPi endlessly in "yarnAppState: ACCEPTED"

2014-11-07 Thread YaoPau
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output after submitting the SparkPi example in yarn cluster mode (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html) using: spark-submit --class org.apache.spark.examples.Sp

Re: SparkPi endlessly in "yarnAppState: ACCEPTED"

2014-11-07 Thread jayunit100
Sounds like no free YARN workers. i.e. try running: hadoop-mapreduce-examples-2.1.0-beta.jar pi 1 1 We have some smoke tests which you might find particularly useful for YARN clusters as well in https://github.com/apache/bigtop, underneath bigtop-tests/smoke-tests, which are generally good to

Re: PySpark issue with sortByKey: "IndexError: list index out of range"

2014-11-07 Thread Davies Liu
Could you tell how large is the data set? It will help us to debug this issue. On Thu, Nov 6, 2014 at 10:39 AM, skane wrote: > I don't have any insight into this bug, but on Spark version 1.0.0 I ran into > the same bug running the 'sort.py' example. On a smaller data set, it worked > fine. On a

Spark 1.1.0 Can not read snappy compressed sequence file

2014-11-07 Thread Stéphane Verlet
I first saw this using SparkSQL but the result is the same with plain Spark. 14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z at org.apache.hadoop.util.NativeCodeLoader.buildS

How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell: val table = sc.textFile(args(1)) val histMap = collection.mutable.Map[Int,Int]() for (x <- table) { val tuple = x.split('|') histMap.put(tuple(0).toInt, 1) } Why is histMap still null? Is there something wrong with my code? Thanks

Re: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread ll
hi. i did use local[8] as below, but it still ran on only 1 core. val sc = new SparkContext(new SparkConf().setMaster("local[8]").setAppName("abc")) any advice is much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using

RE: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread Ganelin, Ilya
To set the number of Spark cores used you must set two parameters in the actual spark-submit script: num-executors (the number of executors to launch) and executor-cores (the number of cores per executor). Please see the Spark configuration and tuning pages for more details. -Origi
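
In local[N] mode, a separate thing worth checking is whether the RDD actually has more than one partition; a single-partition RDD will keep only one core busy no matter how many cores the master URL allows. A small sketch with made-up data:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[8]").setAppName("abc"))
    val rdd = sc.parallelize(1 to 1000000, 8)   // ask for 8 partitions explicitly
    println(rdd.partitions.size)                // confirm how many partitions exist
    println(rdd.map(_ * 2).count())             // 8 partitions can run on 8 cores in parallel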

Re: How to add elements into map?

2014-11-07 Thread lalit1303
It doesn't work that way. Following is the correct way: val table = sc.textFile(args(1)) val histMap = table.map(x => (x.split('|')(0).toInt, 1)) - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-e
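
If the goal is a histogram available on the driver (as in the original question), the pairs can then be aggregated and collected; a small extension of the snippet above, runnable as-is in the spark-shell:

    val histMap = table
      .map(x => (x.split('|')(0).toInt, 1))
      .reduceByKey(_ + _)     // count occurrences of each key across the cluster
      .collectAsMap()         // bring the (key -> count) map back to the driver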

Re: Viewing web UI after fact

2014-11-07 Thread Arun Ahuja
We are running our applications through YARN and only sometimes see them in the History Server. Most do not seem to have the APPLICATION_COMPLETE file. Specifically, any job that ends because of "yarn application -kill" does not show up. For other ones, what would be a reason for them not

Re: Using partitioning to speed up queries in Shark

2014-11-07 Thread Mayur Rustagi
- dev list & + user list Shark is not officially supported anymore, so you are better off moving to Spark SQL. Shark doesn't support Hive partitioning logic anyway; it has its own version of partitioning on in-memory blocks, but that is independent of whether you partition your data in Hive or not. Mayur R