Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-02 Thread Xiangrui Meng
I reproduced the bug on master and submitted a patch for it: https://github.com/apache/spark/pull/5329. It may get into Spark 1.3.1. Thanks for reporting the bug! -Xiangrui On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa wrote: > Hmm, I got the same error with the master. Here is another test

Re: Starting httpd: http: Syntax error on line 154

2015-04-02 Thread Ganon Pierce
Now I am receiving: Starting httpd: httpd: Syntax error on line 199 of /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server: /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such file or directory > On Apr 2, 2015, at 1:05 AM, Ganon Pierce wrote: > >

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Xiangrui Meng
I think before 1.3 you also get stackoverflow problem in > ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1,
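For readers finding this later, a minimal sketch of the 1.3.1+ usage Xiangrui describes (assuming the MLlib ALS builder exposes setCheckpointInterval as stated, and an existing SparkContext sc; a checkpoint directory must also be set):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // checkpointing needs a directory

    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(",")
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    val model = new ALS()
      .setRank(10)
      .setIterations(80)
      .setCheckpointInterval(10)   // truncate the lineage every 10 iterations to avoid the stack overflow
      .run(ratings)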

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
Thanks Xiangrui, I used 80 iterations to demonstrate the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, "Xiangrui Meng" wrote: > I think before 1.3 you also get stackoverflow problem in > ~35 > iterations. In 1.3.x, please use setCheckpointInterval to solve t

Re: Streaming anomaly detection using ARIMA

2015-04-02 Thread Sean Owen
This "inside out" parallelization has been a way people have used R with MapReduce for a long time. Run N copies of an R script on the cluster, on different subsets of the data, babysat by Mappers. You just need R installed on the cluster. Hadoop Streaming makes this easy and things like RDD.pipe i

Re: pyspark hbase range scan

2015-04-02 Thread gen tang
Hi, Maybe this might be helpful: https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala Cheers Gen On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel wrote: > I am attempting to read an hbase table in pyspark with a range scan. > > conf = { > "hbase.zoo

Re: Spark, snappy and HDFS

2015-04-02 Thread Sean Owen
Yes, any Hadoop-related process that asks for Snappy compression or needs to read it will have to have the Snappy libs available on the library path. That's usually set up for you in a distro or you can do it manually like this. This is not Spark-specific. The second question also isn't Spark-spec

JAVA_HOME problem

2015-04-02 Thread 董帅阳
spark 1.3.0 spark@pc-zjqdyyn1:~> tail /etc/profile export JAVA_HOME=/usr/jdk64/jdk1.7.0_45 export PATH=$PATH:$JAVA_HOME/bin # # End of /etc/profile # But ERROR LOG Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454 ==

Re: Spark throws rsync: change_dir errors on startup

2015-04-02 Thread Horsmann, Tobias
Hi, Verbose output showed no additional information about the origin of the error rsync from right sending incremental file list sent 20 bytes received 12 bytes 64.00 bytes/sec total size is 0 speedup is 0.00 starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark130/sbin/

How to learn Spark ?

2015-04-02 Thread Star Guo
Hi, all, I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo

how to find near duplicate items from given dataset using spark

2015-04-02 Thread Somnath Pandeya
Hi All, I want to find near-duplicate items from a given dataset. For example, consider a data set:
1. Cricket,bat,ball,stumps
2. Cricket,bowler,ball,stumps,
3. Football,goalie,midfielder,goal
4. Football,refree,midfielder,goal,
Here 1 and 2 are near duplicates (only field 2 is

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Nick Pentreath
Fair enough but I'd say you hit that diminishing return after 20 iterations or so... :) On Thu, Apr 2, 2015 at 9:39 AM, Justin Yip wrote: > Thanks Xiangrui, > > I used 80 iterations to demonstrates the marginal diminishing return in > prediction quality :) > > Justin > On Apr 2, 2015 00:16, "Xia

Re: HiveContext setConf seems not stable

2015-04-02 Thread Hao Ren
Hi, Jira created: https://issues.apache.org/jira/browse/SPARK-6675 Thank you. On Wed, Apr 1, 2015 at 7:50 PM, Michael Armbrust wrote: > Can you open a JIRA please? > > On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren wrote: > >> Hi, >> >> I find HiveContext.setConf does not work correctly. Here are s

Setup Spark jobserver for Spark SQL

2015-04-02 Thread Harika
Hi, I am trying to use Spark Jobserver ( https://github.com/spark-jobserver/spark-jobserver ) for running Spark SQL jobs. I was able to start the server, but when I run my application (my Scala class which extends SparkSqlJob), I am getting the follo

Support for Data flow graphs and not DAG only

2015-04-02 Thread anshu shukla
Hey, I didn't find any documentation regarding support for cycles in a Spark topology, although Storm supports this using manual configuration in the acker function logic (setting it to a particular count). By cycles I don't mean infinite loops. -- Thanks & Regards, Anshu Shukla

Re: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

2015-04-02 Thread Wollert, Fabian
Hey Christopher, I'm working with Teng on this issue. Thank you for the explanation. I tried both workarounds: just leaving hive.metastore.warehouse.dir empty is not doing anything. Still the tmp data is written to S3 and the job attempts to rename/copy+delete from S3 to S3. But anyway, since the

Re: How to learn Spark ?

2015-04-02 Thread luohui20001
The best way of learning Spark is to use Spark. You may follow the instructions on the Apache Spark website: http://spark.apache.org/docs/latest/ Download -> deploy it in standalone mode -> run some examples -> try cluster deploy mode -> then try to develop your own app and deploy it in your Spark cluster.

[SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Hi, I want to rename an aggregation field using DataFrame API. The aggregation is done on a nested field. But I got below exception. Do you see the same issue and any workaround? Thank you very much! == Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve colu

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
NO, I’m referring to the result. You mean there might be so many zero features in the ALS result? I think it is not related to the initial state, but I do not know why the percentage of zero vectors is so high (around 50%). > On Apr 2, 2015, at 6:08 PM, Sean Owen wrote: > > You're referring to the initializ

Re: Error reading smallin in hive table with parquet format

2015-04-02 Thread Masf
No, my company is using Cloudera distributions and 1.2.0 is the latest version of Spark. Thanks On Wed, Apr 1, 2015 at 8:08 PM, Michael Armbrust wrote: > Can you try with Spark 1.3? Much of this code path has been rewritten / > improved in this version. > > On Wed, Apr 1, 2015 at 7:53 AM

Re: How to learn Spark ?

2015-04-02 Thread prabeesh k
You can also refer to this blog: http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo wrote: > Hi, all > > > > I am new to here. Could you give me some suggestion to learn Spark ? > Thanks. > > > > Best Regards, > > Star Guo >

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread Sean Owen
Right, I asked because in your original message, you were looking at the initialization to a random vector. But that is the initial state, not final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong wrote: > NO, I’m referring to the result. > you means there might be so many zero features in the

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
Oh, I found the reason. According to the ALS optimization formula: if all of a user's ratings are zero, that is, R(i, Ii) is a zero matrix, then the final result feature of this user will be an all-zero vector… > On Apr 2, 2015, at 6:08 PM, Sean Owen wrote: > > You're referring to the initialization, not

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
yes! thank you very much :-) > On Apr 2, 2015, at 7:13 PM, Sean Owen wrote: > > Right, I asked because in your original message, you were looking at > the initialization to a random vector. But that is the initial state, > not final state. > > On Thu, Apr 2, 2015 at 11:51 AM, lisendong wrote: >> NO, I’m r

Mllib kmeans #iteration

2015-04-02 Thread podioss
Hello, I am running the KMeans algorithm in cluster mode from MLlib and I was wondering if I could run the algorithm with a fixed number of iterations in some way. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353.html Sen

Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Hi, We have a case that we will have to run concurrent jobs (for the same algorithm) on different data sets. And these jobs can run in parallel and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection pooling.

Re: Connection pooling in spark jobs

2015-04-02 Thread Ted Yu
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri wrote: > > Hi, > > We have a case that we will have to run concurrent jobs (for the same > algorithm) on different data sets. An

Matrix Transpose

2015-04-02 Thread Spico Florin
Hello! I have a CSV file that has the following content:
C1;C2;C3
11;22;33
12;23;34
13;24;35
What is the best approach to use Spark (API, MLlib) for achieving the transpose of it?
C1 11 12 13
C2 22 23 24
C3 33 34 35
I look forward to your solutions and suggestions (some Scala code will be rea

Re: How to learn Spark ?

2015-04-02 Thread Vadim Bichutskiy
You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo wrote: > Hi, all > > > > I am new to here. Could you give me some suggestion to learn Spark ? > Thanks.

Error in SparkSQL/Scala IDE

2015-04-02 Thread Sathish Kumaran Vairavelu
Hi Everyone, I am getting following error while registering table using Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1 import sqlContext.createSchemaRDD val empFile = sc.textFile("/tmp/emp.csv", 4) .map ( _.split(",") )

Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Right, I am aware on how to use connection pooling with oracle, but the specific question is how to use it in the context of spark job execution On 2 Apr 2015 17:41, "Ted Yu" wrote: > http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm > > The question doesn't seem to be Spark specif

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Dean Wampler
It failed to find the class org.apache.spark.sql.catalyst.ScalaReflection in the Spark SQL library. Make sure it's in the classpath and the version is correct, too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Types

Re: How to learn Spark ?

2015-04-02 Thread Dean Wampler
I have a "self-study" workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://pol

A stream of json objects using Java

2015-04-02 Thread James King
I'm reading a stream of string lines that are in json format. I'm using Java with Spark. Is there a way to get this from a transformation? so that I end up with a stream of JSON objects. I would also welcome any feedback about this approach or alternative approaches. thanks jk

Re: A stream of json objects using Java

2015-04-02 Thread Sean Owen
This just reduces to finding a library that can translate a String of JSON into a POJO, Map, or other representation of the JSON. There are loads of these, like Gson or Jackson. Sure, you can easily use these in a function that you apply to each JSON string in each line of the file. It's not differ
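The thread is about Java, but the idea is the same in any language; a sketch in Scala using Jackson's tree model (the jsonLines name is illustrative, standing in for a DStream[String] or RDD[String] of one JSON document per line):

    import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

    val parsed = jsonLines.mapPartitions { lines =>
      val mapper = new ObjectMapper()            // build one mapper per partition, not per record
      lines.map(line => mapper.readTree(line))   // String -> JsonNode
    }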

RE: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-02 Thread Ashley Rose
That’s precisely what I was trying to check. It should have 42577 records in it, because that’s how many there were in the text file I read in. // Load a text file and convert each line to a JavaBean. JavaRDD lines = sc.textFile("file.txt"); JavaRDD tbBER = lines.map(s ->

From DataFrame to LabeledPoint

2015-04-02 Thread drarse
Hello! I have a question from a few days ago. I am working with DataFrames, and with Spark SQL I imported a JSON file: val df = sqlContext.jsonFile("file.json") In this JSON I have the label and the features. I selected them: val features = df.select("feature1","feature2","feature3",...); val labe

Re: Setup Spark jobserver for Spark SQL

2015-04-02 Thread Daniel Siegmann
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika wrote: > Hi,

Re: Connection pooling in spark jobs

2015-04-02 Thread Cody Koeninger
Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition rdd.mapPartition { part => setupPool part.map { ... See "Design Patterns for using foreachRDD" i
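A bare-bones sketch of the per-executor setup Cody describes, assuming an existing rdd and a JDBC driver on the executor classpath (the URL and credentials are illustrative; a real job would put a proper pool such as HikariCP or DBCP behind the singleton):

    import java.sql.{Connection, DriverManager}

    // One lazily created connection per executor JVM, shared across partitions.
    object ConnectionHolder {
      lazy val conn: Connection =
        DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "pass")
    }

    rdd.foreachPartition { records =>
      val conn = ConnectionHolder.conn          // initialised once per executor, not per record
      records.foreach { r =>
        // ... read/write for r using conn ...
      }
    }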

Spark streaming error in block pushing thread

2015-04-02 Thread Bill Young
I am running a standalone Spark streaming cluster, connected to multiple RabbitMQ endpoints. The application will run for 20-30 minutes before raising the following error: WARN 2015-04-01 21:00:53,944 > org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove > RDD 22 - Ask time

A problem with Spark 1.3 artifacts

2015-04-02 Thread Jacek Lewandowski
A very simple example which works well with Spark 1.2, and fail to compile with Spark 1.3: build.sbt: name := "untitled" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" Test.scala: package org.apache.spark.metrics import org.apache.s

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Peter Rudenko
Hi, try this code: val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map { case Row(feature1, feature2, ..., label) => LabeledPoint(label, Vectors.dense(feature1, feature2, ...)) } Thanks, Peter Rudenko On 2015-04-02 17:17, drarse wrote: Hello!, I have a questions since days ago.

Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-02 Thread Dean Wampler
To clarify one thing, is count() the first "action" ( http://spark.apache.org/docs/latest/programming-guide.html#actions) you're attempting? As defined in the programming guide, an action forces evaluation of the pipeline of RDDs. It's only then that reading the data actually occurs. So, count() mi
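In other words, a tiny illustration of the lazy-evaluation point (file name and parsing are illustrative):

    val lines = sc.textFile("file.txt")        // no I/O happens yet
    val rows  = lines.map(_.split("\\t"))      // still nothing: map is a lazy transformation
    val n     = rows.count()                   // first action: the file is actually read here,
                                               // so read/parse errors surface at this line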

Re: How to learn Spark ?

2015-04-02 Thread Star Guo
Thank you! I will begin with it. Best Regards, Star Guo I have a "self-study" workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re How to learn Spark ?

2015-04-02 Thread Star Guo
So cool !! Thanks. Best Regards, Star Guo = You can also refer this blog http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo wrote: Hi, all I am new to here. Could you give me some suggestion to learn Spark ? Tha

Spark Streaming Worker runs out of inodes

2015-04-02 Thread andrem
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receive

Re: Re:How to learn Spark ?

2015-04-02 Thread Star Guo
Thanks a lot. I will follow your suggestion. Best Regards, Star Guo = The best way of learning spark is to use spark you may follow the instruction of apache spark website.http://spark.apache.org/docs/latest/ download->deploy it in standalone mode->run som

Re: workers no route to host

2015-04-02 Thread Dean Wampler
It appears you are using a Cloudera Spark build, 1.3.0-cdh5.4.0-SNAPSHOT, which expects to find the hadoop command: /data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop: command not found If you don't want to use Hadoop, download one of the pre-built Spark releases from spark.ap

conversion from java collection type to scala JavaRDD

2015-04-02 Thread Jeetendra Gangele
Hi All, Is there a way to make a JavaRDD from an existing Java collection type List? I know this can be done using Scala, but I am looking for how to do this using Java. Regards Jeetendra

Re: Spark, snappy and HDFS

2015-04-02 Thread Nick Travers
Thanks all. I was able to get the decompression working by adding the following to my spark-env.sh script: export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native export SPARK_LIBRARY_PATH=$SPARK_LIBRAR

Re: conversion from java collection type to scala JavaRDD

2015-04-02 Thread Dean Wampler
Use JavaSparkContext.parallelize. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe

Spark Streaming Error in block pushing thread

2015-04-02 Thread byoung
I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out

Spark SQL. Memory consumption

2015-04-02 Thread Masf
Hi. I'm using Spark SQL 1.2. I have this query: CREATE TABLE test_MA STORED AS PARQUET AS SELECT field1 ,field2 ,field3 ,field4 ,field5 ,COUNT(1) AS field6 ,MAX(field7) ,MIN(field8) ,SUM(field9 / 100) ,COUNT(field10) ,SUM(IF(field11 < -500, 1, 0)) ,MAX(field12) ,SUM(IF(field13 = 1, 1, 0)) ,SUM(I

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Dean Wampler
Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need "local[3]" at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied: spark.executor.memory 21475m spark.cores.max 16 spark.driver.memory 5235m On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler wrote: > Are you allocating 1 c

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
Sorry for the obvious typo, I have 4 workers with 16 cores total* On Thu, Apr 2, 2015 at 11:56 AM, Bill Young wrote: > Thank you for the response, Dean. There are 2 worker nodes, with 8 cores > total, attached to the stream. I have the following settings applied: > > spark.executor.memory 21475m

Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-02 Thread Wang, Ningjun (LNG-NPV)
I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, Spark writes to this folder. I found that the disk space of this folder keeps increasing quickly and at a certain point I will run out of disk space. I wonder whether Spark cleans up the disk space in this folder once the shuffle

RE: Spark SQL. Memory consumption

2015-04-02 Thread Cheng, Hao
Spark SQL tries to load the entire partition data and organize it as in-memory HashMaps; it does eat large memory if there are not many duplicated group-by keys with a large number of records. A couple of things you can try case by case: · Increasing the partition number (the records count in
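For reference, the knob usually behind "increasing the partition number" for SQL aggregations is spark.sql.shuffle.partitions; a sketch, assuming an existing sqlContext (the value is illustrative):

    // Default is 200 post-shuffle partitions; raising it means each aggregation
    // partition holds fewer rows, so the in-memory hash maps stay smaller.
    sqlContext.setConf("spark.sql.shuffle.partitions", "800")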

Re: How to learn Spark ?

2015-04-02 Thread Star Guo
Yes, I just searched for it! Best Regards, Star Guo == You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo wrote:

Spark + Kinesis

2015-04-02 Thread Vadim Bichutskiy
Hi all, I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify build.sbt: --- import AssemblyKeys._ name := "Kinesis Consumer" version := "1.0" organization := "com.myconsumer" scalaVersion := "2.11.5"

Re: How to learn Spark ?

2015-04-02 Thread Vadim Bichutskiy
Thanks Dean. This is great. On Thu, Apr 2, 2015 at 9:01 AM, Dean Wampler wrote: > I have a "self-study" workshop here: > > https://github.com/deanwampler/spark-workshop > > dean > > Dean Wampler, Ph.D. > Author: Programming Scala, 2nd Edition >

Re: How to learn Spark ?

2015-04-02 Thread Slim Baltagi
Hi, I maintain an Apache Spark Knowledge Base at http://www.SparkBigData.com with over 4,000 related web resources. You can check the ‘Quick Start’ section at http://sparkbigdata.com/tutorials. There are plenty of tutorials and examples to start with after you decide what you would like to use:

Re: Spark + Kinesis

2015-04-02 Thread Kelly, Jonathan
It looks like you're attempting to mix Scala versions, so that's going to cause some problems. If you really want to use Scala 2.11.5, you must also use Spark package versions built for Scala 2.11 rather than 2.10. Anyway, that's not quite the correct way to specify Scala dependencies in build
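A sketch of a build.sbt along the lines Jonathan suggests, staying on Scala 2.10 (which the prebuilt Spark 1.3.0 artifacts target); the %% operator appends the Scala binary version so all modules match scalaVersion, and the exact versions here are illustrative:

    name := "Kinesis Consumer"

    version := "1.0"

    organization := "com.myconsumer"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
    )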

Re: How to learn Spark ?

2015-04-02 Thread Dean Wampler
You're welcome. Two limitations to know about: 1. I haven't updated it to 1.3 2. It uses Scala for all examples (my bias ;), so less useful if you don't want to use Scala. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Type

Re: Spark SQL. Memory consumption

2015-04-02 Thread Vladimir Rodionov
>> Using large memory for executors (--executor-memory 120g). Not really good advice. On Thu, Apr 2, 2015 at 9:17 AM, Cheng, Hao wrote: > Spark SQL tries to load the entire partition data and organized as > In-Memory HashMaps, it does eat large memory if there are not many > duplicated gro

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Dean Wampler
I misread that you're running in standalone mode, so ignore the "local[3]" example ;) How many separate readers are listening to rabbitmq topics? This might not be the problem, but I'm just eliminating possibilities. Another possibility is that the in-bound data rate exceeds your ability to proce

Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-02 Thread Todd Nist
Hi Akhil, Tried your suggestion to no avail. I actually do not see any "jackson" or "json serde" jars in the $HIVE/lib directory. This is Hive 0.13.1 and Spark 1.2.1. Here is what I did: I have added the lib folder to the --jars option when starting the spark-shell, but the job fails. The hive-s

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
If it’s a flat binary file and each record is the same length (in bytes), you can use Spark’s binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here’s an example in python to show how it works: > # write data from an ar
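The same idea in Scala (record layout, byte order, and the path are illustrative):

    import java.nio.{ByteBuffer, ByteOrder}

    // Suppose every record is three little-endian Ints, i.e. 12 bytes.
    val recordLength = 12
    val records = sc.binaryRecords("hdfs:///data/values.bin", recordLength)   // RDD[Array[Byte]]

    val triples = records.map { bytes =>
      val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
      (buf.getInt, buf.getInt, buf.getInt)
    }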

Re: Spark + Kinesis

2015-04-02 Thread Vadim Bichutskiy
Thanks Jonathan. Helpful. VB > On Apr 2, 2015, at 1:15 PM, Kelly, Jonathan wrote: > > It looks like you're attempting to mix Scala versions, so that's going to > cause some problems. If you really want to use Scala 2.11.5, you must also > use Spark package versions built for Scala 2.11 rath

Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
I was trying a simple test from the spark-shell to see if 1.3.0 would address a problem I was having with locating the json_tuple class and got the following error: scala> import org.apache.spark.sql.hive._ import org.apache.spark.sql.hive._ scala> val sqlContext = new HiveContext(sc)sqlContext:

Spark 1.3 UDF ClassNotFoundException

2015-04-02 Thread ganterm
Hello, I started to use the dataframe API in Spark 1.3 with Scala. I am trying to implement a UDF and am following the sample here: https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.UserDefinedFunction meaning val predict = udf((score: Double) => if (score > 0.5) tr
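For reference, the complete shape of the 1.3 DataFrame UDF usage being followed here (a sketch assuming a DataFrame df with a Double column named "score"):

    import org.apache.spark.sql.functions.udf

    val predict = udf((score: Double) => score > 0.5)

    val scored = df.withColumn("prediction", predict(df("score")))
    scored.show()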

Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread ogoh
Hello, My ETL uses sparksql to generate parquet files which are served through Thriftserver using hive ql. It especially defines a schema programmatically since the schema can only be known at runtime. With spark 1.2.1, it worked fine (followed https://spark.apache.org/docs/latest/sql-programming

Simple but faster data streaming

2015-04-02 Thread Harut Martirosyan
Hi guys. Is there a more lightweight way of doing stream processing with Spark? What we want is a simpler way, preferably with no scheduling, which just streams the data to multiple destinations. We extensively use Spark Core, SQL, Streaming, and GraphX, so it's our main tool and we don't want to add new thin

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than actual computation time). Also, I am short of memory at the driver when it has to

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Joseph Bradley
Peter's suggestion sounds good, but watch out for the match case since I believe you'll have to match on: case (Row(feature1, feature2, ...), Row(label)) => On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko wrote: > Hi try next code: > > val labeledPoints: RDD[LabeledPoint] = features.zip(labels).

Re: Mllib kmeans #iteration

2015-04-02 Thread Joseph Bradley
Check out the Spark docs for that parameter: *maxIterations* http://spark.apache.org/docs/latest/mllib-clustering.html#k-means On Thu, Apr 2, 2015 at 4:42 AM, podioss wrote: > Hello, > i am running the Kmeans algorithm in cluster mode from Mllib and i was > wondering if i could run the algorithm
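A minimal sketch (data path and k are illustrative); note that maxIterations is an upper bound, since k-means can also stop earlier once it converges:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("hdfs:///data/kmeans_data.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    val model = KMeans.train(data, 10, 20)   // k = 10, maxIterations = 20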

RE: Date and decimal datatype not working

2015-04-02 Thread BASAK, ANANDA
Thanks all. Finally I am able to run my code successfully. It is running in Spark 1.2.1. I will try it on Spark 1.3 too. The major cause of all errors I faced was that the delimiter was not correctly declared. val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).m
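For anyone hitting the same delimiter issue: String.split takes a regular expression, so a literal pipe has to be escaped (or passed as a Char), otherwise it splits between every character.

    "a|b|c".split("|")     // wrong: "|" is regex alternation, splits between every character
    "a|b|c".split("\\|")   // Array(a, b, c)
    "a|b|c".split('|')     // Char overload, no escaping needed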

Re: persist(MEMORY_ONLY) takes lot of time

2015-04-02 Thread Christian Perez
+1. Caching is way too slow. On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti wrote: > Hi Experts, > > I have a parquet dataset of 550 MB ( 9 Blocks) in HDFS. I want to run SQL > queries repetitively. > > Few questions : > > 1. When I do the below (persist to memory after reading from disk), it takes

Mesos - spark task constraints

2015-04-02 Thread Ankur Chauhan
Hi, I am trying to figure out how to run Spark jobs on a Mesos cluster. The Mesos cluster has some nodes that have Tachyon installed, and I would like the Spark jobs to be started only on those nodes. Each of these nodes has been configure

Need a spark mllib tutorial

2015-04-02 Thread Phani Yadavilli -X (pyadavil)
Hi, I am new to Spark MLlib and I was browsing the internet for good tutorials more advanced than the Spark documentation examples, but I cannot find any. Need help. Regards Phani Kumar

Re: input size too large | Performance issues with Spark

2015-04-02 Thread Christian Perez
To Akhil's point, see Tuning Data structures. Avoid standard collection hashmap. With fewer machines, try running 4 or 5 cores per executor and only 3-4 executors (1 per node): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Ought to reduce shuffle performance hit
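The same sizing expressed through SparkConf rather than spark-submit flags (values follow the suggestion above; spark.executor.instances applies on YARN, and the memory figure is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "4")        // 4 cores per executor
      .set("spark.executor.instances", "4")    // roughly one executor per node (YARN)
      .set("spark.executor.memory", "8g")      // illustrative; size to the nodes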

Re: Spark 1.3 UDF ClassNotFoundException

2015-04-02 Thread Ted Yu
Can you show more code in CreateMasterData ? How do you run your code ? Thanks On Thu, Apr 2, 2015 at 11:06 AM, ganterm wrote: > Hello, > > I started to use the dataframe API in Spark 1.3 with Scala. > I am trying to implement a UDF and am following the sample here: > > https://spark.apache.or

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records correspond

Re: Need a spark mllib tutorial

2015-04-02 Thread Reza Zadeh
Here's one: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html Reza On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil) < pyada...@cisco.com> wrote: > Hi, > > > > I am new to the spark MLLib and I was browsing through the internet for > good tutorials ad

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
Hm, that will indeed be trickier because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once re

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread Michael Quinlan
I was able to hack around this on my similar setup by running (on the driver): $ sudo hostname ip where ip is the same value set in the "spark.driver.host" property. This isn't a solution I would use universally, and I hope someone can fix this bug in the distribution. Regards, Mike -- View

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread jay vyas
yup a related JIRA is here https://issues.apache.org/jira/browse/SPARK-5113 which you might want to leave a comment in. This can be quite tricky we found ! but there are a host of env variable hacks you can use when launching spark masters/slaves. On Thu, Apr 2, 2015 at 5:18 PM, Michael Quinl

RE: Spark SQL. Memory consumption

2015-04-02 Thread java8964
It is hard to say what the reason could be without more detailed information. If you provide some more information, maybe people here can help you better. 1) What is your worker's memory setting? It looks like your nodes have 128G of physical memory each, but what do you specify for the worker's heap

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
Do you have a full stack trace? On Thu, Apr 2, 2015 at 11:45 AM, ogoh wrote: > > Hello, > My ETL uses sparksql to generate parquet files which are served through > Thriftserver using hive ql. > It especially defines a schema programmatically since the schema can be > only > known at runtime. > W

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Michael Armbrust
This is actually a problem with our use of Scala's reflection library. Unfortunately you need to load Spark SQL using the primordial classloader, otherwise you run into this problem. If anyone from the scala side can hint how we can tell scala.reflect which classloader to use when creating the mir

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
The file has a specific structure. I outline it below. The input file is basically a representation of a graph. INT INT(A) LONG (B) A INTs(Degrees) A SHORTINTs (Vertex_Attribute) B INTs B INTs B SHORTINTs B SHORTINTs A - number of vertices B - number of edges (no

Re: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

2015-04-02 Thread Michael Armbrust
Thanks for reporting. The root cause is (SPARK-5632 ), which is actually pretty hard to fix. Fortunately, for this particular case there is an easy workaround: https://github.com/apache/spark/pull/5337 We can try to include this in 1.3.1. On Thu

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread Tathagata Das
Are you saying that even with the spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem wrote: > Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and > the worker nodes eventually run out of inodes. > We see tons of old shuf
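For context, a sketch of setting it (the value is illustrative; spark.cleaner.ttl is in seconds and triggers periodic cleanup of old metadata and associated state):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.cleaner.ttl", "3600")   // clean up anything older than one hour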

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Okehee Goh
yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.(StdNames.scala:203)

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
This looks to me like you have incompatible versions of scala on your classpath? On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh wrote: > yes, below is the stacktrace. > Thanks, > Okehee > > java.lang.NoSuchMethodError: > scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; >

RE: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Michael, thanks for the response and looking forward to try 1.3.1 From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k,

RE: Reading a large file (binary) into RDD

2015-04-02 Thread java8964
I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a

Re: Spark SQL does not read from cached table if table is renamed

2015-04-02 Thread Michael Armbrust
I'll add we just back ported this so it'll be included in 1.2.2 also. On Wed, Apr 1, 2015 at 4:14 PM, Michael Armbrust wrote: > This is fixed in Spark 1.3. > https://issues.apache.org/jira/browse/SPARK-5195 > > On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash > wrote: > >> Hi all, >> >> >> >> Noticed

RE: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread java8964
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1. If we check the code change below: Spark 1.3 branch: https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark
