Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-04-07 Thread Wojciech Indyk
Hello Divya! Have you solved the problem? I suppose the log comes from the driver. You also need to look at the logs on the worker JVMs; there could be an exception there. Do you have Kerberos on your cluster? It could be similar to the problem in http://issues.apache.org/jira/browse/SPARK-14115 Based on your

About nested RDD

2016-04-07 Thread Tenghuan He
Hi all, I know that nested RDDs are not possible, e.g. rdd1.map(x => x + rdd2.count()). I tried to create a custom RDD like the following: class MyRDD(base: RDD, part: Partitioner) extends RDD[(K, V)] { var rdds = ArrayBuffer.empty[RDD[(K, (V, Int))]] def update(rdd: RDD[_]) { rdds += rdd
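A minimal sketch of the usual workaround: materialize the dependent value (here rdd2.count()) on the driver first, then close over the plain result instead of the RDD.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("nested-rdd").setMaster("local[*]"))
    val rdd1 = sc.parallelize(1 to 10)
    val rdd2 = sc.parallelize(1 to 5)

    // Illegal: rdd2 would be referenced inside a task running on an executor.
    // rdd1.map(x => x + rdd2.count())

    // Legal: compute the count on the driver, then close over the plain Long.
    val c = rdd2.count()
    rdd1.map(x => x + c).collect().foreach(println)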

how to use udf in spark thrift server.

2016-04-07 Thread zhanghn
I want to define some UDFs in my Spark environment and serve them in the thrift server, so I can use these UDFs in my beeline connection. At first I tried starting it with udf jars and creating the functions in Hive. In spark-sql I can add temp functions like "CREATE TEMPORARY FUNCTION bsdUpper AS 'org.hue.udf
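For reference, a hedged sketch of registering a session-scoped Hive UDF from Spark; the jar path and UDF class name below are placeholders (the original message truncates the class name):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext

    // Make the jar visible, then register the UDF class for this session.
    hiveContext.sql("ADD JAR /path/to/my-udfs.jar")
    hiveContext.sql("CREATE TEMPORARY FUNCTION bsdUpper AS 'org.hue.udf.MyUpper'")
    hiveContext.sql("SELECT bsdUpper(name) FROM people").show()

The same ADD JAR / CREATE TEMPORARY FUNCTION statements can be issued from a beeline session connected to the thrift server; being session-scoped, they must be re-run per connection.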

[HELP:]Save Spark Dataframe in Phoenix Table

2016-04-07 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop cluster with the following configuration: Spark 1.5.2, HBase 1.1.x, Phoenix 4.4. I am able to connect to Phoenix through a JDBC connection and read the Phoenix tables, but while writing the data back to a Phoenix table I get the error below: org.apache.spark.sql.
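For comparison, the phoenix-spark write path looks roughly like the sketch below; the table name and zkUrl are placeholders, and the phoenix-spark integration jar must be on the classpath.

    import org.apache.spark.sql.SaveMode

    // df: the DataFrame to persist; the target table must already exist in Phoenix.
    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)            // phoenix-spark expects Overwrite
      .options(Map(
        "table" -> "OUTPUT_TABLE",         // placeholder table name
        "zkUrl" -> "zkhost:2181"))         // placeholder ZooKeeper quorum
      .save()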

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread ashesh_28
Hi, I am also attaching a screenshot of my ResourceManager UI, which shows the available cores and memory allocated for each node.

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread ashesh_28
Hi guys, thanks for your valuable input. I have tried a few alternatives as suggested, but they all lead me to the same result: unable to start the SparkContext. @Dhiraj Peechara I am able to start my SparkContext in stand-alone mode by just issuing the *$spark-shell* command from the terminal

Re: Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread Mich Talebzadeh
If you are using hiveContext to create a Hive database it will work. In general you should use Hive to create a Hive database, and create tables within the already existing Hive database from Spark. Make sure that you qualify the table with its database, as in sql("DROP TABLE IF EXISTS accounts.ll_18740868") var sqltext : S
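A minimal sketch of that pattern (the column definitions are illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    hiveContext.sql("CREATE DATABASE IF NOT EXISTS accounts")
    hiveContext.sql("DROP TABLE IF EXISTS accounts.ll_18740868")

    // Qualify the table with its database when creating it from Spark.
    var sqltext =
      """CREATE TABLE accounts.ll_18740868 (
        |  transactiondate String,
        |  transactiondescription String,
        |  debitamount Double)""".stripMargin
    hiveContext.sql(sqltext)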

MLlib ALS MatrixFactorizationModel.save fails consistently

2016-04-07 Thread Colin Woodbury
Hi all, I've implemented most of a content recommendation system for a client. However, whenever I attempt to save a MatrixFactorizationModel I've trained, I see one of four outcomes: 1. Despite "save" being wrapped in a "try" block, I see a massive stack trace quoting some java.io classes. The M
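For reference, the save/load round trip being attempted looks roughly like this (the path and hyperparameters are placeholders); save() writes model metadata plus the user/product factor files, the target path must not already exist, and on a cluster it should be a shared filesystem such as HDFS rather than a local path.

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    val training = sc.parallelize(Seq(
      Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))

    val model = ALS.train(training, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

    model.save(sc, "hdfs:///models/als-model")
    val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als-model")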

Re: ordering over structs

2016-04-07 Thread Imran Akbar
Thanks Michael, I'm trying to implement the code in PySpark like so (my dataframe has 3 columns: customer_id, dt, and product): st = StructType().add("dt", DateType(), True).add("product", StringType(), True) top = data.select("customer_id", st.alias('vs')) .groupBy("customer_id") .a
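Note the struct trick uses the SQL struct() function, not a schema StructType. A hedged Scala sketch of the equivalent query: structs compare field by field, so max over struct(dt, product) keeps the product at the latest dt per customer.

    import org.apache.spark.sql.functions.{col, max, struct}

    // data: DataFrame with columns customer_id, dt, product
    val top = data
      .groupBy("customer_id")
      .agg(max(struct("dt", "product")).as("vs"))
      .select(col("customer_id"), col("vs.dt"), col("vs.product"))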

Re: Anyone have a tutorial or guide to implement Spark + AWS + Caffe/CUDA?

2016-04-07 Thread jamborta
Hi Alfredo, I have been building something similar and found that EMR is not suitable for this, as the GPU instances don't come with NVIDIA drivers (and the bootstrap process does not allow restarting instances). The way I'm setting it up is based on the spark-ec2 script, where you can use custom AM

Re: Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread Xiao Li
Hi, Assuming you are using 1.6 or earlier, this is a native Hive command: the database creation is executed by Hive itself. Thanks, Xiao Li 2016-04-07 15:23 GMT-07:00 antoniosi : > Hi, > > I am using hiveContext.sql("create database if not exists ") to > create a hive db. Is

Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread antoniosi
Hi, I am using hiveContext.sql("create database if not exists ") to create a Hive db. Is this statement atomic? Thanks. Antonio.

Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread Ted Yu
Which Spark release are you using? Have you registered for all the events provided by SparkListener? If so, can you do an event-wise summation of execution time? Thanks On Thu, Apr 7, 2016 at 11:03 AM, JasmineGeorge wrote: > We are running a batch job with the following specifications > •
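A minimal sketch of summing task run time with a listener, assuming executor run time is the quantity to compare against wall-clock time:

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Accumulates raw task run time across the whole application.
    class TaskTimeListener extends SparkListener {
      val totalTaskTimeMs = new AtomicLong(0L)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) totalTaskTimeMs.addAndGet(m.executorRunTime)
      }
    }

    val listener = new TaskTimeListener
    sc.addSparkListener(listener)
    // ... run the job, then compare listener.totalTaskTimeMs.get()
    // against the job's wall-clock duration.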

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread JasmineGeorge
The logs are self-explanatory. It says "java.io.IOException: Incomplete HDFS URI, no host: hdfs:/user/hduser/share/lib/spark-assembly.jar". You need to specify the host in that HDFS URI; it should look something like hdfs://<namenode-host>:8020/user/hduser/share/lib/spark-assembly.jar

Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread JasmineGeorge
We are running a batch job with the following specifications:
• Building a RandomForest with config: maxBins=100, depth=19, number of trees = 20
• Multiple runs with different input data sizes: 2.8 GB, 10 million records
• We are running the Spark application on YARN in cluster mode, with 3
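For reference, that configuration maps onto the MLlib API roughly as follows (the input path and the remaining parameters are placeholders):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    // Placeholder path; LibSVM input used for brevity.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm")

    val model = RandomForest.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 20,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 19,
      maxBins = 100)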

Working with zips in pyspark

2016-04-07 Thread tminima
I have n zips in a directory and I want to extract each one of them, get some data out of a file or two inside each zip, and add it to a graph DB. All of my zips are in an HDFS directory. I am thinking my code should be along these lines. # Names of all my zips zip_names = [
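One common approach, sketched here in Scala under the assumption that the zip entries are text: read each zip as a single binary stream with sc.binaryFiles and unpack it with ZipInputStream inside a flatMap.

    import java.io.{BufferedReader, InputStreamReader}
    import java.util.zip.ZipInputStream
    import scala.collection.mutable.ArrayBuffer

    // Each zip becomes one (path, PortableDataStream) record.
    val entries = sc.binaryFiles("hdfs:///data/zips/*.zip").flatMap {
      case (zipPath, stream) =>
        val zis = new ZipInputStream(stream.open())
        val out = ArrayBuffer[(String, String)]()
        var entry = zis.getNextEntry
        while (entry != null) {
          if (!entry.isDirectory) {
            // read() returns -1 at the end of the current entry,
            // so readLine() stops at each entry boundary.
            val reader = new BufferedReader(new InputStreamReader(zis))
            val text = Stream.continually(reader.readLine()).takeWhile(_ != null).mkString("\n")
            out += ((entry.getName, text))
          }
          entry = zis.getNextEntry
        }
        zis.close()
        out
    }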

Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Nirmal Manoharan
Hi Greg, I use something similar to this in my application, but not for empty strings, so the example below is untested; it should work though. JavaRDD<String> filteredJavaRDD = example.filter(new Function<String, Boolean>() { public Boolean call(String arg0) throws Exception { return (!arg0.equals("")); } }); Thanks! Nirmal

RE: mapWithState not compacting removed state

2016-04-07 Thread Iain Cundy
Hi Ofir, I've discovered that compaction works in 1.6.0 if I switch off Kryo. I was using a workaround to get around mapWithState not supporting Kryo; see https://issues.apache.org/jira/browse/SPARK-12591 My custom KryoRegistrator Java class has // workaround until bug fixes in Spark 1.6.1 kryo.regi

Re: HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Nick Pentreath
You're right Sean, the implementation currently depends on hashCode, so it may differ. I opened a JIRA (which duplicated this one - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574, which is the active JIRA) for using MurmurHash3, which should then be consistent across platforms

Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Ted Yu
This is the version of Kafka Spark depends on: [INFO] +- org.apache.kafka:kafka_2.10:jar:0.8.2.1:compile On Thu, Apr 7, 2016 at 9:14 AM, Haroon Rasheed wrote: > Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0" > compile. I guess the internal dependencies are automatic

Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Haroon Rasheed
Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0" compile. I guess the internal dependencies are automatically pulled when you add spark-streaming-kafka_2.10. Also try changing the version to 1.6.1 or lower. Just to see if the links are broken. Regards, Haroon Syed On 7

building kafka project on intellij Help is much appreciated

2016-04-07 Thread Sudhanshu Janghel
Hello, I am new to building Kafka projects and wish to understand how to make fat jars in IntelliJ. The sbt assembly setup seems confusing and I am unable to resolve the dependencies. Here is my build.sbt: name := "twitter" version := "1.0" scalaVersion := "2.10.4" //libraryDependencies += "org.slf4j" % "sl
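A minimal build.sbt sketch for this kind of project, assuming Spark 1.6.x on Scala 2.10 and the sbt-assembly plugin; Spark itself is marked "provided" so the fat jar stays small, and spark-streaming-kafka pulls in the matching Kafka client transitively, so no separate org.apache.kafka dependency is needed:

    name := "twitter"
    version := "1.0"
    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
    )

    // With sbt-assembly on the plugin classpath, `sbt assembly` builds the
    // fat jar; discard duplicate META-INF files to avoid dedupe errors.
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }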

Re: Dataframe to parquet using hdfs or parquet block size

2016-04-07 Thread Buntu Dev
I tried setting both the HDFS and Parquet block sizes, but writing to Parquet did not seem to have any effect on the total number of blocks or the average block size. Here is what I did: sqlContext.setConf("dfs.blocksize", "134217728") sqlContext.setConf("parquet.block.size", "134217728") sql
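One thing to try, sketched below: Parquet reads its block size from the Hadoop configuration rather than from the Spark SQL conf, so setting it on sc.hadoopConfiguration before the write may behave differently (134217728 bytes = 128 MB; the output path is a placeholder).

    sc.hadoopConfiguration.setInt("dfs.blocksize", 134217728)
    sc.hadoopConfiguration.setInt("parquet.block.size", 134217728)

    df.write.parquet("hdfs:///output/table")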

Re: mapWithState not compacting removed state

2016-04-07 Thread Ofir Kerker
Hi Iain, Did you manage to solve this issue? It looks like we have a similar issue with processing time increasing every micro-batch but only after 30 batches. Thanks. On Thu, Mar 3, 2016 at 4:45 PM Iain Cundy wrote: > Hi All > > > > I’m aggregating data using mapWithState with a timeout set in

HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Sean Owen
Let's say I use HashingTF in my Pipeline to hash a string feature. This is available in Python and Scala, but they hash strings to different values since both use their respective runtime's native hash implementation. This means that I create different feature vectors for the same input. While I ca
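The dependence is easy to see from the MLlib side, where the bucket for a term is derived from the JVM's hashCode; Python's hash() of the same string generally lands elsewhere. A small illustration:

    import org.apache.spark.mllib.feature.HashingTF

    val tf = new HashingTF(1 << 20)
    // The index is a function of the JVM hashCode, so the same term can
    // map to a different bucket under Python's hash().
    println(tf.indexOf("spark"))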

Re: Spark on Mobile platforms

2016-04-07 Thread Luciano Resende
Take a look at Apache Quarks; it is closer to what you are looking for and has the ability to integrate with Spark. http://quarks.apache.org/ On Thu, Apr 7, 2016 at 4:50 AM, sakilQUB wrote: > Hi all, > > I have been trying to find if Spark can be run on a mobile device platform > (Android pr

Re: How to process one partition at a time?

2016-04-07 Thread Andrei
Thanks everyone, both `submitJob` and `PartitionPruningRDD` work for me. On Thu, Apr 7, 2016 at 8:22 AM, Hemant Bhanawat wrote: > Apparently, there is another way to do it. You can try creating a > PartitionPruningRDD and pass a partition filter function to it. This RDD > will do the same t
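For reference, a sketch of the PartitionPruningRDD approach (it is a DeveloperApi, so details may vary by release):

    import org.apache.spark.rdd.PartitionPruningRDD

    val rdd = sc.parallelize(1 to 100, 10)

    // Run the job one partition at a time by pruning to a single partition id.
    for (i <- 0 until rdd.partitions.length) {
      val single = PartitionPruningRDD.create(rdd, partitionId => partitionId == i)
      println(s"partition $i count = ${single.count()}")
    }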

Re: Spark on Mobile platforms

2016-04-07 Thread Michael Slavitch
You should consider mobile agents that feed data into a Spark datacenter via Spark Streaming. > On Apr 7, 2016, at 8:28 AM, Ashic Mahtab wrote: > > Spark may not be the right tool for this. Working on just the mobile device, > you won't be scaling out stuff, and as such most of the benefits o

Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Chris Miller
flatMap? -- Chris Miller On Thu, Apr 7, 2016 at 10:25 PM, greg huang wrote: > Hi All, > > Can someone give me an example of code to get rid of the empty strings in > a JavaRDD? I know there is a filter method in JavaRDD: > https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html#

How to remove empty strings from JavaRDD

2016-04-07 Thread greg huang
Hi All, Can someone give me an example of code to get rid of the empty strings in a JavaRDD? I know there is a filter method in JavaRDD: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html#filter(scala.Function1) Regards, Greg

RE: Spark on Mobile platforms

2016-04-07 Thread Ashic Mahtab
Spark may not be the right tool for this. Working on just the mobile device, you won't be scaling out stuff, and as such most of the benefits of Spark would be nullified. Moreover, it'd likely run slower than things that are meant to work in a single process. Spark is also quite large, which is

difference between simple streaming and windows streaming in spark

2016-04-07 Thread Ashok Kumar
Does simple streaming mean continuous streaming, and windowed streaming a time window? val ssc = new StreamingContext(sparkConf, Seconds(10)) Thanks
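Roughly, yes: the batch interval processes each batch on its own, while window() re-processes a sliding range of recent batches. A small sketch of the difference (the socket source and durations are placeholders; window and slide must be multiples of the batch interval):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("windowed").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // "Simple" streaming: each 10-second batch is processed on its own.
    lines.count().print()

    // Windowed streaming: every 10 seconds, process the last 30 seconds of data.
    lines.window(Seconds(30), Seconds(10)).count().print()

    ssc.start()
    ssc.awaitTermination()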

Spark on Mobile platforms

2016-04-07 Thread sakilQUB
Hi all, I have been trying to find if Spark can be run on a mobile device platform (Android preferably) to analyse mobile log data for some performance analysis. So, basically the idea is to collect and process the mobile log data within the mobile device using the Spark framework to allow real-ti

Re: partition an empty RDD

2016-04-07 Thread Tenghuan He
Thanks for your response Owen :) Yes, I defined K as a ClassTag type and it works. Sorry for bothering you. On Thu, Apr 7, 2016 at 4:07 PM, Sean Owen wrote: > It means pretty much what it says. Your code does not have runtime > class info about K at this point in your code, and it is required. > > On Th
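For readers hitting the same error, a sketch of the fix: partitionBy comes from PairRDDFunctions, whose implicit conversion needs a ClassTag for the key type, hence the context bound.

    import scala.reflect.ClassTag
    import org.apache.spark.{HashPartitioner, Partitioner, SparkContext}
    import org.apache.spark.rdd.RDD

    // The K: ClassTag bound supplies the runtime class info the compiler
    // was complaining about.
    def emptyPartitioned[K: ClassTag, V](sc: SparkContext, part: Partitioner): RDD[(K, (V, Int))] =
      sc.emptyRDD[(K, (V, Int))].partitionBy(part)

    val buffer = emptyPartitioned[String, Double](sc, new HashPartitioner(4))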

Research issues in Spark, Spark Streaming and MLlib

2016-04-07 Thread C.H
Hello, I have a research project and I will be working on Spark, either building something on top of it or trying to improve it. I would like to know what research issues we still have in Spark Streaming, MLlib or Spark itself, in order to improve them. Thanks in advance

Develop locally with Yarn

2016-04-07 Thread Natu Lauchande
Hi, I am working on a Spark Streaming app; when running locally I use "local[*]" as the master of my StreamingContext. I wonder what would be needed to develop locally and run it on YARN from the IDE (I am using IntelliJ IDEA). Thanks, Natu

Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-07 Thread jamborta
Depends. If you'd like to multiply matrices for each row in the data, then you could use a Breeze matrix and do that locally on the nodes in a map or similar. If you'd like to multiply them across the rows, e.g. a row in your data is a row in the matrix, then you could use a distributed matrix lik
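A small sketch of the first option, using Breeze (which ships with Spark) to multiply a shared local matrix against per-row matrices inside a map; the matrices here are toy values.

    import breeze.linalg._

    val mats = sc.parallelize(Seq(
      DenseMatrix((1.0, 2.0), (3.0, 4.0)),
      DenseMatrix((5.0, 6.0), (7.0, 8.0))))

    val w = DenseMatrix((0.5, 0.0), (0.0, 0.5))  // local matrix shared via the closure

    // The multiplication runs locally on the executors, one row at a time.
    val products = mats.map(m => m * w)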

Re: partition an empty RDD

2016-04-07 Thread Sean Owen
It means pretty much what it says. Your code does not have runtime class info about K at this point in your code, and it is required. On Thu, Apr 7, 2016 at 5:52 AM, Tenghuan He wrote: > Hi all, > > I want to create an empty rdd and partition it > > val buffer: RDD[(K, (V, Int))] = base.context.