Hi, I am using Spark 1.3.1 and I am trying to submit a Spark job to the REST
URL of Spark (spark::6066) using a standalone Spark cluster
with deploy mode as cluster. Both driver and application are getting
submitted; after doing their work (output created), both end up with KILLED
status. In the work
I figured it out. I needed a block-style map and a check for null. The case
class is just to name the transformed columns.
case class Component(name: String, loadTimeMs: Long)
avroFile.filter($"lazyComponents.components".isNotNull)
.explode($"lazyComponents.components")
{ case Row(la
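For reference, here is a sketch of how the completed call might look, reusing the Component case class above (the inner struct layout, name followed by loadTimeMs, is assumed from the schema posted later in this digest, so adjust the accessors to your actual schema):
import org.apache.spark.sql.Row

val components = avroFile
  .filter($"lazyComponents.components".isNotNull)
  .explode($"lazyComponents.components") { case Row(cs: Seq[Row @unchecked]) =>
    cs.map(c => Component(c.getString(0), c.getLong(1)))  // one output row per array element
  }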
Please file a bug here: https://issues.apache.org/jira/browse/SPARK/
Could you also provide a way to reproduce this bug (including some datasets)?
On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga wrote:
> I've changed the SIFT feature extraction to SURF feature extraction and it
> works...
>
> Fol
I've changed the SIFT feature extraction to SURF feature extraction and it
works...
Following line was changed:
sift = cv2.xfeatures2d.SIFT_create()
to
sift = cv2.xfeatures2d.SURF_create()
Where should I file this as a bug? When not running on Spark it works fine
so I'm saying it's a spark bug.
Yeah, I should have emphasized that. I'm running the same code on the same VM.
It's a VM with Spark in standalone mode and I run the unit test directly on
that same VM. So OpenCV is working correctly on that same machine, but when
moving the exact same OpenCV code to Spark it just crashes.
On Tue, Jun
I see this:
Is this a problem with my code or the cluster? Is there any way to fix it?
FetchFailed(BlockManagerId(2, phxdpehdc9dn2441.stratus.phx.ebay.com,
59574), shuffleId=1, mapId=80, reduceId=20, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to
phxdpehdc9dn2441.st
I see these in my mapper-only task:
- *Input Size / Records:* 68.0 GB / 577195178
- *Shuffle write:* 95.1 GB / 282559291
- *Shuffle spill (memory):* 2.8 TB
- *Shuffle spill (disk):* 90.3 GB
I understand the first one; can someone give one- or two-liners for the next three?
Also tel
Hello,
I have a question about the Spark 1.4 ml library. In the copy function, it is
stated that the default implementation of Params doesn't work
for models. (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Model.scala#L49
)
As a result, some feature
Hi,
org.apache.spark.mllib.linalg.Vector =
(1048576,[35587,884670],[3.458767233,3.458767233])
It is the sparse vector representation of the terms:
the first value (1048576) is the length of the vector,
[35587,884670] are the indices of the words, and
[3.458767233,3.458767233] are the tf-idf values of the terms.
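For reference, the same vector can be built by hand (a minimal sketch using the numbers above):
import org.apache.spark.mllib.linalg.Vectors

val tfidfVector = Vectors.sparse(
  1048576,                           // length of the vector (size of the feature space)
  Array(35587, 884670),              // indices of the terms
  Array(3.458767233, 3.458767233))   // tf-idf values of the terms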
Thanks
S
Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?
-Original Message-
From: ogoh [mailto:oke...@gmail.com]
Sent: Friday, June 5, 2015 10:10 AM
To: user@spark.apache.org
Subject: SparkSQL: using Hive UDF returning Map throws "Error:
scala.MatchError: interface java.util.
Hi Lee,
You should be able to create a PairRDD using the Nonce as the key, and the
AnalyticsEvent as the value. I'm very new to Spark, but here is some
uncompilable pseudo code that may or may not help:
events.map(event => (event.getNonce, event))  // key each event by its nonce
  .reduceByKey((a, b) => a)                   // keep one event per nonce
  .map(_._2)                                  // drop the keys again
The above cod
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operations
are on columns, e.g., I need to create many intermediate variables from
different columns. What is the most efficient way to do this?
For example, if my dataRDD[Array[String]] is like below:
123, 523, 534, ..., 893
Hello,
I tested some custom UDFs on Spark SQL's ThriftServer & Beeline (Spark 1.3.1).
Some UDFs work fine (accessing an array parameter and returning int or string
types).
But my UDF returning a map type throws an error:
"Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)
(state=,co
But compile scope is supposed to be added to the assembly.
https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope
On Thu, Jun 4, 2015 at 1:24 PM, algermissen1971
wrote:
> Hi Iulian,
>
> On 26 May 2015, at 13:04, Iulian Dragoș
> wrote:
>
> >
> >
Thanks so much, Yiannis, Olivier, Huang!
On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas
wrote:
> Hi there,
>
> I would recommend checking out
> https://github.com/spark-jobserver/spark-jobserver which I think gives
> the functionality you are looking for.
> I haven't tested it though.
>
> BR
>
My current example doesn't use a Hive UDAF, but you would do something
pretty similar (it calls a new user defined UDAF, and there are wrappers to
make Spark SQL UDAFs from Hive UDAFs but they are private). So this is
doable, but since it pokes at internals it will likely break between
versions of
Hi there!
I'm trying to read a large .csv file (14GB) into a dataframe from S3 via the
spark-csv package. I want to load this data in parallel utilizing all 20
executors that I have; however, by default only 3 executors are being used
(which downloaded 5 GB/5 GB/4 GB).
Here is my script (I'm using pys
For the first round, you will have 16 reducers working since you have
32 partitions. Each pair of the 32 partitions shares the same key, so
reduceByKey sends both of them to the same reducer.
After this step is done, you will have 16 partitions, so the next
round will have 8 reducers.
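As a usage sketch (the depth is just an argument to treeReduce/treeAggregate; rdd here stands for any RDD[Int] you already have):
val total = rdd.treeReduce((a, b) => a + b, depth = 4)   // depth defaults to 2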
Sincerely,
DB Tsai
Hi there,
I would recommend checking out
https://github.com/spark-jobserver/spark-jobserver which I think gives the
functionality you are looking for.
I haven't tested it though.
BR
On 5 June 2015 at 01:35, Olivier Girardot wrote:
> You can use it as a broadcast variable, but if it's "too" lar
You can use it as a broadcast variable, but if it's "too" large (more than
1 GB, I guess), you may need to share it by joining it to the other RDDs
using some kind of key.
But this is the kind of thing broadcast variables were designed for.
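A minimal sketch of the broadcast approach (loadDictionary() is only a placeholder for however you build the map on the driver):
val dictBc = sc.broadcast(loadDictionary())   // shipped once per executor, read-only

val enriched = rdd.map { record =>
  val dict = dictBc.value    // local reference on the executor
  // ... look up whatever you need in dict ...
  record
}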
Regards,
Olivier.
On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg
Is the dictionary read-only?
Did you look at
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables ?
-Original Message-
From: dgoldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Thursday, June 04, 2015 4:50 PM
To: user@spark.apache.org
Subject: How to share larg
We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.
What would be the best strategy for sharing such resources among the
transformations/actions within a consumer? Can they be shared somehow
across the RDD's?
I'm looking for a way to l
Hi all,
I'm trying to run the standalone application SimpleApp.scala following the
instructions on the
http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
I was able to create a .jar file by doing sbt package. However when I tried
to do
$ YOUR_SPARK_HOME/bin/spark-submit
Hi Iulian,
On 26 May 2015, at 13:04, Iulian Dragoș wrote:
>
> On Tue, May 26, 2015 at 10:09 AM, algermissen1971
> wrote:
> Hi,
>
> I am setting up a project that requires Kafka support and I wonder what the
> roadmap is for Scala 2.11 Support (including Kafka).
>
> Can we expect to see 2.1
Hey DB,
Thanks for the reply!
I still don't think this answers my question. For example, if I have a
top() action being executed and I have 32 workers (32 partitions), and I
choose a depth of 4, what does the overlay of intermediate reducers look
like? How many reducers are there excluding the mas
Hi Doug,
sqlContext.table does not officially support database name. It only
supports table name as the parameter. We will add a method to support
database name in future.
Thanks,
Yin
On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog wrote:
> Hi Yin,
> I’m very surprised to hear that its not suppor
Hi Brandon, they are available, but private to the ml package. They are now
public in 1.4. For 1.3.1 you can define your transformer in the
org.apache.spark.ml package - then you can use these traits.
Thanks,
Peter Rudenko
On 2015-06-04 20:28, Brandon Plaster wrote:
Is "HasInputCol" and "HasOutputCo
Got it. Ignore my similar question on Github comments.
On Thu, Jun 4, 2015 at 11:48 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:
> Yeah - We don't have support for running UDFs on DataFrames yet. There is
> an open issue to track this
> https://issues.apache.org/jira/browse/SPAR
Yeah - We don't have support for running UDFs on DataFrames yet. There is
an open issue to track this https://issues.apache.org/jira/browse/SPARK-6817
Thanks
Shivaram
On Thu, Jun 4, 2015 at 3:10 AM, Daniel Emaasit
wrote:
> Hello Shivaram,
> Was the includePackage() function deprecated in SparkR
Hi - I'm having a similar problem with switching from ephemeral to persistent
HDFS - it always looks for port 9000 regardless of the options I set for
persistent HDFS on port 9010. Have you figured out a solution? Thanks
I think if you create a bidirectional mapping from AnalyticsEvent to
another type that would wrap it and use the nonce as its equality, you
could then do something like reduceByKey to group by nonce and map back to
AnalyticsEvent after.
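For example (a rough Scala sketch; AnalyticsEvent and getNonce are assumed from the original code):
class NonceKey(val event: AnalyticsEvent) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: NonceKey => that.event.getNonce == event.getNonce
    case _              => false
  }
  override def hashCode: Int = event.getNonce.hashCode
}

val deduped = events
  .map(e => (new NonceKey(e), e))   // key by the nonce-equality wrapper
  .reduceByKey((a, b) => a)         // keep one event per nonce
  .map(_._2)                        // map back to AnalyticsEvent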
On Thu, Jun 4, 2015 at 1:10 PM, lbierman wrote:
> I'm still
By default, the depth of the tree is 2. Each partition will be one node.
Sincerely,
DB Tsai
---
Blog: https://www.dbtsai.com
On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar wrote:
> Hey Reza,
>
> Thanks for your response!
>
> Your response cl
Additionally, I think this document (
https://spark.apache.org/docs/latest/building-spark.html ) should mention
that the protobuf.version might need to be changed to match the one used in
the chosen Hadoop version. For instance, with Hadoop 2.7.0 I had to change
protobuf.version to 2.5.0 to be able
That might work, but there might also be other steps that are required.
-Sandy
On Thu, Jun 4, 2015 at 11:13 AM, Saiph Kappa wrote:
> Thanks! It is working fine now with spark-submit. Just out of curiosity,
> how would you use org.apache.spark.deploy.yarn.Client? Adding that
> spark_yarn jar to
Thanks! It is working fine now with spark-submit. Just out of curiosity,
how would you use org.apache.spark.deploy.yarn.Client? Adding that
spark_yarn jar to the configuration inside the application?
On Thu, Jun 4, 2015 at 6:37 PM, Vova Shelgunov wrote:
> You should run it with spark-submit or u
Hey Reza,
Thanks for your response!
Your response clarifies some of my initial thoughts. However, what I don't
understand is how the depth of the tree is used to identify how many
intermediate reducers there will be, and how many partitions are sent to
the intermediate reducers. Could you provide
Deenar,
Thanks for the suggestion.
That is one of the ideas that I have, but I didn't get a chance to try it out yet.
One of the things that could potentially cause problems is that we use wide
rows. In addition, the schema is dynamic, with new columns getting added on a
regular basis. That is why I
The issue was that the SSH key generated on the Spark master was not transferred to
this new slave. The spark-ec2 script with the `start` command omits this step. The
solution is to use the `launch` command with the `--resume` option. Then the SSH
key is transferred to the new slave and everything goes smoothly.
spark-submit is the recommended way of launching Spark applications on
YARN, because it takes care of submitting the right jars as well as setting
up the classpath and environment variables appropriately.
-Sandy
On Thu, Jun 4, 2015 at 10:30 AM, Saiph Kappa wrote:
> No, I am not. I run it with s
Is "HasInputCol" and "HasOutputCol" available in 1.3.1? I'm getting the
following message when I'm trying to implement a Transformer and importing
org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}:
error: object shared is not a member of package org.apache.spark.ml.param
and
error: tr
No, I am not. I run it with sbt «sbt "run-main Branchmark"». I thought it
was the same thing since I am passing all the configurations through the
application code. Is that the problem?
On Thu, Jun 4, 2015 at 6:26 PM, Sandy Ryza wrote:
> Hi Saiph,
>
> Are you launching using spark-submit?
>
> -S
In a regular reduce, all partitions have to send their reduced value to a
single machine, and that machine can become a bottleneck.
In a treeReduce, the partitions talk to each other in a logarithmic number
of rounds. Imagine a binary tree that has all the partitions at its leaves
and the root wil
Hi Saiph,
Are you launching using spark-submit?
-Sandy
On Thu, Jun 4, 2015 at 10:20 AM, Saiph Kappa wrote:
> Hi,
>
> I've been running my spark streaming application in standalone mode
> without any worries. Now, I've been trying to run it on YARN (hadoop 2.7.0)
> but I am having some problems
Hi,
I've been running my spark streaming application in standalone mode without
any worries. Now, I've been trying to run it on YARN (hadoop 2.7.0) but I
am having some problems.
Here are the config parameters of my application:
«
val sparkConf = new SparkConf()
sparkConf.setMaster("yarn-client"
The closest algorithm to decision lists that we have is decision trees
https://spark.apache.org/docs/latest/mllib-decision-tree.html
On Thu, Jun 4, 2015 at 2:14 AM, Sateesh Kavuri
wrote:
> Hi,
>
> I have used weka machine learning library for generating a model for my
> training set. I have used
vm.swappiness=0? Some vendors recommend setting this to 0 (zero), although I've
seen this cause even the kernel to fail to allocate memory.
It may cause a node reboot. If that's the case, set vm.swappiness to 5-10 and
decrease spark.*.memory. Your spark.driver.memory +
spark.executor.memory + OS + etc >> amo
I'm still a bit new to Spark and am struggling to figure out the best way to
dedupe my events.
I load my Avro files from HDFS and then I want to dedupe events that have
the same nonce.
For example my code so far:
JavaRDD events = ((JavaRDD>)
context.newAPIHadoopRDD(
context.hadoopC
take(5) will only evaluate enough partitions to provide 5 elements
(sometimes a few more but you get the idea), so it won't trigger a full
evaluation of all partitions unlike count().
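For example (a quick sketch; rdd stands for whatever RDD you are checking):
val sample = rdd.take(5)    // scans only as many partitions as needed to collect 5 elements
val total  = rdd.count()    // forces evaluation of every partition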
On Thursday, June 4, 2015, Piero Cinquegrana
wrote:
> Hi DB,
>
>
>
> Yes I am running count() operations on the
Hi DB,
Yes I am running count() operations on the previous steps and it appears that
something is slow prior to the scaler. I thought that running take(5) and print
the results would execute the command at each step and materialize the RDD, but
is that not the case? That’s how I was testing eac
Hi all!
I have a .txt file where each row is a collection of terms of a
document separated by spaces. For example:
1 "Hola spark"
2 ..
I followed this example from the Spark site
https://spark.apache.org/docs/latest/mllib-feature-extraction.html and I get
something like this:
tfidf.first()
or
I talked to Don outside the list and he says that he's seeing this issue
with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a
real issue here.
On Wed, Jun 3, 2015 at 1:39 PM, Don Drake wrote:
> As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that
> Spa
It is possible to start multiple concurrent drivers; Spark dynamically
allocates ports per "spark application" on the driver, master, and workers from
a port range. When you collect results back to the driver, they do not go
through the master. The master is mostly there as a coordinator between the
dr
I would try setting PYSPARK_DRIVER_PYTHON environment variable to the
location of your python binary, especially if you are using a virtual
environment.
-Don
On Wed, Jun 3, 2015 at 8:24 PM, AlexG wrote:
> I have libskylark installed on both machines in my two node cluster in the
> same location
Thanks for the confirmation! We're quite new to Spark, so a little
reassurance is a good thing to have sometimes :-)
The thing that's concerning me at the moment is that my job doesn't seem to
run any faster with more compute resources added to the cluster, and this
is proving a little tricky to d
Hi,
As far as I understand, you shouldn't send data to the driver. Suppose you have
a file in HDFS/S3 or Cassandra partitioning; you should create your job such
that every executor/worker of Spark handles part of your input,
transforms and filters it, and at the end writes back to Cassandra as output (once
ag
Are you using RC4?
On Wed, Jun 3, 2015 at 10:58 PM, Night Wolf wrote:
> Thanks Yin, that seems to work with the Shell. But on a compiled
> application with Spark-submit it still fails with the same exception.
>
> On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai wrote:
>
>> Can you put the following set
Hi all,
I'm running Spark on AWS EMR and I'm having some issues getting the correct
permissions on the output files using
rdd.saveAsTextFile(''). In Hive, I would add a line at the
beginning of the script with
set fs.s3.canned.acl=BucketOwnerFullControl
and that would set the correct grantees f
Hi Yin,
I'm very surprised to hear that it's not supported in 1.3 because I've been
using it since 1.3.0.
It worked great up until SPARK-6908 was merged into master.
What is the supported way to get DF for a table that is not in the default
database ?
IMHO, if you are not going to support “da
I have a spark app that reads avro & sequence file data and performs join,
reduceByKey
Results:
Command for all runs:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/ha
Hi all,
I am new to Spark. I am trying to deploy HDFS (hadoop-2.6.0) and Spark-1.3.1
with four nodes, and each node has 8 cores and 8 GB memory.
One is configured as the head node running the masters, and the 3 others are workers.
But when I try to run PageRank from HiBench, it always causes a node to
reb
Mohammed
Have you tried registering your Cassandra tables in Hive/Spark SQL using
the data frames API? These should then be available to query via the Spark
SQL/Thrift JDBC Server.
Deenar
On 1 June 2015 at 19:33, Mohammed Guller wrote:
> Nobody using Spark SQL JDBC/Thrift server with DSE Cass
Hello,
I am relatively new to Spark and I am currently trying to understand how to
scale large numbers of jobs with Spark.
I understand that the Spark architecture is split into "Driver", "Master" and
"Workers". The Master has a standby node in case of failure, and workers can scale
out.
All the examples I
Hi,
I encountered a performance issue when joining 3 tables in Spark SQL.
Here is the query:
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
I need to pay a
Hi
2015-06-04 15:29 GMT+02:00 James Aley :
> Hi,
>
> We have a load of Avro data coming into our data systems in the form of
> relatively small files, which we're merging into larger Parquet files with
> Spark. I've been following the docs and the approach I'm taking seemed
> fairly obvious, and
So, a few updates. When I run locally as stated before, it works fine. When I
run in YARN (via Apache Myriad on Mesos) it also runs fine. The only issue
is specifically with Mesos. I wonder if there is some sort of classpath
goodness I need to fix or something along those lines. Any tips would be
ap
For your first question, please take a look at HADOOP-9922.
The fix is in hadoop-common module.
Cheers
On Thu, Jun 4, 2015 at 2:53 AM, Jean-Charles RISCH <
risch.jeanchar...@gmail.com> wrote:
> Hello,
> *(Before everything : I use IntellijIdea 14.0.1, SBT and Scala 2.11.6)*
>
> This morning, I w
The direct stream isn't a receiver; it isn't required to cache data anywhere
unless you want it to.
If you want that, just call cache.
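For example (a sketch; directStream stands for the DStream returned by createDirectStream):
import org.apache.spark.storage.StorageLevel

directStream.persist(StorageLevel.MEMORY_AND_DISK)   // or simply directStream.cache()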
On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg
wrote:
> "set the storage policy for the DStream RDDs to MEMORY AND DISK" - it
> appears the storage level can be sp
Hi Holden, Olivier
>>So for column you need to pass in a Java function, I have some sample
code which does this but it does terrible things to access Spark internals.
I also need to call a Hive UDAF in a dataframe agg function. Are there any
examples of what Column expects?
Deenar
On 2 June 201
Hi,
We have a load of Avro data coming into our data systems in the form of
relatively small files, which we're merging into larger Parquet files with
Spark. I've been following the docs and the approach I'm taking seemed
fairly obvious, and pleasingly simple, but I'm wondering if perhaps it's
not
or you could
1) convert the dataframe to an RDD
2) use mapPartitions and zipWithIndex within each partition
3) convert the RDD back to a dataframe; you will need to make sure you preserve
partitioning (see the sketch below)
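A rough sketch of steps 1-3 (Spark 1.3-era API; df and sqlContext are assumed to exist, and combining the partition id with a per-partition offset is just one way to keep the index unique):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val indexed = df.rdd.mapPartitionsWithIndex { case (partId, rows) =>
  rows.zipWithIndex.map { case (row, i) =>
    Row.fromSeq(row.toSeq :+ ((partId.toLong << 32) | i.toLong))   // partition-local index tagged with the partition id
  }
}
val schema = StructType(df.schema.fields :+ StructField("rowIdx", LongType, nullable = false))
val dfWithIndex = sqlContext.createDataFrame(indexed, schema)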
Deenar
On 1 June 2015 at 02:23, ayan guha wrote:
> If you are on spark 1.3, use repartitionandSort followed
"set the storage policy for the DStream RDDs to MEMORY AND DISK" - it
appears the storage level can be specified in the createStream methods but
not createDirectStream...
On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov wrote:
> You can also try Dynamic Resource Allocation
>
>
>
>
> https://spark.a
Hi Tariq
You need to handle the transaction semantics yourself. You could for
example save from the dataframe to a staging table and then write to the
final table using a single atomic "INSERT INTO finalTable from
stagingTable" call. Remember to clear the staging table first to recover
from previo
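A rough sketch of that pattern, assuming Hive-managed tables and a HiveContext (the table names are illustrative; with a plain JDBC target you would run the equivalent SQL through your database connection):
df.insertInto("staging_table", true)   // overwrite = true clears any previous failed attempt
sqlContext.sql("INSERT INTO TABLE final_table SELECT * FROM staging_table")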
Shixiong,
Thanks, interesting point. So if we want to only process one batch then
terminate the consumer, what's the best way to achieve that? Presumably the
listener could set a flag on the driver notifying it that it can terminate.
But the driver is not in a loop, it's basically blocked in
await
It worked.
On Thu, Jun 4, 2015 at 5:14 PM, MEETHU MATHEW
wrote:
> Try using coalesce
>
> Thanks & Regards,
> Meethu M
>
>
>
> On Wednesday, 3 June 2015 11:26 AM, "ÐΞ€ρ@Ҝ (๏̯͡๏)"
> wrote:
>
>
> I am running a series of spark functions with 9000 executors and its
> resulting in 9000+ files that
Try using coalesce.
Thanks & Regards,
Meethu M
On Wednesday, 3 June 2015 11:26 AM, "ÐΞ€ρ@Ҝ (๏̯͡๏)"
wrote:
I am running a series of spark functions with 9000 executors and it's resulting
in 9000+ files, which is exceeding the namespace file count quota.
How can Spark be configured to
Hi,
I am running my graphx application with Spark 1.3.1 on a small cluster. Then it
failed on this exception:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 1
But I actually found it is indeed caused by “ExecutorLostFailure”, and someone
told it
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
|-- pageViewId: string (nullable = false)
|-- components: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- name: string (nullable = false)
|    |    |-- loadT
Hello Shivaram,
Was the includePackage() function deprecated in SparkR 1.4.0?
I don't see it in the documentation. If it was, does that mean that we can
use R packages on Spark DataFrames the usual way we do for local R
dataframes?
Daniel
--
Daniel Emaasit
Ph.D. Research Assistant
Transportation
Hello,
*(Before everything : I use IntellijIdea 14.0.1, SBT and Scala 2.11.6)*
This morning, I was looking to resolve the "Failed to locate the winutils
binary in the hadoop binary path" error.
I noticed that I can solve it configuring my build.sbt to
"...
libraryDependencies += "org.apache.ha
Hi Matei,
thank you for answering.
According to what you said, am I mistaken when I say that tuples with the
same key might eventually be spread across more than one node in case an
overloaded worker can no longer accept tuples?
In other words, suppose a worker (processing key K) cannot accept
Hi,
I have used weka machine learning library for generating a model for my
training set. I have used the PART algorithm (decision lists) from weka.
Now, I would like to use spark ML for the PART algo for my training set and
could not seem to find a parallel. Could anyone point out the correspond
That's because you need to add the master's public key (~/.ssh/id_rsa.pub)
to the newly added slave's ~/.ssh/authorized_keys.
I add slaves this way:
- Launch a new instance by clicking on the slave instance and choosing *launch
more like this*
- Once it's launched, ssh into it and add the master p
Replace this line:
img_data = sc.parallelize( list(im.getdata()) )
With:
img_data = sc.parallelize(list(im.getdata()), 3 * num_cores)  # num_cores = total number of cores you have
Thanks
Best Regards
On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur wrote:
> Hi all,
>
> I'm playing around with manipulating images via Pyth
You should not call `jssc.stop(true);` in a StreamingListener. It will
cause a dead-lock: `jssc.stop` won't return until `listenerBus` exits. But
since `jssc.stop` blocks `StreamingListener`, `listenerBus` cannot exit.
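If you do need to stop the context once some condition is met, one workaround (a sketch, not taken from the original thread) is to trigger the stop from a separate thread, so that the listener callback returns immediately and `listenerBus` can exit:
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StopAfterOneBatch(jssc: JavaStreamingContext) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // Stop asynchronously; this callback returns right away, so listenerBus is never blocked.
    new Thread("stop-streaming-context") {
      override def run(): Unit = jssc.stop(true)
    }.start()
  }
}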
Best Regards,
Shixiong Zhu
2015-06-04 0:39 GMT+08:00 dgoldenberg :
> Hi,
>
>
Hi,
I am using Hive 0.14 and spark 0.13. I got a
java.lang.NullPointerException when inserting into Hive. Any suggestions,
please.
hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE +
",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
"select date, hh, x, y, hei
Hi,
I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this
setting can be seen in the web UI's Environment tab. But it still eats memory,
i.e. -Xmx is set to 512M but RES grows to 1.5G in half a day.
On Wed, Jun 3, 2015 at 12:02 PM, Shixiong Zhu wrote:
> Could you set "spark.shuffle.
Hi
Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183
Thanks
Best Regards
On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg
wrote:
> Hi,
>
> I've got a Spark Streaming driver job implemented and in it, I register a
> streaming listener, like so:
>