Re: spark-submit conflicts with dependencies

2015-01-28 Thread Sean Owen
Normally, if this were all in one app, Maven would have solved the problem for you by choosing 1.8 over 1.6. You do not need to exclude anything; Maven does it for you. Here the problem is that 1.8 is in the app but the server (Spark) uses 1.6. This is what the userClassPathFirst setting is for, t
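
A minimal sketch of that approach (not from the thread): the property name varies by release (spark.files.userClassPathFirst on Spark 1.x executors, spark.executor.userClassPathFirst and spark.driver.userClassPathFirst later), and the class and jar names below are made up.

```
# Hedged example: prefer the application's own jars over Spark's copies on the executors.
spark-submit \
  --class com.example.MyApp \
  --conf spark.files.userClassPathFirst=true \
  my-app-assembly-1.0.jar
```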

Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2015-01-28 Thread Marco
I've switched to maven and all issues are gone, now. 2015-01-23 12:07 GMT+01:00 Sean Owen : > Use mvn dependency:tree or sbt dependency-tree to print all of the > dependencies. You are probably bringing in more servlet API libs from > some other source? > > On Fri, Jan 23, 2015 at 10:57 AM, Marco

Conflict between elasticsearch-spark and elasticsearch-hadoop jars

2015-01-28 Thread aarthi
Hi, we have a Maven project which supports running Spark jobs and Pig jobs, but I can only use one of the elasticsearch-hadoop or elasticsearch-spark jars at a time. If I use both jars together, I get a conflict in org.elasticsearch.hadoop.cfg.SettingsManager, which is present as a class in elasti

How to unregister/re-register a TempTable in Spark?

2015-01-28 Thread shahab
Hi, I just wonder if there is any way to unregister/re-register a TempTable in Spark? best, /Shahab

Re: performance of saveAsTextFile moving files from _temporary

2015-01-28 Thread Aaron Davidson
Upon completion of the 2 hour part of the run, the files did not exist in the output directory? One thing that is done serially is deleting any remaining files from _temporary, so perhaps there was a lot of data remaining in _temporary but the committed data had already been moved. I am, unfortuna

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Rok Roskar
hi, thanks for the quick answer -- I suppose this is possible, though I don't understand how it could come about. The largest individual RDD elements are ~ 1 Mb in size (most are smaller) and the RDD is composed of 800k of them. The file is saved in 134 parts, but is being read in using some 1916+

Issue with SparkContext in cluster

2015-01-28 Thread Marco
I've created a Spark app, which runs fine if I copy the corresponding jar to the Hadoop server (where YARN is running) and submit it there. If I try to submit it from my local machine, I get the error which I've attached below. Submit cmd: "spark-submit.cmd --class ExamplesHadoop.SparkHbase.Tr

ETL process design

2015-01-28 Thread Danny Yates
Hi, My apologies for what has ended up as quite a long email with a lot of open-ended questions, but, as you can see, I'm really struggling to get started and would appreciate some guidance from people with more experience. I'm new to Spark and "big data" in general, and I'm struggling with what I

Re: ETL process design

2015-01-28 Thread Stadin, Benjamin
Hi Danny, For what you describe you might also consider using Spring XD instead, at least for the file-centric stuff. Regards Ben Sent from my iPad > On 28.01.2015 at 10:42, Danny Yates wrote: > > Hi, > > My apologies for what has ended up as quite a long email with a lot of

Re: performance of saveAsTextFile moving files from _temporary

2015-01-28 Thread Thomas Demoor
TL;DR: Extend FileOutputCommitter to eliminate the _temporary storage. There are some implementations to be found online, typically called DirectOutputCommitter, for instance this Spark pull request. Tell Spark to use you
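
For reference, a hedged sketch of the general shape such a DirectOutputCommitter takes (this is not the code from the pull request): tasks write straight to the final location, so there is nothing to move out of _temporary, and it is not safe with speculative execution.

```
import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = { }
  override def setupTask(taskContext: TaskAttemptContext): Unit = { }
  // Nothing to commit: task output already sits in the final output path.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = { }
  override def abortTask(taskContext: TaskAttemptContext): Unit = { }
}

// For the old mapred API (saveAsTextFile / saveAsHadoopFile), the committer can be
// selected through the job configuration, e.g.:
// sc.hadoopConfiguration.set("mapred.output.committer.class",
//   classOf[DirectOutputCommitter].getName)
```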

Re: Issue with SparkContext in cluster

2015-01-28 Thread Shixiong Zhu
It's because you submitted the job from Windows to a Hadoop cluster running on Linux. Spark does not support that yet. See https://issues.apache.org/jira/browse/SPARK-1825 Best Regards, Shixiong Zhu 2015-01-28 17:35 GMT+08:00 Marco : > I've created a spark app, which runs fine if I copy the corresp

Kryo buffer overflows

2015-01-28 Thread Tristan Blakers
A search shows several historical threads for similar Kryo issues, but none seem to have a definitive solution. Currently using Spark 1.2.0. While collecting/broadcasting/grouping moderately sized data sets (~500MB - 1GB), I regularly see exceptions such as the one below. I’ve tried increasing th
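
For context, the relevant knobs in Spark 1.2 are the Kryo buffer sizes below (later releases renamed them to spark.kryoserializer.buffer and spark.kryoserializer.buffer.max); the values here are only illustrative.

```
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "8")         // initial per-task buffer
  .set("spark.kryoserializer.buffer.max.mb", "512")   // ceiling before "buffer overflow"
```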

Running a task over a single input

2015-01-28 Thread Matan Safriel
Hi, How would I run a given function in Spark over a single input object? Would I first add the input to the file system, then somehow invoke the Spark function on just that input? Or should I rather twist the Spark Streaming API for it? Assume I'd like to run a piece of computation that normall

Re: Running a task over a single input

2015-01-28 Thread Sean Owen
Processing one object isn't a distributed operation, and doesn't really involve Spark. Just invoke your function on your object in the driver; there's no magic at all to that. You can make an RDD of one object and invoke a distributed Spark operation on it, but assuming you mean you have it on the
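
A small sketch of the two cases Sean describes (the function and input are made up):

```
def f(s: String): Int = s.length          // whatever per-record function you already have

// One-off, small input: just call it in the driver, no Spark involved.
val one = f("a single record")

// Cluster case: the very same f applied across a full RDD.
val lengths = sc.textFile("hdfs:///some/big/input").map(f)   // path is illustrative
```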

Re: Conflict between elasticsearch-spark and elasticsearch-hadoop jars

2015-01-28 Thread Costin Leau
That indicates that you are using two different versions of es-hadoop (2.0.x) and es-spark (2.1.x). Have you considered aligning the two versions? On 1/28/15 11:08 AM, aarthi wrote: Hi We have a maven project which supports running of spark jobs and pig jobs. But I could use only either one of

Re: Issues with constants in Spark HiveQL queries

2015-01-28 Thread Pala M Muthaia
By typo I meant that the column name had a spelling error: conversion_aciton_id. It should have been conversion_action_id. No, we tried it a few times, and we didn't have + signs or anything like that - we tried it with columns of different types too - string, double, etc. - and saw the same error.

Re: Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-01-28 Thread siddardha
Then your Spark is not built for YARN. Try to build with sbt/sbt -Dhadoop.version=2.3.0 -Pyarn assembly -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-java-lang-IllegalArgumentException-Invalid-rule-tp21382p21404.html Sent from the Apache Spar

RDD caching, memory & network input

2015-01-28 Thread Andrianasolo Fanilo
Hello Spark fellows :), I think I need some help to understand how .cache and task input work within a job. I have a 7 GB input matrix in HDFS that I load using .textFile(). I also have a config file which contains an array of 12 logistic regression model parameters, loaded as an Array[Strin

Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/sc

2015-01-28 Thread Emre Sevinc
Hello, I'm using *Spark 1.1.0* and *Solr 4.10.3*. I'm getting an exception when using *HttpSolrServer* from within Spark Streaming: 15/01/28 13:42:52 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSyste

Percentile Calculation

2015-01-28 Thread kundan kumar
Is there any built-in function for calculating percentiles over a dataset? I want to calculate the percentiles for each column in my data. Regards, Kundan

Re: Percentile Calculation

2015-01-28 Thread Kohler, Curt E (ELS-STL)
When I looked at this last fall, the only way that seemed to be available was to transform my data into SchemaRDDs, register them as tables and then use the Hive processor to calculate them with its built-in percentile UDFs that were added in 1.2. Curt From: k
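
A hedged sketch of that route (the table, column, and case class names are made up; percentile_approx is one of the Hive UDAFs reachable through a HiveContext):

```
import org.apache.spark.sql.hive.HiveContext

case class Event(user: String, latencyMs: Double)

val hiveCtx = new HiveContext(sc)
import hiveCtx.createSchemaRDD          // implicit RDD -> SchemaRDD conversion

val events = sc.parallelize(Seq(Event("a", 12.0), Event("b", 340.0), Event("c", 55.0)))
events.registerTempTable("events")

hiveCtx.sql("SELECT percentile_approx(latencyMs, 0.95) FROM events")
  .collect().foreach(println)
```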

Re: HW imbalance

2015-01-28 Thread simon elliston ball
You shouldn’t have any issues with differing nodes on the latest Ambari and Hortonworks. It works fine for mixed hardware and Spark on YARN. Simon > On Jan 26, 2015, at 4:34 PM, Michael Segel wrote: > > If you’re running YARN, then you should be able to mix and match where YARN is > managing t

Re: Running a task over a single input

2015-01-28 Thread Matan Safriel
Thanks! So I assume I can safely run a function *F* of mine within the spark driver program, without dispatching it to the cluster (?), thereby sticking to one piece of code for *both* a real cluster run over big data, and for small on-demand runs for a single input (now and then), both scenarios

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
I deal with problems like this so often across Java applications with large dependency trees. Add the shell function at the following link to your shell on the machine where your Spark Streaming is installed: https://gist.github.com/cfeduke/fe63b12ab07f87e76b38 Then run in the directory where you
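
A rough stand-in for what such a helper does (not the gist itself): scan every jar for the class that is resolving to the wrong version.

```
# Lists each jar under the current directory that bundles the conflicting class.
for j in $(find . -name '*.jar'); do
  unzip -l "$j" 2>/dev/null \
    | grep -q 'org/apache/http/impl/conn/SchemeRegistryFactory.class' \
    && echo "$j"
done
```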

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
In all of the solutions I've found thus far, sorting has been done by casting the partition iterator into an array and sorting the array. This is not going to work for my case, as the amount of data in each partition may not necessarily fit into memory. Any ideas? On Wed, Jan 28, 2015 at 1:29 AM, Corey N

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering() - is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet wrote: > In all of the solutions I've found thus far, sorting has been done by casting > the

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Emre Sevinc
This is what I get: ./bigcontent-1.0-SNAPSHOT.jar:org/apache/http/impl/conn/SchemeRegistryFactory.class (probably because I'm using a self-contained JAR). In other words, I'm still stuck. -- Emre On Wed, Jan 28, 2015 at 2:47 PM, Charles Feduke wrote: > I deal with problems like this so of

RE: spark 1.2 ec2 launch script hang

2015-01-28 Thread ey-chih chow
We found the problem and already fixed it. Basically, spark-ec2 requires EC2 instances to have external IP addresses. You need to specify this in the AWS console. From: nicholas.cham...@gmail.com Date: Tue, 27 Jan 2015 17:19:21 + Subject: Re: spark 1.2 ec2 launch script hang To: charles.f

Set is not parseable as row field in SparkSql

2015-01-28 Thread Jorge Lopez-Malla
Hello, We are trying to insert a case class into Parquet using Spark SQL. When I'm creating the SchemaRDD that includes a Set, I get the following exception: sqc.createSchemaRDD(r) scala.MatchError: Set[(scala.Int, scala.Int)] (of class scala.reflect.internal.Types$TypeRef$$anon$1) at org.apache.sp

Re: Spark and S3 server side encryption

2015-01-28 Thread Kohler, Curt E (ELS-STL)
So, following up on your suggestion, I'm still having some problems getting the configuration changes recognized when my job runs. I've added jets3t.properties to the root of my application jar file that I submit to Spark (via spark-submit). I've verified that my jets3t.properties is at the root

Re: Running a task over a single input

2015-01-28 Thread Sean Owen
On Wed, Jan 28, 2015 at 1:44 PM, Matan Safriel wrote: > So I assume I can safely run a function F of mine within the spark driver > program, without dispatching it to the cluster (?), thereby sticking to one > piece of code for both a real cluster run over big data, and for small > on-demand runs

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
It looks like you're shading in the Apache HTTP commons library and it's a different version than what is expected. (Maybe 4.6.x based on the Javadoc.) I see you are attempting to exclude commons-httpclient by using: commons-httpclient * in your pom. However,

Re: RDD caching, memory & network input

2015-01-28 Thread Sandy Ryza
Hi Fanilo, How many cores are you using per executor? Are you aware that you can combat the "container is running beyond physical memory limits" error by bumping the spark.yarn.executor.memoryOverhead property? Also, are you caching the parsed version or the text? -Sandy On Wed, Jan 28, 2015 a
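
For reference, a hedged example of bumping that property at submit time (the value and the rest of the command line are illustrative):

```
spark-submit \
  --master yarn-cluster \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my-app.jar
```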

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Emre Sevinc
When I examine the dependencies again, I see that SolrJ library is using v. 4.3.1 of org.apache.httpcomponents:httpclient [INFO] +- org.apache.solr:solr-solrj:jar:4.10.3:compile [INFO] | +- org.apache.httpcomponents:httpclient:jar:4.3.1:compile <<<== [INFO] | +- org.apache.httpcomponents

Re: Spark and S3 server side encryption

2015-01-28 Thread Charles Feduke
I have been trying to work around a similar problem with my Typesafe config *.conf files seemingly not appearing on the executors. (Though now that I think about it, it's not because the files are absent in the JAR, but because the -Dconf.resource system property I pass to the master obviously d

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
Yeah it sounds like your original exclusion of commons-httpclient from hadoop-* was correct, but it's still coming in from somewhere. Can you try something like this?: commons-http httpclient provided ref: http://stackoverflow.com/questions/4716310/is-there-a-way-to-exclude-a-maven-
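
A hedged sketch of the kind of exclusion being discussed; the hadoop-client coordinates and version below are assumptions, not taken from the thread.

```
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
  <exclusions>
    <exclusion>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```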

RE: RDD caching, memory & network input

2015-01-28 Thread Andrianasolo Fanilo
Each machine has 24 cores, but I assume each executor on a machine gets at most one core because I set the --executor-cores property to 1. I'm going to try a higher memoryOverhead later; I'll post the results. I'm caching the parsed version, something like val matrix = Predi

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-28 Thread Guru Medasani
Hi Antony, Did you get past this error by repartitioning your job into smaller tasks, as Sven Krasser pointed out? From: Antony Mayi Reply-To: Antony Mayi Date: Tuesday, January 27, 2015 at 5:24 PM To: Guru Medasani , Sven Krasser Cc: Sandy Ryza , "user@spark.apache.org" Subject: Re: ja

reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread YaoPau
The TwitterPopularTags example works great: the Twitter firehose keeps messages pretty well in order by timestamp, and so to get the most popular hashtags over the last 60 seconds, reduceByKeyAndWindow works well. My stream pulls Apache weblogs from Kafka, and so it's not as simple: messages can p

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Ey-chih, That makes more sense. This is a known issue that will be fixed as part of SPARK-5242. Charles, Thanks for the info. In your case, when does spark-ec2 hang? Only when the specified path to the identity file doesn't exist? Or also when y

Snappy Crash

2015-01-28 Thread Sven Krasser
I'm running into a new issue with Snappy causing a crash (using Spark 1.2.0). Did anyone see this before? -Sven 2015-01-28 16:09:35,448 WARN [shuffle-server-1] storage.MemoryStore (Logging.scala:logWarning(71)) - Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_45_1

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Davies Liu
HadoopRDD will try to split the file into partitions of 64 MB in size, so you got 1916+ partitions (assuming 100 KB per row, they are 80 GB in size). I think there is only a very small chance that one object or one batch will be bigger than 2 GB. Maybe there is a bug when it splits the pickled file; could you create a R

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Akhil Das
I'm not quite sure if I understood it correctly, but can you not create a key from the timestamps and do the reduceByKeyAndWindow over it? Thanks Best Regards On Wed, Jan 28, 2015 at 10:24 PM, YaoPau wrote: > The TwitterPopularTags example works great: the Twitter firehose keeps > messages pret
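
A hedged sketch of that idea: key each record by a bucket of its own log timestamp rather than by arrival time. The field positions, the timestamp parser, and the socket source (standing in for Kafka) are all made up for illustration.

```
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions on Spark 1.2

val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)    // stand-in for the Kafka stream

// Hypothetical extractor: epoch millis in the first field, URL in the second.
def logMinute(line: String): Long = line.split(" ")(0).toLong / 60000 * 60000

val countsByLogMinute = lines
  .map(line => ((logMinute(line), line.split(" ")(1)), 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(120), Seconds(2))
// ssc.start() and ssc.awaitTermination() omitted
```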

Re: How to unregister/re-register a TempTable in Spark?

2015-01-28 Thread Akhil Das
Like this? case class Person(name: String, age: Int) val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) people.registerTempTable("people") people.unpersist() You can see this basic documentation

Re: Set is not parseable as row field in SparkSql

2015-01-28 Thread Cheng Lian
Hey Jorge, This is expected, because there isn't an obvious mapping from Set[T] to any SQL type. Currently we have complex types like array, map, and struct, which are inherited from Hive. In your case, I'd transform the Set[T] into a Seq[T] first; then Spark SQL can map it to an array.
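
A small sketch of that workaround (names are illustrative): convert the Set to a Seq before building the SchemaRDD so it maps to a SQL array column.

```
case class Record(id: Int, pairs: Seq[(Int, Int)])     // Seq instead of Set

val sqc = new org.apache.spark.sql.SQLContext(sc)
import sqc.createSchemaRDD                              // implicit RDD -> SchemaRDD

val raw = sc.parallelize(Seq((1, Set((1, 2), (3, 4))), (2, Set((5, 6)))))
val records = raw.map { case (id, s) => Record(id, s.toSeq) }   // Set -> Seq here
records.registerTempTable("records")    // "pairs" becomes an array<struct<...>> column
```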

MappedRDD signature

2015-01-28 Thread Sanjay Subramanian
hey guys, I am not following why this happens. DATASET === tab-separated values (164 columns). Spark command 1: val mjpJobOrderRDD = sc.textFile("/data/cdr/cdr_mjp_joborder_raw"); val mjpJobOrderColsPairedRDD = mjpJobOrderRDD.map(line => { val tokens = line.split("\t"); (tokens(23),to

Re: MappedRDD signature

2015-01-28 Thread Sean Owen
I think it's clear if you format your function reasonably: mjpJobOrderRDD.map(line => { val tokens = line.split("\t"); if (tokens.length == 164 && tokens(23) != null) { (tokens(23),tokens(7)) } }) In some cases the function returns nothing, in some cases a tuple. The return type is ther
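
One hedged way to make every branch return a value is flatMap with Option, so non-matching lines are simply dropped and the result is cleanly an RDD[(String, String)]:

```
val mjpJobOrderColsPairedRDD = mjpJobOrderRDD.flatMap { line =>
  val tokens = line.split("\t")
  if (tokens.length == 164 && tokens(23) != null) Some((tokens(23), tokens(7)))
  else None                                   // skip short or malformed lines
}
```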

Re: MappedRDD signature

2015-01-28 Thread Sanjay Subramanian
Thanks Sean, that works, and I started the join of this mappedRDD to another one I have. I have to internalize the use of map versus flatMap. Thinking in MapReduce Java Hadoop code often blinds me :-) From: Sean Owen To: Sanjay Subramanian Cc: Cheng Lian ; Jorge Lopez-Malla ; "user@spark.

spark-shell working in scala-2.11

2015-01-28 Thread Stephen Haberman
Hey, I recently compiled Spark master against scala-2.11 (by running the dev/change-versions script), but when I run spark-shell, it looks like the "sc" variable is missing. Is this a known/unknown issue? Are others successfully using Spark with scala-2.11, and specifically spark-shell? It is po

Re: spark-shell working in scala-2.11

2015-01-28 Thread Krishna Sankar
Stephen, Scala 2.11 worked fine for me. Did the dev change and then compile. Not using in production, but I go back and forth between 2.10 & 2.11. Cheers On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman < stephen.haber...@gmail.com> wrote: > Hey, > > I recently compiled Spark master against

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
It was only hanging when I specified the path with ~ I never tried relative. Hanging on the waiting for ssh to be ready on all hosts. I let it sit for about 10 minutes then I found the StackOverflow answer that suggested specifying an absolute path, cancelled, and re-run with --resume and the abso

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Hmm, I can’t see why using ~ would be problematic, especially if you confirm that echo ~/path/to/pem expands to the correct path to your identity file. If you have a simple reproduction of the problem, please send it over. I’d love to look into this. When I pass paths with ~ to spark-ec2 on my sys

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
Yeah, I agree ~ should work. And it could have been [read: probably was] the fact that one of the EC2 hosts was in my known_hosts (don't know, never saw an error message, but the behavior is no error message for that state), which I had fixed later with Pete's patch. But the second execution when t

Parquet divide by zero

2015-01-28 Thread Jim Carroll
Hello all, I've been hitting a divide by zero error in Parquet through Spark, detailed (and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102 Is anyone else hitting this error? I hit it frequently. It looks like the Parquet team is preparing to release 1.6.0 and, since they have

Re: Parquet divide by zero

2015-01-28 Thread Sean Owen
It looks like it's just a problem with the log message? Is it actually causing a problem in Parquet / Spark? But yeah, seems like an easy fix. On Wed, Jan 28, 2015 at 9:28 PM, Jim Carroll wrote: > Hello all, > > I've been hitting a divide by zero error in Parquet through Spark detailed > (and fixed

Re: Parquet divide by zero

2015-01-28 Thread Sean Owen
Answered my own questions seconds later: these aren't doubles, so you don't get NaN, you get an Exception. Right. On Wed, Jan 28, 2015 at 9:35 PM, Sean Owen wrote: > It looks like it's just a problem with the log message? is it actually > causing a problem in Parquet / Spark? but yeah seems like

RE: Spark on Windows 2008 R2 server does not work

2015-01-28 Thread Wang, Ningjun (LNG-NPV)
Has anybody successfully installed and run spark-1.2.0 on Windows 2008 R2 or Windows 7? How did you get it to work? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent:

Re: Spark on Windows 2008 R2 server does not work

2015-01-28 Thread Marcelo Vanzin
https://issues.apache.org/jira/browse/SPARK-2356 Take a look through the comments, there are some workarounds listed there. On Wed, Jan 28, 2015 at 1:40 PM, Wang, Ningjun (LNG-NPV) wrote: > Has anybody successfully install and run spark-1.2.0 on windows 2008 R2 or > windows 7? How do you get tha

unsubscribe

2015-01-28 Thread Abhi Basu
-- Abhi Basu

Re: unsubscribe

2015-01-28 Thread Ted Yu
send an email to user-unsubscr...@spark.apache.org Cheers On Wed, Jan 28, 2015 at 2:16 PM, Abhi Basu <9000r...@gmail.com> wrote: > > > -- > Abhi Basu >

Re: Parquet divide by zero

2015-01-28 Thread Lukas Nalezenec
Hi Jim, I am sorry, I know about your patch and I will commit it ASAP. Lukas Nalezenec On 28.1.2015 22:28, Jim Carroll wrote: Hello all, I've been hitting a divide by zero error in Parquet though Spark detailed (and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102 Is anyo

Appending to an hdfs file

2015-01-28 Thread Matan Safriel
Hi, Is it possible to append to an existing (hdfs) file, through some Spark action? Should there be any reason not to use a hadoop append api within a Spark job? Thanks, Matan

Re: Data Locality

2015-01-28 Thread Harihar Nahak
Hi guys, I have a similar question. How does Spark create an executor on the same node where the data block is stored? Does it first take information from the HDFS name node, get the block information, and then place the executor on the same node if the spark-worker daemon is installed? - --

Dependency unresolved hadoop-yarn-common 1.0.4 when running quickstart example

2015-01-28 Thread Sarwar Bhuiyan
Hello all, I'm trying to build the sample application on the spark 1.2.0 quickstart page (https://spark.apache.org/docs/latest/quick-start.html) using the following build.sbt file: name := "Simple Project" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "

Re: Appending to an hdfs file

2015-01-28 Thread Sean Owen
You can call any API you like in a Spark job, as long as the libraries are available, and Hadoop HDFS APIs will be available from the cluster. You could write a foreachPartition() that appends partitions of data to files, yes. Spark itself does not use appending. I think the biggest reason is that
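
A hedged sketch of the per-partition append Sean mentions; it assumes an HDFS setup where append is enabled, "rdd" stands for whatever RDD you want to write, and the output path is made up. Each partition appends to its own file, since concurrent appends to a single HDFS file are not supported.

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

rdd.mapPartitionsWithIndex { (idx, iter) =>
  val fs = FileSystem.get(new Configuration())
  val out = fs.append(new Path(s"/data/running-log/part-$idx"))
  try iter.foreach(rec => out.write((rec.toString + "\n").getBytes("UTF-8")))
  finally out.close()
  Iterator.empty
}.count()   // action to force the writes
```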

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Imran Rashid
I'm not an expert on streaming, but I think you can't do anything like this right now. It seems like a very sensible use case, though, so I've created a jira for it: https://issues.apache.org/jira/browse/SPARK-5467 On Wed, Jan 28, 2015 at 8:54 AM, YaoPau wrote: > The TwitterPopularTags example

Hive on Spark vs. SparkSQL using Hive ?

2015-01-28 Thread ogoh
Hello, probably this question was already asked, but I'd still like to confirm with Spark users. The following blog shows 'Hive on Spark': http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/. How is it different from using Hive as the data storage of Spark SQL (http://spa

Re: Data Locality

2015-01-28 Thread hnahak
I have written a custom input split and I want to pin it to the specific node where my data is stored, but currently the split can start at any node and pick data from a different node in the cluster. Any suggestion on how to set the host in Spark? -- View this message in context: http://apache-spark-user-
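
For the new Hadoop API, the locality hint comes from InputSplit.getLocations(); Spark's NewHadoopRDD turns those host names into the preferred locations of the matching partition. A hedged sketch (the host name is made up):

```
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.InputSplit

class PinnedSplit extends InputSplit with Writable {
  override def getLength: Long = 0L
  override def getLocations: Array[String] = Array("datanode-07.example.com")
  override def write(out: java.io.DataOutput): Unit = { }
  override def readFields(in: java.io.DataInput): Unit = { }
}
```

For data already sitting in the driver, SparkContext.makeRDD also accepts a Seq of (value, preferred-hosts) pairs, which is a lighter way to pin partitions to hosts.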

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 1:54 AM, YaoPau wrote: > > My thinking is to maintain state in an RDD and update it and persist it with > each 2-second pass, but this also seems like it could get messy. Any > thoughts or examples that might help me? > I have just implemented some timestamp-based win

is there a master for spark cluster in ec2

2015-01-28 Thread Mohit Singh
Hi, Probably a naive question.. But I am creating a spark cluster on ec2 using the ec2 scripts in there.. But is there a master param I need to set.. ./bin/pyspark --master [ ] ?? I don't yet fully understand the ec2 concepts so just wanted to confirm this?? Thanks -- Mohit "When you want succ

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-28 Thread Cheng Lian
Hey Yana, An update about this Parquet filter push-down issue. It turned out to be a bit complicated, but (hopefully) all clear now. 1. Yesterday I found a bug in Parquet, which essentially disables row group filtering for almost all AND predicates. * JIRA ticket: PARQUET-173

RE: unsubscribe

2015-01-28 Thread Bob Tiernay
Cheers Date: Wed, 28 Jan 2015 14:18:49 -0800 Subject: Re: unsubscribe From: yuzhih...@gmail.com To: 9000r...@gmail.com CC: user@spark.apache.org send an email to user-unsubscr...@spark.apache.org Cheers On Wed, Jan 28, 2015 at 2:16 PM, Abhi Basu <9000r...@gmail.com> wrote: -- Abhi Basu

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Peter Zybrick
Below is trace from trying to access with ~/path. I also did the echo as per Nick (see the last line), looks ok to me. This is my development box with Spark 1.2.0 running CentOS 6.5, Python 2.6.6 [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file=~

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I think this repartitionAndSortWithinPartitions() method may be what I'm looking for in [1]. At least it sounds like it is. Will this method allow me to deal with sorted partitions even when the partition doesn't fit into memory? [1] https://github.com/apache/spark/blob/branch-1.2/core/src/main/sc
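
A hedged sketch of using it (the sample data and partitioner are made up); the sort happens as part of the shuffle, so a partition does not have to be pulled into memory as an array to be ordered.

```
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._     // pair-RDD implicits needed on Spark 1.2

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

sorted.foreachPartition { iter =>
  iter.foreach { case (k, v) => println(s"$k -> $v") }   // records arrive in key order
}
```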

Data are partial to a specific partition after sort

2015-01-28 Thread 瀬川 卓也
For example, consider a word count over long text data (on the order of 100 GB). There is clearly a bias in the words; the data is expected to have a long tail when doing the word count. The most frequent word probably accounts for more than 1/10 of the total. Word count code: ``` val allWordLineSplited: RDD[String] = // create RD

Re: Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-28 Thread Jim Green
Thanks to all for responding. I finally figured out the way to bulk load into HBase using Scala on Spark. The sample code, which others can refer to in the future, is here: http://www.openkb.info/2015/01/how-to-use-scala-on-spark-to-load-data.html Thanks! On Tue, Jan 27, 2015 at 6:27 PM, Jim Green wrote:

StackOverflowError with SchemaRDD

2015-01-28 Thread ankits
Hi, I am getting a stack overflow error when querying a schemardd comprised of parquet files. This is (part of) the stack trace: Caused by: java.lang.StackOverflowError at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Tathagata Das
Ohhh nice! It would be great if you could share some code with us soon. It is indeed a very complicated problem and there is probably no single solution that fits all use cases. So having one way of doing things would be a great reference. Looking forward to that! On Wed, Jan 28, 2015 at 4:52 PM, Tobias Pfei

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Thanks for sending this over, Peter. What if you try this? (i.e. Remove the = after --identity-file.) ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file ~/.pzkeys/spark-streaming-kp.pem --region=us-east-1 login pz-spark-cluster If that works, then I think the problem in this case is si

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
If that was indeed the problem, I suggest updating your answer on SO to help others who may run into this same problem. On Wed Jan 28 2015 at 9:40:39 PM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Thanks for sending this over, Peter. > >

Re: data locality in logs

2015-01-28 Thread hnahak
Hi, how do I set a preferred location for an InputSplit in Spark standalone? I have data on a specific machine and I want to read it using splits created for that node only, by assigning some property which helps Spark create the split on that node only. -- View this message in co

Re: Error reporting/collecting for users

2015-01-28 Thread Tathagata Das
You could use foreachRDD to do the operations and then inside the foreach create an accumulator to gather all the errors together dstream.foreachRDD { rdd => val accumulator = new Accumulator[] rdd.map { . }.count // whatever operation that is error prone // gather all errors
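
A hedged expansion of that outline; process(...) is a stand-in for whatever error-prone per-record operation you run, and a plain Int accumulator is used here (a collection-backed Accumulable could gather the error messages themselves).

```
dstream.foreachRDD { rdd =>
  val errorCount = rdd.sparkContext.accumulator(0)
  rdd.foreach { record =>
    try process(record)
    catch { case e: Exception => errorCount += 1 }
  }
  // Accumulator values are only reliable after the action has completed.
  println("errors in this batch: " + errorCount.value)
}
```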

Re: spark sqlContext udaf

2015-01-28 Thread sunwei
Thanks very much. It seems that I have to use HiveContext at present. On Jan 28, 2015, at 11:34 AM, Kuldeep Bora wrote: > UDAF is a WIP, at least from an API user's perspective, as there is no public API > to my knowledge. > > https://issues.apache.org/jira/browse/SPARK-3947 > > Thanks > > On Tue, Jan 27, 2

Re: Hive on Spark vs. SparkSQL using Hive ?

2015-01-28 Thread Arush Kharbanda
Spark SQL on Hive: 1. The purpose of Spark SQL is to allow Spark users to selectively use SQL expressions (with not a huge number of functions currently supported) when writing Spark jobs. 2. Already available. Hive on Spark: 1. Spark users will automatically get the whole set of Hive’s rich features,

RE: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Shao, Saisai
That's definitely a good supplement to the current Spark Streaming; I've heard that many people want to process the data using log time. Looking forward to the code. Thanks Jerry -Original Message- From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Thursday, January 29, 2015 10:33