Re: Filter operation to return two RDDs at once.

2015-06-02 Thread Sean Owen
In the sense here, Spark actually does have operations that make multiple RDDs at once, like randomSplit. However, there is no equivalent of a partition operation that returns both the elements that matched and those that did not in one pass. On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang wrote: > As far as I know, spark
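A minimal sketch of the randomSplit-style operation Sean mentions (illustrative only; the RDD, weights and seed are made-up examples):

    import org.apache.spark.rdd.RDD
    // randomSplit returns several RDDs from one call, but it splits randomly by weight,
    // not by a predicate, so it is not a substitute for a matched/unmatched partition.
    val data: RDD[Int] = sc.parallelize(1 to 100)
    val Array(sampleA, sampleB) = data.randomSplit(Array(0.8, 0.2), seed = 42L)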

Re: Filter operation to return two RDDs at once.

2015-06-02 Thread Jeff Zhang
As far as I know, Spark doesn't support multiple outputs. On Wed, Jun 3, 2015 at 2:15 PM, ayan guha wrote: > Why do you need to do that if the filter and content of the resulting rdd are > exactly the same? You may as well declare them as 1 RDD. > On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > >> I want to

ERROR cluster.YarnScheduler: Lost executor

2015-06-02 Thread patcharee
Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee

Re: Application is "always" in process when I check out logs of completed application

2015-06-02 Thread ayan guha
Have you done sc.stop() ? :) On 3 Jun 2015 14:05, "amghost" wrote: > I run spark application in spark standalone cluster with client deploy > mode. > I want to check out the logs of my finished application, but I always get > a > page telling me "Application history not found - Application xxx is
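A sketch of the pattern ayan is hinting at (assuming a plain standalone driver; names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    val sc = new SparkContext(new SparkConf().setAppName("my-app"))
    try {
      // ... job logic ...
    } finally {
      sc.stop()  // closes the event log so the UI can show the application as completed
    }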

Re: Filter operation to return two RDDs at once.

2015-06-02 Thread ayan guha
Why do you need to do that if the filter and content of the resulting rdd are exactly the same? You may as well declare them as 1 RDD. On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > I want to do this > > val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId > != NULL_VALUE) > > val

Spark Client

2015-06-02 Thread pavan kumar Kolamuri
Hi guys, I am new to Spark. I am using spark-submit to submit Spark jobs. But for my use case I don't want it to exit with System.exit. Is there any other Spark client, more API-friendly than SparkSubmit, which doesn't exit with System.exit? Please correct me if I am missing somethin

How to create fewer output files for Spark job ?

2015-06-02 Thread ๏̯͡๏
I am running a series of Spark functions with 9000 executors and it's resulting in 9000+ files, which is exceeding the namespace file count quota. How can Spark be configured to use CombinedOutputFormat? {code} protected def writeOutputRecords(detailRecords: RDD[(AvroKey[DetailOutputRecord], NullW
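A common alternative to wiring in a combined output format (not covered in this thread, so treat it as a sketch with made-up names) is to coalesce to a fixed partition count before writing, since each partition yields one output file:

    // coalesce(500) caps the job at roughly 500 output part files; 500 is an arbitrary example.
    val records = sc.parallelize(1 to 1000000)   // stand-in for the real detail records
    records.coalesce(500).saveAsTextFile("/tmp/output-with-fewer-files")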

Filter operation to return two RDDs at once.

2015-06-02 Thread ๏̯͡๏
I want to do this:
val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)
This will run two different stages. Can this be done in one stage?
val (qtSessionsWithQt, guidU
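One workaround (a sketch, not something settled in the thread) is to cache the source RDD so both filters reuse the same computed data, even though they still run as two jobs:

    // NULL_VALUE and the sample data are made-up stand-ins for the real session RDD.
    val NULL_VALUE = -1
    val rawQtSession = sc.parallelize(Seq(("a", 1), ("b", NULL_VALUE), ("c", 3))).cache()
    val qtSessionsWithQt   = rawQtSession.filter(_._2 != NULL_VALUE)
    val guidUidMapSessions = rawQtSession.filter(_._2 == NULL_VALUE)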

Re: Spark 1.4 & YARN Application Master fails with 500 connect refused

2015-06-02 Thread Night Wolf
Thanks Marcelo - looks like it was my fault. Seems when we deployed the new version of spark it was picking up the wrong yarn site and setting the wrong proxy host. All good now! On Wed, Jun 3, 2015 at 11:01 AM, Marcelo Vanzin wrote: > That code hasn't changed at all between 1.3 and 1.4; it al

Application is "always" in process when I check out logs of completed application

2015-06-02 Thread amghost
I run a Spark application in a Spark standalone cluster with client deploy mode. I want to check out the logs of my finished application, but I always get a page telling me "Application history not found - Application xxx is still in process". I am pretty sure that the application has indeed completed

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Shixiong Zhu
Could you set "spark.shuffle.io.preferDirectBufs" to false to turn off the off-heap allocation of netty? Best Regards, Shixiong Zhu 2015-06-03 11:58 GMT+08:00 Ji ZHANG : > Hi, > > Thanks for your information. I'll give Spark 1.4 a try when it's released. > > On Wed, Jun 3, 2015 at 11:31 AM, Tathag
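A sketch of where the setting Shixiong suggests would live (app name is a placeholder):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.shuffle.io.preferDirectBufs", "false")  // ask netty to use heap buffers instead of off-heap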

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Ji ZHANG
Hi, Thanks for your information. I'll give Spark 1.4 a try when it's released. On Wed, Jun 3, 2015 at 11:31 AM, Tathagata Das wrote: > Could you try it out with Spark 1.4 RC3? > > Also pinging the Cloudera folks, they may be aware of something. > > BTW, the way I have debugged memory leaks in the pa

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Tathagata Das
Could you try it out with Spark 1.4 RC3? Also pinging the Cloudera folks, they may be aware of something. BTW, the way I have debugged memory leaks in the past is as follows. Run with a small driver memory, say 1 GB. Periodically (maybe with a script), take snapshots of the histogram and also do memory dump

Re: in GraphX,program with Pregel runs slower and slower after several iterations

2015-06-02 Thread Cheuk Lam
I've been encountering something similar too. I suspected that was related to the lineage growth of the graph/RDDs. So I checkpoint the graph every 60 Pregel rounds, after which my program doesn't slow down any more (except that every checkpoint takes some extra time).
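A rough sketch of the periodic-checkpoint idea, assuming Graph.checkpoint() is available (Spark 1.3+) and a hand-rolled iteration loop, since the built-in Pregel helper exposes no checkpoint hook; the graph, update step and path are made up:

    import org.apache.spark.graphx.{Edge, Graph}
    sc.setCheckpointDir("hdfs:///tmp/graph-checkpoints")   // example path
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    var g: Graph[Int, Int] = Graph.fromEdges(edges, defaultValue = 0)
    for (round <- 1 to 300) {
      g = g.mapVertices((_, v) => v + 1).cache()           // placeholder for one Pregel-style round
      if (round % 60 == 0) g.checkpoint()                  // truncate lineage every 60 rounds
    }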

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Ji ZHANG
Hi, Thanks for your reply. Here are the top 30 entries of the jmap -histo:live result: num #instances #bytes class name -- 1: 40802 145083848 [B 2: 99264 12716112 3: 99264 12291480 4:

Re: Spark 1.4 & YARN Application Master fails with 500 connect refused

2015-06-02 Thread Marcelo Vanzin
That code hasn't changed at all between 1.3 and 1.4; it also has been working fine for me. Are you sure you're using exactly the same Hadoop libraries (since you're building with -Phadoop-provided) and Hadoop configuration in both cases? On Tue, Jun 2, 2015 at 5:29 PM, Night Wolf wrote: > Hi al

Re: Spark 1.4 & YARN Application Master fails with 500 connect refused

2015-06-02 Thread Night Wolf
Just testing with Spark 1.3, it looks like it sets the proxy correctly to be the YARN RM host (0101) 15/06/03 10:34:19 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT] 15/06/03 10:34:20 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1432690361766_0596_00

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Sean Owen
We are having a separate discussion about this, but I don't understand why this is a problem. You're supposed to build Spark for Hadoop 1 if you run it on Hadoop 1, and I am not sure that is happening here, given the error. I do not think this should change, as I do not see that it fixes anything.

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
Thanks so much Shixiong! This is great. On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu wrote: > Ryan - I sent a PR to fix your issue: > https://github.com/apache/spark/pull/6599 > > Edward - I have no idea why the following error happened. "ContextCleaner" > doesn't use any Hadoop API. Could you tr

Spark 1.4 & YARN Application Master fails with 500 connect refused

2015-06-02 Thread Night Wolf
Hi all, Trying out Spark 1.4 on MapR Hadoop 2.5.1 running in yarn-client mode. It seems the application master doesn't work anymore; I get a 500 connect refused, even when I hit the IP/port of the Spark UI directly. The logs don't show much. I built Spark with Java 6, Hive & Scala 2.10 and 2.11. I'v

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Shixiong Zhu
Ryan - I sent a PR to fix your issue: https://github.com/apache/spark/pull/6599 Edward - I have no idea why the following error happened. "ContextCleaner" doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to support both hadoop 1 and hadoop 2. * "Exception in thread "Spark Cont

How to limit the total number of objects in a DStream maintained by updateStateByKey?

2015-06-02 Thread frank_zhang
Hi, I have a Spark Streaming app that uses updateStateByKey to maintain a cache JavaPairDStream<String, CacheData>, in which String is the key and CacheData is a type that contains 3 fields: timestamp, index and value. I want to restrict the total number of objects in the cache and prune the cache based on LRU, i.
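updateStateByKey has no built-in global size cap, so any pruning has to live in the update function; below is a Scala sketch (types and thresholds are made up) that drops keys idle longer than a cutoff by returning None for them, which is the usual way to evict state:

    case class CacheData(timestamp: Long, index: Int, value: String)
    val maxIdleMs = 10 * 60 * 1000L   // arbitrary example: evict keys idle for 10 minutes
    // Returning None removes the key from the maintained state.
    def update(newValues: Seq[CacheData], state: Option[CacheData]): Option[CacheData] = {
      val latest = (newValues ++ state).sortBy(_.timestamp).lastOption
      latest.filter(d => System.currentTimeMillis() - d.timestamp < maxIdleMs)
    }
    // usage: pairs.updateStateByKey(update _)   where pairs: DStream[(String, CacheData)]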

Scripting with groovy

2015-06-02 Thread Paolo Platter
Hi all, Has anyone tried to add scripting capabilities to Spark Streaming using Groovy? I would like to stop the streaming context, update a transformation function written in Groovy (for example, to manipulate JSON), restart the streaming context and obtain a new behavior without re-submitting the

Re: Can't build Spark

2015-06-02 Thread Ted Yu
I ran dev/change-version-to-2.11.sh first. I used the following command but didn't reproduce the error below: mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package My env: maven 3.3.1 Possibly the error was related to proxy setting. FYI On Tue, Jun 2, 2015 at 3:14 PM, Mulugeta Mammo wrote

Behavior of the spark.streaming.kafka.maxRatePerPartition config param?

2015-06-02 Thread dgoldenberg
Hi, Could someone explain the behavior of the spark.streaming.kafka.maxRatePerPartition parameter? The doc says "An important (configuration) is spark.streaming.kafka.maxRatePerPartition which is the maximum rate at which each Kafka partition will be read by (the) direct API." What is the default
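A sketch of setting it (the value is an arbitrary example, in messages per second per partition):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .setAppName("kafka-direct-app")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // cap each partition at ~1000 msgs/sec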

Re: How to monitor Spark Streaming from Kafka?

2015-06-02 Thread Ruslan Dautkhanov
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4 http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf -- Ruslan Dautkhanov On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg wrote: > Thank you, Tathagata, Cody, Otis. > > - Dmitr

Re: Can't build Spark

2015-06-02 Thread Mulugeta Mammo
Spark 1.3.1, Scala 2.11.6, Maven 3.3.3, I'm behind proxy, have set my proxy settings in maven settings. Thanks, On Tue, Jun 2, 2015 at 2:54 PM, Ted Yu wrote: > Can you give us some more information ? > Such as: > which Spark release you were building > what command you used > Scala version you

Re: Can't build Spark

2015-06-02 Thread Ted Yu
Can you give us some more information ? Such as: which Spark release you were building what command you used Scala version you used Thanks On Tue, Jun 2, 2015 at 2:50 PM, Mulugeta Mammo wrote: > building Spark is throwing errors, any ideas? > > > [FATAL] Non-resolvable parent POM: Could not tra

Can't build Spark

2015-06-02 Thread Mulugeta Mammo
building Spark is throwing errors, any ideas? [FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central ( http://repo.maven.apache.org/maven2): Error transferring file: repo.maven.apache.org from http://repo.maven.apache.org/maven2/org/apache/apache/1

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Dimp Bhat
I found this : https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/ml/feature/Tokenizer.html which indicates the Tokenizer did exist in Spark 1.2.0 then and not in 1.2.1? On Tue, Jun 2, 2015 at 12:45 PM, Peter Rudenko wrote: > I'm afraid there's no such class for 1.2.1. This API was a

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
Yup, exactly. All the workers will use local disk in addition to RAM, but maybe one thing you need to configure is the directory to use for that. It should be set through spark.local.dir. By default it's /tmp, which on some machines is also in RAM, so that could be a problem. You should set it t
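A sketch of Matei's suggestion (the path is an example; it must exist and be on real disk on every worker):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .setAppName("disk-heavy-job")
      .set("spark.local.dir", "/mnt/spark-local")  // spill/shuffle files go here instead of a RAM-backed /tmp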

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
So for column you need to pass in a Java function, I have some sample code which does this but it does terrible things to access Spark internals. On Tuesday, June 2, 2015, Olivier Girardot wrote: > Nice to hear from you Holden ! I ended up trying exactly that (Column) - > but I may have done it

Re: Best strategy for Pandas -> Spark

2015-06-02 Thread Olivier Girardot
Thanks for the answer, I'm currently doing exactly that. I'll try to sum up the usual Pandas <=> Spark DataFrame caveats soon. Regards, Olivier. On Tue, Jun 2, 2015 at 02:38, Davies Liu wrote: > The second one sounds reasonable, I think. > > On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
Nice to hear from you Holden ! I ended up trying exactly that (Column) - but I may have done it wrong : In [*5*]: g.agg(Column("percentile(value, 0.5)")) Py4JError: An error occurred while calling o97.agg. Trace: py4j.Py4JException: Method agg([class java.lang.String, class scala.collection.immuta

Re: Can't build Spark 1.3

2015-06-02 Thread Ted Yu
Have you run zinc during build ? See build/mvn which installs zinc. Cheers On Tue, Jun 2, 2015 at 12:26 PM, Ritesh Kumar Singh < riteshoneinamill...@gmail.com> wrote: > It did hang for me too. High RAM consumption during build. Had to free a > lot of RAM and introduce swap memory just to get it

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Peter Rudenko
I'm afraid there's no such class for 1.2.1. This API was added in 1.3.0 AFAIK. On 2015-06-02 21:40, Dimp Bhat wrote: Thanks Peter. Can you share the Tokenizer.java class for Spark 1.2.1? Dimple On Tue, Jun 2, 2015 at 10:51 AM, Peter Rudenko wrote: Hi Di

DataFrames coming in SparkR in Apache Spark 1.4.0

2015-06-02 Thread Emaasit
For the impatient R-user, here is a link to get started working with DataFrames using SparkR. Or copy and paste this link into your web browser: http://people.apache.org/~pwendell/spark-nightly/spark-1.4-docs/

Re: IDE for sparkR

2015-06-02 Thread Emaasit
RStudio is the best IDE for running SparkR. Instructions for this can be found at this link. You will need to set some environment variables as described below. Using SparkR from RStudio: If you wish to use SparkR from RStudio or other R fro

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
You shouldn't have to persist the RDD at all, just call flatMap and reduce on it directly. If you try to persist it, that will try to load the original data into memory, but here you are only scanning through it once. Matei > On Jun 2, 2015, at 2:09 AM, Octavian Ganea wrote: > > Thanks, > > I
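A minimal sketch of that single-pass pattern (input path is a placeholder):

    // One scan over the input: flatMap then reduce, no persist, so nothing is pinned in memory.
    val total = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(_ => 1L)
      .reduce(_ + _)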

Re: Can't build Spark 1.3

2015-06-02 Thread Ritesh Kumar Singh
It did hang for me too. High RAM consumption during the build. I had to free a lot of RAM and introduce swap memory just to get it built on my 3rd attempt. Everything else looks fine. You can download the prebuilt versions from the Spark homepage to save yourself all this trouble. Thanks, Ritesh

Can't build Spark 1.3

2015-06-02 Thread Yakubovich, Alexey
I downloaded the latest Spark (1.3) from GitHub. Then I tried to build it, first for Scala 2.10 (and Hadoop 2.4): build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package That resulted in a hang after printing a bunch of lines like [INFO] Dependency-reduced POM written at …

Re: spark sql - reading data from sql tables having space in column names

2015-06-02 Thread David Mitchell
I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, such as "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to address this issue. On Jun 2, 2015 10:08 AM, "ayan guha" wrote: > I would think the easiest way would b

Issues with Spark Streaming and Manual Clock used for Unit Tests

2015-06-02 Thread mobsniuk
I have a situation where I have multiple tests that use Spark Streaming with a manual clock. The first run is OK and processes the data when I increment the clock to the sliding window duration. The second test deviates and doesn't process any data. The traces follow, which indicates the memory store is

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Yin Huai
Does it happen every time you read a parquet source? On Tue, Jun 2, 2015 at 3:42 AM, Anders Arpteg wrote: > The log is from the log aggregation tool (hortonworks, "yarn logs ..."), > so both executors and driver. I'll send a private mail to you with the full > logs. Also, tried another job as yo

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Dimp Bhat
Thanks Peter. Can you share the Tokenizer.java class for Spark 1.2.1. Dimple On Tue, Jun 2, 2015 at 10:51 AM, Peter Rudenko wrote: > Hi Dimple, > take a look to existing transformers: > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder

Re: updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Krot Viacheslav
Thanks, this works. Hopefully I didn't miss something important with this approach. On Tue, Jun 2, 2015 at 20:15, Cody Koeninger wrote: > If you're using the spark partition id directly as the key, then you don't > need to access offset ranges at all, right? > You can create a single instance of a parti

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
So in Spark, after acquiring executors from the cluster manager, are tasks scheduled on executors based on data locality? I mean, if in an application there are 2 jobs and the output of job 1 is used as the input of another job, and in job 1 I did persist some RDD, then while running job 2 will it use th

Re: updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Cody Koeninger
If you're using the spark partition id directly as the key, then you don't need to access offset ranges at all, right? You can create a single instance of a partitioner in advance, and all it needs to know is the number of partitions (which is just the count of all the kafka topic/partitions). On
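A sketch of the partitioner Cody describes, assuming the key carried through the stream is the Spark partition index set upstream via TaskContext (class and names are illustrative):

    import org.apache.spark.Partitioner
    // Only the total partition count is needed up front; keys are assumed to already be
    // the Spark partition index, so records stay in place and no data moves in the shuffle.
    class IndexPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key.asInstanceOf[Int]
    }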

[OFFTOPIC] Big Data Application Meetup

2015-06-02 Thread Alex Baranau
Hi everyone, I wanted to drop a note about a newly organized developer meetup in Bay Area: the Big Data Application Meetup (http://meetup.com/bigdataapps) and call for speakers. The plan is for meetup topics to be focused on application use-cases: how developers can build end-to-end solutions with

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Richard Marscher
Are you sure it's memory related? What is the disk utilization and IO performance on the workers? The error you posted looks to be related to a shuffle trying to obtain block data from another worker node and failing to do so in a reasonable amount of time. It may still be memory related, but I'm not s

Re: HDFS Rest Service not available

2015-06-02 Thread Su She
Ahh, this did the trick. I had to get the namenode out of safe mode, however, before it fully worked. Thanks! On Tue, Jun 2, 2015 at 12:09 AM, Akhil Das wrote: > It says your namenode is down (connection refused on 8020), you can restart > your HDFS by going into hadoop directory and typing sbin/

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Peter Rudenko
Hi Dimple, take a look to existing transformers: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala https://github.com/apache/s

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Dimp Bhat
Thanks for the quick reply Ram. Will take a look at the Tokenizer code and try it out. Dimple On Tue, Jun 2, 2015 at 10:42 AM, Ram Sriharsha wrote: > Hi > > We are in the process of adding examples for feature transformations ( > https://issues.apache.org/jira/browse/SPARK-7546) and this shoul

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Ram Sriharsha
Hi We are in the process of adding examples for feature transformations ( https://issues.apache.org/jira/browse/SPARK-7546) and this should be available shortly on Spark Master. In the meanwhile, the best place to start would be to look at how the Tokenizer works here: https://github.com/apache/sp
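A minimal sketch of a custom transformer modeled on that Tokenizer, assuming the Spark 1.4-era spark.ml API (the class name and tokenization logic are made up; 1.2.x differs, as noted elsewhere in the thread):

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

    // Splits a string column into lowercase words; usable as a stage in an ml Pipeline.
    class SimpleTokenizer(override val uid: String)
        extends UnaryTransformer[String, Seq[String], SimpleTokenizer] {

      def this() = this(Identifiable.randomUID("simpleTok"))

      override protected def createTransformFunc: String => Seq[String] =
        _.toLowerCase.split("\\s+").toSeq

      override protected def validateInputType(inputType: DataType): Unit =
        require(inputType == StringType, s"Input type must be StringType but got $inputType")

      override protected def outputDataType: DataType = ArrayType(StringType, containsNull = false)
    }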

Re: updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Krot Viacheslav
Cody, Thanks, good point. I fixed getting the partition id to:
class MyPartitioner(offsetRanges: Array[OffsetRange]) extends Partitioner {
  override def numPartitions: Int = offsetRanges.size
  override def getPartition(key: Any): Int = {
    // this is set in .map(m => (TaskContext.get().partition

Re: data localisation in spark

2015-06-02 Thread Sandy Ryza
It is not possible with JavaSparkContext either. The API mentioned below currently does not have any effect (we should document this). The primary difference between MR and Spark here is that MR runs each task in its own YARN container, while Spark runs multiple tasks within an executor, which ne

Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread dimple
Hi, I would like to embed my own transformer in the Spark.ml Pipeline but do not see an example of it. Can someone share an example of which classes/interfaces I need to extend/implement in order to do so? Thanks. Dimple

Re: updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Cody Koeninger
I think the general idea is worth pursuing. However, this line: override def getPartition(key: Any): Int = { key.asInstanceOf[(String, Int)]._2 } is using the kafka partition id, not the spark partition index, so it's going to give you fewer partitions / incorrect index Cast the rdd to H

Re: Spark 1.3.0: how to let Spark history load old records?

2015-06-02 Thread Marcelo Vanzin
Take a look at the Spark History Server (see documentation). On Mon, Jun 1, 2015 at 8:36 PM, Haopu Wang wrote: > When I start the Spark master process, the old records are not shown in > the monitoring UI. > > How to show the old records? Thank you very much! > > > --

Scala By the Bay + Big Data Scala 2015 Program Announced

2015-06-02 Thread Alexy Khrabrov
The programs and schedules for Scala By the Bay (SBTB) and Big Data Scala By the Bay (BDS) 2015 conferences are announced and published: Scala By the Bay — August 13-16 ( scala.bythebay.io) Big Data Scala By the Bay — August 16-18 ( big

Re: Spark 1.3.0: how to let Spark history load old records?

2015-06-02 Thread Otis Gospodnetic
I think Spark doesn't keep historical metrics. You can use something like SPM for that - http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.co

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-02 Thread Josh Rosen
My suggestion is that you change the Spark setting which controls the compression codec that Spark uses for internal data transfers. Set spark.io.compression.codec to lzf in your SparkConf. On Mon, Jun 1, 2015 at 8:46 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Hello Josh, > Are you suggesting to store the sourc
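A sketch of where Josh's setting goes:

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.io.compression.codec", "lzf")  // use lzf instead of the default codec for internal transfers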

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
Is it possible in JavaSparkContext? JavaSparkContext jsc = new JavaSparkContext(conf); JavaRDD<String> lines = jsc.textFile(args[0]); If yes, is it the programmer's responsibility to first calculate split locations and then instantiate the Spark context with preferred locations? How is it achieved in MR2

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Josh Rosen
Ah, interesting. While working on my new Tungsten shuffle manager, I came up with some nice testing interfaces for allowing me to manually trigger spills in order to deterministically test those code paths without requiring large amounts of data to be shuffled. Maybe I could make similar test int

updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Krot Viacheslav
Hi all, In my streaming job I'm using the Kafka streaming direct approach and want to maintain state with updateStateByKey. My PairRDD has the message's topic name + partition id as a key. So, I assume that updateStateByKey could work within the same partition as the KafkaRDD and not lead to shuffles. Actually this i

Can't figure out spark-sql errors - switching to Impala - sorry guys

2015-06-02 Thread Sanjay Subramanian
Can't figure out spark-sql errors - switching to Hive and Impala for now - sorry guys, no hard feelings From: Sanjay Subramanian To: Sanjay Subramanian ; user Sent: Saturday, May 30, 2015 1:52 PM Subject: Re: spark-sql errors any ideas guys ? how to solve this ? From: Sanja

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
I think this is causing issues upgrading ADAM to Spark 1.3.1 (cf. adam#690); attempting to build against Hadoop 1.0.4 yields errors like: 2015-06-02 15:57:44 ERROR Executor:96 - Exce

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
Not super easily; the GroupedData class uses a strToExpr function which has a pretty limited set of functions, so we can't pass in the name of an arbitrary Hive UDAF (unless I'm missing something). We can instead construct a Column with the expression you want and then pass it in to agg() that way (

Re: build jar with all dependencies

2015-06-02 Thread Paul Röwer
Which Maven dependency do I need, too? http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html On 02.06.2015 at 16:04, Yana Kadiyska wrote: Can you run using spark-submit? What is happening is that you are running a simple java program -- you've w

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
I've finally come to the same conclusion, but isn't there any way to call these Hive UDAFs from agg("percentile(key,0.5)")? On Tue, Jun 2, 2015 at 15:37, Yana Kadiyska wrote: > Like this...sqlContext should be a HiveContext instance > > case class KeyValue(key: Int, value: String) > val d

Transactional guarantee while saving DataFrame into a DB

2015-06-02 Thread Mohammad Tariq
Hi list, With the help of Spark DataFrame API we can save a DataFrame into a database table through insertIntoJDBC() call. However, I could not find any info about how it handles the transactional guarantee. What if my program gets killed during the processing? Would it end up in partial load? Is

Re: spark sql - reading data from sql tables having space in column names

2015-06-02 Thread ayan guha
I would think the easiest way would be to create a view in the DB with column names with no spaces. In fact, you can "pass" a SQL query in place of a real table. From the documentation: "The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used. For exam
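A sketch of that subquery approach on Spark 1.3.x (URL, table and column names are made-up examples):

    // Alias the spaced column inside a subquery so the DataFrame only ever sees "executor_info".
    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=u;password=p",
      "dbtable" -> "(select [Executor Info] as executor_info from dbo.logs) as t"
    ))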

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
Can you run using spark-submit? What is happening is that you are running a simple java program -- you've wrapped spark-core in your fat jar, but at runtime you likely need the whole Spark system in order to run your application. I would mark spark-core as provided (so you don't wrap it in your f

Re: build jar with all dependencies

2015-06-02 Thread Pa Rö
Okay, but how can I compile my app so it runs without -Dconfig.file=alt_reference1.conf? 2015-06-02 15:43 GMT+02:00 Yana Kadiyska: > This looks like your app is not finding your Typesafe config. The config > should usually be placed in a particular folder under your app to be seen > correctly. I

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
This looks like your app is not finding your Typesafe config. The config should usually be placed in a particular folder under your app to be seen correctly. If it's in a non-standard location you can pass -Dconfig.file=alt_reference1.conf to java to tell it where to look. If this is a config that b

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
Like this... sqlContext should be a HiveContext instance:
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("table")
sqlContext.sql("select percentile(key,0.5) from table").show()
On Tue, Jun 2, 2015 at 8:07 AM,

Re: How to read sequence File.

2015-06-02 Thread Akhil Das
Basically, you need to convert it to a serializable format before doing the collect/take. You can fire up a spark shell and paste this:
val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid")
  .map(_._2.toString)
sFile.take(5).foreach(println)
Use t

Re: How to read sequence File.

2015-06-02 Thread ๏̯͡๏
Spark Shell:
val x = sc.sequenceFile("/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761", classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text])
OR
val x = sc.sequenceFile("/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761", classOf[org.apache.hadoop.io.Text], cl

How to read sequence File.

2015-06-02 Thread ๏̯͡๏
I have a sequence file SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v? Key = Text Value = Text and it seems to be using GzipCodec. How should I read it from Spark? I am using val x = sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).parti

Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
Hi everyone, Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within a DataFrame. Hive seems to imply that using ntile one can compute percentiles, quartiles and therefore a median. Does anyone have experience with this

build jar with all dependencies

2015-06-02 Thread Pa Rö
Hello community, I have built a jar file from my Spark app with Maven (mvn clean compile assembly:single) and the following pom file: <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.ap

Re: Shared / NFS filesystems

2015-06-02 Thread Akhil Das
You can run/submit your code from one of the workers which has access to the file system, and it should be fine, I think. Give it a try. Thanks Best Regards On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar wrote: > Hello! > > I have Spark running in standalone mode, and there are a bunch of worker

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Anders Arpteg
The log is from the log aggregation tool (hortonworks, "yarn logs ..."), so both executors and driver. I'll send a private mail to you with the full logs. Also, tried another job as you suggested, and it actually worked fine. The first job was reading from a parquet source, and the second from an a

Shared / NFS filesystems

2015-06-02 Thread Pradyumna Achar
Hello! I have Spark running in standalone mode, and there are a bunch of worker nodes connected to the master. The workers have a shared file system, but the master node doesn't. Is this something that's not going to work? i.e., should the master node also be on the same shared filesystem mounted

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Shixiong Zhu
How about other jobs? Is it an executor log, or a driver log? Could you post other logs near this error, please? Thank you. Best Regards, Shixiong Zhu 2015-06-02 17:11 GMT+08:00 Anders Arpteg : > Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that > worked fine for Spark 1.3.

Re: What is shuffle read and what is shuffle write ?

2015-06-02 Thread Akhil Das
I found an interesting presentation http://www.slideshare.net/colorant/spark-shuffle-introduction and go through this thread also http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html Thanks Best Regards On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:

What is shuffle read and what is shuffle write ?

2015-06-02 Thread ๏̯͡๏
Is it the input and output bytes/record size? -- Deepak

Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Anders Arpteg
Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that worked fine for Spark 1.3. The job starts on the cluster (yarn-cluster mode), initial stage starts, but the job fails before any task succeeds with the following error. Any hints? [ERROR] [06/02/2015 09:05:36.962] [Executor ta

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread octavian.ganea
I tried using reduceByKey, without success. I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey. However, I got the same error as before, namely the error described here: http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-

spark sql - reading data from sql tables having space in column names

2015-06-02 Thread Sachin Goyal
Hi, We are using spark sql (1.3.1) to load data from Microsoft sql server using jdbc (as described in https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases ). It is working fine except when there is a space in column names (we can't modify the schemas to remove s

Re: Streaming K-medoids

2015-06-02 Thread Marko Dinic
Erik, Thank you for your answer. It seems really good, but unfortunately I'm not very familiar with Scala, so I have only partly understood it. Could you please explain your idea with a Spark implementation? Best regards, Marko On Mon 01 Jun 2015 06:35:17 PM CEST, Erik Erlandson wrote: I haven't giv

Insert overwrite to hive - ArrayIndexOutOfBoundsException

2015-06-02 Thread patcharee
Hi, I am using spark 1.3.1. I tried to insert (a new partition) into an existing partitioned hive table, but got ArrayIndexOutOfBoundsException. Below is a code snippet and the debug log. Any suggestions please. + case class Record4Dim(key: String,

Re: Spark stages very slow to complete

2015-06-02 Thread Karlson
Hi, the code is some hundreds of lines of Python. I can try to compose a minimal example as soon as I find the time, though. Any ideas until then? Would you mind posting the code? On 2 Jun 2015 00:53, "Karlson" wrote: Hi, In all (pyspark) Spark jobs that become somewhat more involved, I am e

Union data type

2015-06-02 Thread Agarwal, Shagun
Hi, As per the doc https://github.com/databricks/spark-avro/blob/master/README.md, the Union type doesn't support all kinds of combinations. Is there any plan to support a union type having string & long in the near future? Thanks Shagun Agarwal

Re: HDFS Rest Service not available

2015-06-02 Thread Akhil Das
It says your namenode is down (connection refused on 8020), you can restart your HDFS by going into hadoop directory and typing sbin/stop-dfs.sh and then sbin/start-dfs.sh Thanks Best Regards On Tue, Jun 2, 2015 at 5:03 AM, Su She wrote: > Hello All, > > A bit scared I did something stupid...I

Re: Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-02 Thread Akhil Das
You can try to skip the tests, try with: mvn -Dhadoop.version=2.4.0 -Pyarn *-DskipTests* clean package Thanks Best Regards On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch wrote: > I downloaded the 1.3.1 distro tarball > > $ll ../spark-1.3.1.tar.gz > -rw-r-@ 1 steve staff 8500861 Apr 23 0

Re: using pyspark with standalone cluster

2015-06-02 Thread Akhil Das
If you want to submit applications to a remote cluster where your port 7077 is opened publicly, then you would need to set spark.driver.host (with the public IP of your laptop) and spark.driver.port (optional, if there's no firewall between your laptop and the remote cluster). Keeping you
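A sketch of Akhil's suggestion (master URL, IP and port are placeholders):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .setAppName("remote-submit")
      .setMaster("spark://remote-master:7077")
      .set("spark.driver.host", "203.0.113.10")  // public IP of the submitting laptop
      .set("spark.driver.port", "51000")         // optional: fix the port so the firewall can allow it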