Interacting with Different Versions of Hive Metastore: how to configure?

2015-09-14 Thread bg_spark
spark.sql.hive.metastore.version (default: 0.13.1): Version of the Hive metastore. Available options are 0.12.0 through 1.2.1. spark.sql.hive.metastore.jars (default: builtin): Location of the jars that should be used to instantiate the HiveMetastoreClient. This property can be one of three options:
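A minimal sketch of wiring this up from Scala (the version and option values here are illustrative; "maven" downloads matching Hive jars, while "builtin" requires the version to match Spark's bundled Hive):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Set before the SparkContext is created; values are assumptions.
    val conf = new SparkConf()
      .setAppName("hive-metastore-example")
      .set("spark.sql.hive.metastore.version", "0.12.0")
      .set("spark.sql.hive.metastore.jars", "maven")

    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)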

Re: Best way to merge final output part files created by Spark job

2015-09-14 Thread Gaspar Muñoz
Hi, check out FileUtil.copyMerge function in the Hadoop API
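A sketch of calling it from Scala, assuming the job wrote its part files under /output/job-parts (all paths illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // Merges every file under the source directory into one destination file.
    // deleteSource = true removes the part files afterwards; the final String
    // argument is a separator appended between files (null for none).
    FileUtil.copyMerge(fs, new Path("/output/job-parts"),
      fs, new Path("/output/merged.txt"), true, conf, null)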

Re: MLlib LDA implementation questions

2015-09-14 Thread Marko Asplund
On 2015-09-11 at 16:08:44 +0200 Carsten Schnober wrote: Hi, > I don't have practical experience with the MLlib LDA implementation, but > regarding the variations in the topic matrix: LDA makes use of stochastic > processes. If you use setSeed(seed) with the same value for seed during > initializati
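For reference, a small sketch of fixing the seed so repeated runs initialize identically (corpus is an assumed RDD[(Long, Vector)] of document id and term counts):

    import org.apache.spark.mllib.clustering.LDA

    val lda = new LDA()
      .setK(10)
      .setMaxIterations(50)
      .setSeed(42L) // same seed => same random initialization across runs
    val model = lda.run(corpus)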

Re: How to clear Kafka offset in Spark streaming?

2015-09-14 Thread Bin Wang
I think I've found the reason. It seems that the smallest offset is not 0 and I should not set the offset to 0. Bin Wang wrote on Monday, Sep 14, 2015 at 2:46 PM: > Hi, > > I'm using spark streaming with kafka and I need to clear the offset and > re-compute all things. I deleted checkpoint directory in HDFS an
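A sketch of letting Kafka pick the earliest retained offset instead of hard-coding 0 (broker address, topic and ssc are assumptions):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // "smallest" starts from the earliest offset Kafka still retains, which
    // is not necessarily 0 once retention has deleted old log segments.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset" -> "smallest")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))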

Re: Spark Streaming..Exception

2015-09-14 Thread Priya Ch
Hi All, I came across the related old conversation on the above issue (https://issues.apache.org/jira/browse/SPARK-5594). Is the issue fixed? I tried different values for spark.cleaner.ttl -> 0sec, -1sec, 2000sec... none of them worked. I also tried setting spark.streaming.unpersist -> true. Wh

DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread petranidis
Hi all, I am new to spark and I have written a few spark programs mostly around machine learning applications. I am trying to resolve a particular problem where there are two RDDs that should be updated by using elements of each other. More specifically, if the two pair RDDs are called A and B M

Re: DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread Sean Owen
There isn't a cycle in your graph, since although you reuse reference variables in your code called A and B you are in fact creating new RDDs at each operation. You have some other problem, and you'd have to provide detail on why you think something is deadlocked, like a thread dump. On Mon, Sep 1
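To illustrate the point, a toy sketch: reassigning the variables never mutates an RDD, it just binds the name to a new RDD whose lineage points only backwards, so no cycle can form:

    var a = sc.parallelize(1 to 10)
    var b = sc.parallelize(11 to 20)

    val aTotal = a.sum()         // an action, evaluated immediately on the old `a`
    b = b.map(_ + aTotal.toInt)  // brand-new RDD; the previous `b` is unchanged
    a = a.map(_ * 2)             // likewise a new RDD, not a mutation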

Spark Streaming Topology

2015-09-14 Thread defstat
Hi all, I would like to use Spark Streaming for managing the problem below: I have 2 InputStreams, one for one type of input (n-dimensional vectors) and one for questions on the infrastructure (explained below). I need to "break" the input first in 4 execution nodes, and produce a stream from ea

Re: DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread Petros Nyfantis
Hi Sean, thanks a lot for your reply. Yes, I understand that as Scala is a functional language, maps correspond to transforms of immutable objects, but the behavior of the program seems like a deadlock, as it simply does not proceed beyond the B = B.map(A.aggregate) stage. My Spark web interface show

Re: Problems with Local Checkpoints

2015-09-14 Thread Akhil Das
You need to set your HADOOP_HOME and make sure the winutils.exe is available in the PATH. Here's a discussion around the same issue http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path Also this JIRA https://issues.apache.org/jira/browse/SPARK-
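For reference, the property can also be set programmatically before the SparkContext is created (the path is an assumption; winutils.exe must live in its bin subdirectory):

    // Equivalent to setting the HADOOP_HOME environment variable on Windows.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")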

RE: Problems with Local Checkpoints

2015-09-14 Thread Bryan
Akhil, This looks like the issue. I'll update my path to include the (soon to be added) winutils & assoc. DLLs. Thank you, Bryan -Original Message- From: "Akhil Das" Sent: 9/14/2015 6:46 AM To: "Bryan Jeffrey" Cc: "user" Subject: Re: Problems with Local Checkpoints You need to s

Re: SparkR - Support for Other Models

2015-09-14 Thread Akhil Das
You can look into the Spark JIRA page for the same, if it isn't available there then you could possibly create an issue for support and hopefully in later releases it will be added. Thanks Best Regards On Thu, Sep 10, 2015 at 11:26 AM, Manish MAHESHW

Re: Implement "LIKE" in SparkSQL

2015-09-14 Thread Jorge Sánchez
I think after you get your table as a DataFrame, you can do a filter over it, something like: val t = sqlContext.sql("select * from table t") val df = t.filter(t("a").contains(t("b"))) Let us know the results. 2015-09-12 10:45 GMT+01:00 liam : > > OK, I got another way, it looks silly and low i
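Expanded into a self-contained shape (column names a and b are from the thread; sqlContext is assumed):

    // Keep rows where column `a` contains the value of column `b` on the same
    // row, something a constant LIKE pattern cannot express.
    val t = sqlContext.table("table")
    val df = t.filter(t("a").contains(t("b")))
    df.show()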

Re: Spark task hangs infinitely when accessing S3

2015-09-14 Thread Akhil Das
Are you sitting behind a proxy or something? Can you look more into the executor logs? I have a strange feeling that you are blowing the memory (and possibly hitting GC etc). Thanks Best Regards On Thu, Sep 10, 2015 at 10:05 PM, Mario Pastorelli < mario.pastore...@teralytics.ch> wrote: > Dear co

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread Akhil Das
You can look into this doc regarding the connection (its for gce though but it should be similar). Thanks Best Regards On Thu, Sep 10, 2015 at 11:20 PM, roni wrote: > I have spark installed on a EC2 cluster. Can I connect to that from my

Re: java.lang.NullPointerException with Twitter API

2015-09-14 Thread Akhil Das
Some status might not have the geoLocation and hence you are doing a null.toString.contains which ends up in that exception, put a condition or try...catch around it to make it work. Thanks Best Regards On Fri, Sep 11, 2015 at 12:59 AM, Jo Sunad wrote: > Hello! > > I am trying to customize the
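A small sketch of the guard, assuming tweets is a DStream of twitter4j Status objects:

    // getGeoLocation is null for most tweets, so test before calling toString.
    val located = tweets.filter(_.getGeoLocation != null)
    val matches = located.filter(_.getGeoLocation.toString.contains("48.85"))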

Re: Spark Streaming..Exception

2015-09-14 Thread Akhil Das
You should consider upgrading your spark from 1.3.0 to a higher version. Thanks Best Regards On Mon, Sep 14, 2015 at 2:28 PM, Priya Ch wrote: > Hi All, > > I came across the related old conversation on the above issue ( > https://issues.apache.org/jira/browse/SPARK-5594. ) Is the issue fixed?

Re: Replacing Esper with Spark Streaming?

2015-09-14 Thread Todd Nist
Stratio offers a CEP implementation based on Spark Streaming and the Siddhi CEP engine. I have not used the below, but they may be of some value to you: http://stratio.github.io/streaming-cep-engine/ https://github.com/Stratio/streaming-cep-engine HTH. -Todd On Sun, Sep 13, 2015 at 7:49 PM, O

application failed on large dataset

2015-09-14 Thread 周千昊
Hi, community I am facing a strange problem: all executors do not respond, and then all of them fail with ExecutorLostFailure. When I look into the yarn logs, they are full of exceptions like 15/09/14 04:35:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetc

Fwd: Spark job failed

2015-09-14 Thread Renu Yadav
-- Forwarded message -- From: Renu Yadav Date: Mon, Sep 14, 2015 at 4:51 PM Subject: Spark job failed To: d...@spark.apache.org I am getting below error while running spark job: storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /data/vol5/h

Twitter Streaming using the Twitter Public Streaming API and Apache Spark

2015-09-14 Thread Sadaf
Hi, I wanna fetch PUBLIC tweets (not particular to any account) containing any particular HASHTAG (#) (i.e "CocaCola" in my case) from twitter. I made an APP on twitter to get the credentials, and then used Twitter Public Streaming API. Below is the piece of code. { val config = new twitter4j.

hdfs-ha on mesos - odd bug

2015-09-14 Thread Adrian Bridgett
I'm hitting an odd issue with running spark on mesos together with HA-HDFS, with an even odder workaround. In particular I get an error that it can't find the HDFS nameservice unless I put in a _broken_ url (discovered that workaround by mistake!). core-site.xml, hdfs-site.xml is distributed

Re: Spark job failed

2015-09-14 Thread Ted Yu
Have you considered posting on the vendor forum? FYI On Mon, Sep 14, 2015 at 6:09 AM, Renu Yadav wrote: > > -- Forwarded message -- > From: Renu Yadav > Date: Mon, Sep 14, 2015 at 4:51 PM > Subject: Spark job failed > To: d...@spark.apache.org > > > I am getting below error while

Mailing List - This post has NOT been accepted by the mailing list yet.

2015-09-14 Thread defstat
Why do I get the "This post has NOT been accepted by the mailing list yet." message, even though I have been subscribed to this mailing list? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mailing-List-This-post-has-NOT-been-accepted-by-the-mailing

Approach validation - building merged datasets - Spark SQL

2015-09-14 Thread Vajra L
Folks- I am very new to Spark and Spark-SQL. Here is what I am doing in my application. Can you please validate and let me know if there is a better way? 1. Parsing XML files with nested structures, ingested, into individual datasets Created a custom input format to split XML so each node beco

Where can I learn how to write udf?

2015-09-14 Thread Saif.A.Ellafi
Hi all, I am failing to find a proper guide or tutorial on how to write proper UDF functions in Scala. Appreciate the effort saving, Saif

Re: Where can I learn how to write udf?

2015-09-14 Thread Silvio Fiorito
Hi Saif, There are 2 types of UDFs. Those used by SQL and those used by the Scala DSL. For SQL, you just register a function like so (this example is from the docs): sqlContext.udf.register("strLen", (s: String) => s.length) sqlContext.sql("select name, strLen(name) from people").show The othe
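A sketch of both styles side by side (df and sqlContext are assumed):

    import org.apache.spark.sql.functions.udf

    // SQL style: register by name, then call it from a SQL string.
    sqlContext.udf.register("strLen", (s: String) => s.length)
    sqlContext.sql("select name, strLen(name) from people").show()

    // DSL style: wrap the function with `udf` and apply it as a Column.
    val strLen = udf((s: String) => s.length)
    df.select(df("name"), strLen(df("name"))).show()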

JavaRDD using Reflection

2015-09-14 Thread Rachana Srivastava
Hello all, I am working on a problem that requires us to create different sets of JavaRDD based on different input arguments. We are getting the following error when we try to use a factory to create JavaRDD. The error message is clear but I am wondering if there is any workaround. Question: How to create

Creating fat jar with all resources (Spark-Java-Maven)

2015-09-14 Thread Vipul Rai
Hi All, I have a spark app written in Java, which parses the incoming log using the headers which are in .xml. (There are many headers and logs are from 15-20 devices in various formats and separators). I am able to run it in local mode after specifying all the resources and passing it as paramete

A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stages", ones that appear to be hanging? In other words, is there a timeout for processing of a given RDD? In the Spark GUI, I see the "kill" function for a given Stage under 'Details for Job <...>'. Is there something in Spark that w

Re: JavaRDD using Reflection

2015-09-14 Thread Ankur Srivastava
Hi Rachana, I didn't get your question fully, but as the error says you cannot perform an RDD transformation or action inside another transformation. In your example you are performing an action "rdd2.values.count()" inside the "map" transformation. It is not allowed and in any case this will be v
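A common workaround when the second RDD is small enough: collect it to the driver and broadcast it, so the map closure only touches a plain Map (rdd2 is assumed to be a pair RDD; lines, parseKey and defaultScore are placeholders):

    val lookup = sc.broadcast(rdd2.collectAsMap())

    val result = lines.map { line =>
      val key = parseKey(line)
      (key, lookup.value.getOrElse(key, defaultScore))
    }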

Parse tab-separated file including JSON efficiently

2015-09-14 Thread matthes
I am trying to parse a tab-separated file in Spark 1.5, with a JSON section, as efficiently as possible. The file looks as follows: value1 <tab> value2 <tab> {json} How can I parse all fields including the JSON fields into an RDD directly? If I use this piece of code: val jsonCol = sc.textFile("/data/input").map(l => l.spl
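One possible shape in Spark 1.5, assuming the JSON sits in the third tab-separated column:

    // Split each line on tabs, pull out the JSON column, and let the JSON
    // reader infer a schema from the strings in one pass.
    val raw = sc.textFile("/data/input").map(_.split("\t"))
    val jsonDF = sqlContext.read.json(raw.map(cols => cols(2)))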

Trouble using dynamic allocation and shuffle service.

2015-09-14 Thread Philip Weaver
Hello, I am trying to use dynamic allocation which requires the shuffle service. I am running Spark on mesos. Whenever I set spark.shuffle.service.enabled=true, my Spark driver fails with an error like this: Caused by: java.net.ConnectException: Connection refused: devspark1/ 172.26.21.70:7337 I

Re: Trouble using dynamic allocation and shuffle service.

2015-09-14 Thread Tim Chen
Hi Philip, I've included documentation in the Spark/Mesos doc ( http://spark.apache.org/docs/latest/running-on-mesos.html), where you can start the MesosShuffleService with sbin/start-mesos-shuffle-service.sh script. The shuffle service needs to be started manually for Mesos on each slave (one wa

Re: Trouble using dynamic allocation and shuffle service.

2015-09-14 Thread Philip Weaver
Ah, I missed that, thanks! On Mon, Sep 14, 2015 at 11:45 AM, Tim Chen wrote: > Hi Philip, > > I've included documentation in the Spark/Mesos doc ( > http://spark.apache.org/docs/latest/running-on-mesos.html), where you can > start the MesosShuffleService with sbin/start-mesos-shuffle-service.sh

Spark Streaming application code change and stateful transformations

2015-09-14 Thread Ofir Kerker
Hi, My Spark Streaming application consumes messages (events) from Kafka every 10 seconds using the direct stream approach and aggregates these messages into hourly aggregations (to answer analytics questions like: "How many users from Paris visited page X between 8PM to 9PM") and save the data to

Re: Spark Streaming application code change and stateful transformations

2015-09-14 Thread Cody Koeninger
Solution 2 sounds better to me. You aren't always going to have graceful shutdowns. On Mon, Sep 14, 2015 at 1:49 PM, Ofir Kerker wrote: > Hi, > My Spark Streaming application consumes messages (events) from Kafka every > 10 seconds using the direct stream approach and aggregates these messages

Re: JavaRDD using Reflection

2015-09-14 Thread Ajay Singal
Hello Rachana, The easiest way would be to start with creating a 'parent' JavaRDD and run different filters (based on different input arguments) to create respective 'child' JavaRDDs dynamically. Notice that the creation of these children RDDs is handled by the application driver. Hope this help
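A sketch of that pattern, with the filter predicates as stand-ins for whatever the input arguments select:

    // One cached parent; children are carved out lazily by filters.
    val parent = sc.textFile("/data/events").cache()

    def childFor(kind: String) = kind match {
      case "errors" => parent.filter(_.contains("ERROR"))
      case "warns"  => parent.filter(_.contains("WARN"))
      case _        => parent
    }

    val errors = childFor("errors")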

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Ricardo Paiva
Hi Cody, Thanks for your answer. I had already tried to change the spark submit parameters, but I double checked to reply your answer. Even changing properties file or directly on the spark-submit arguments, none of them work when the application runs from the checkpoint. It seems that everything

union streams not working for streams > 3

2015-09-14 Thread Василец Дмитрий
hello I have 4 streams from kafka and streaming is not working, without any errors or logs, but with 3 streams everything is perfect. Only the number of streams seems to matter; different triple combinations always work. Any ideas how to debug or fix it?

Re: JavaRDD using Reflection

2015-09-14 Thread Ankur Srivastava
It is not reflection that is the issue here but use of an RDD transformation "featureKeyClassPair.map" inside "lines.mapToPair". From the code snippet you have sent it is not very clear if getFeatureScore(id,data) invokes executeFeedFeatures, but if that is the case it is not very obvious that “d

using existing R packages from SparkR

2015-09-14 Thread bobtreacy
I am trying to use an existing R package in SparkR. I am trying to follow the example at https://amplab-extras.github.io/SparkR-pkg/ in the section "Using existing R packages". Here is the sample from amplab-extras: generateSparse <- function(x) { # Use sparseMatrix function from the Matrix p

Re: union streams not working for streams > 3

2015-09-14 Thread Gerard Maas
How many cores are you assigning to your spark streaming job? On Mon, Sep 14, 2015 at 10:33 PM, Василец Дмитрий wrote: > hello > I have 4 streams from kafka and streaming not working. > without any errors or logs > but with 3 streams everything perfect. > make sense only amount of streams , diff

Re: hdfs-ha on mesos - odd bug

2015-09-14 Thread Sam Bessalah
I don't know about the broken url. But are you running HDFS as a mesos framework? If so is it using mesos-dns? Then you should resolve the namenode via hdfs:/// On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett wrote: > I'm hitting an odd issue with running spark on mesos together with > HA-

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Cody Koeninger
Yeah, looks like you're right about being unable to change those. Upon further reading, even though StreamingContext.getOrCreate makes an entirely new spark conf, Checkpoint will only reload certain properties. I'm not sure if it'd be safe to include memory / cores among those properties that get

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread roni
Thanks Akhil. Very good article. On Mon, Sep 14, 2015 at 4:15 AM, Akhil Das wrote: > You can look into this doc > regarding the > connection (its for gce though but it should be similar). > > Thanks > Best Regards > > On Thu, Sep 10, 2015

Re: Null Value in DecimalType column of DataFrame

2015-09-14 Thread Yin Huai
btw, move it to user list. On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai wrote: > A scale of 10 means that there are 10 digits at the right of the decimal > point. If you also have precision 10, the range of your data will be [0, 1) > and casting "10.5" to DecimalType(10, 10) will return null, which
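A quick illustration of the precision/scale interaction (df with a string column "value" is assumed):

    import org.apache.spark.sql.types.DecimalType

    // Scale 10 with precision 10 leaves zero digits left of the decimal point,
    // so any value >= 1 (such as "10.5") becomes null after the cast.
    val tooNarrow = df.select(df("value").cast(DecimalType(10, 10)))
    // Precision 12, scale 2 leaves ten integer digits, so 10.5 fits.
    val fits = df.select(df("value").cast(DecimalType(12, 2)))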

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Ricardo Paiva
Thanks Cody. You confirmed that I'm not doing something wrong. I will keep investigating and if I find something I let everybody know. Thanks again. Regards, Ricardo On Mon, Sep 14, 2015 at 6:29 PM, Cody Koeninger wrote: > Yeah, looks like you're right about being unable to change those. Up

add external jar file to Spark shell vs. Scala Shell

2015-09-14 Thread Lan Jiang
Hi, there I ran into a problem when I try to pass external jar file to spark-shell. I have a uber jar file that contains all the java codes I created for protobuf and all its dependency. If I simply execute my code using Scala Shell, it works fine without error. I use -cp to pass the extern

How to convert dataframe to a nested StructType schema

2015-09-14 Thread Hao Wang
Hi, I created a dataframe with 4 string columns (city, state, country, zipcode). I then applied the following nested schema to it by creating a custom StructType. When I run df.take(5), it gives the exception below as expected. The question is how I can convert the Rows in the dataframe to confor

RE: Best way to merge final output part files created by Spark job

2015-09-14 Thread java8964
For text files, this merge works fine, but for binary formats like "ORC", "Parquet" or "AVRO", not sure this will work. These kinds of formats in fact are not append-able, as they write the detail data information either in the head or at the tail part of the file. You have to use the format specified A
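An alternative for those formats is to avoid the merge entirely and have Spark emit a single part file, at the cost of funneling the final write through one task (only sensible for modest output sizes; df is assumed):

    df.coalesce(1).write.parquet("/output/merged-parquet")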

Spark aggregateByKey Issues

2015-09-14 Thread 毕岩
Hi: There is one case of using a reduce operation like this: I need to reduce large data made up of billions of records with a Key-Value pair. First, group by Key, and the records with the same Key need to be in order of one field called "date" in Value. Sec

Re: union streams not working for streams > 3

2015-09-14 Thread Василец Дмитрий
I use local[*]. And I have 4 cores on my laptop. On 14 Sep 2015 23:19, "Gerard Maas" wrote: > How many cores are you assigning to your spark streaming job? > > On Mon, Sep 14, 2015 at 10:33 PM, Василец Дмитрий < > pronix.serv...@gmail.com> wrote: > >> hello >> I have 4 streams from kafka and streami
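That combination may be the problem: each receiver-based stream pins one core, so four receivers on a 4-core local[*] leave no core for batch processing and the job stalls silently. A sketch of the usual setup (zkQuorum, group and topics are assumptions):

    // With receiver-based streams, cores must exceed the number of receivers.
    val streams = (1 to 4).map(_ => KafkaUtils.createStream(ssc, zkQuorum, group, topics))
    val unified = ssc.union(streams)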

Spark Streaming Suggestion

2015-09-14 Thread srungarapu vamsi
I am pretty new to spark. Please suggest a better model for the following use case. I have a few (about 1500) devices in the field which keep emitting about 100KB of data every minute. The nature of data sent by the devices is just a list of numbers. As of now, we have Storm in the architecture, which

Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hi all, I have a question regarding the ability of ML pipeline to cache intermediate results. I've posted this question on stackoverflow but got no answer, hope someone here can help me out.

Re: Spark aggregateByKey Issues

2015-09-14 Thread Alexis Gillain
I'm not sure about what you want to do. You should try to sort the RDD by (yourKey, date); it ensures that all the keys are in the same partition. Your problem after that is that you want to aggregate only on yourKey, and if you change the key of the sorted RDD you lose the partitioning. Depending on the
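One way to get that ordering without losing the grouping is a secondary sort: partition on the primary key only, but sort within partitions by (key, date). A sketch, assuming records of (key, date, value):

    import org.apache.spark.Partitioner

    // Route rows by the primary key so all dates for a key share a partition.
    class KeyPartitioner(partitions: Int) extends Partitioner {
      def numPartitions = partitions
      def getPartition(key: Any) = key match {
        case (k: String, _) => (k.hashCode % partitions + partitions) % partitions
      }
    }

    val keyed = records.map { case (k, date, v) => ((k, date), v) }
    val sorted = keyed.repartitionAndSortWithinPartitions(new KeyPartitioner(200))
    // Each partition now yields whole key groups in date order, so running
    // state per key can be carried with mapPartitions.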

Change protobuf version or any other third party library version in Spark application

2015-09-14 Thread Lan Jiang
Hi, there, I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by default. However, I would like to use Protobuf 3 in my spark application so that I can use some new features such as Map support. Is there anyway to do that? Right now if I build a uber.jar with dependencies includ

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
Lewis, Many pipeline stages implement save/load methods, which can be used if you instantiate and call the underlying pipeline stages' `transform` methods individually (instead of using the Pipeline.setStages API). See the associated JIRAs. Pipeline p

Setting Executor memory

2015-09-14 Thread Thomas Gerber
Hello, I was looking for guidelines on what value to set executor memory to (via spark.executor.memory for example). This seems to be important to avoid OOM during tasks, especially in no swap environments (like AWS EMR clusters). This setting is really about the executor JVM heap. Hence, in ord
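A sketch of the knobs involved, with illustrative numbers for a 16 GB node on YARN:

    import org.apache.spark.SparkConf

    // spark.executor.memory sizes only the JVM heap; the container also pays
    // off-heap overhead (spark.yarn.executor.memoryOverhead, in MB), so the
    // two together must fit inside what the node manager offers.
    val conf = new SparkConf()
      .set("spark.executor.memory", "12g")
      .set("spark.yarn.executor.memoryOverhead", "1536")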

Re: Spark Streaming Suggestion

2015-09-14 Thread Jörn Franke
Why did you not stay with the batch approach? For me the architecture looks very complex for such a simple thing you want to achieve. Why don't you process the data already in Storm? On Tue, Sep 15, 2015 at 6:20 AM, srungarapu vamsi wrote: > I am pretty new to spark. Please suggest a better model fo

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hey Feynman, Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for. What I need to cache and reuse are the intermediate outputs of transformations, not transformer themselves. Do you know any related dev. activities or plans? Best, Lewis 2015-09-1

Re: Spark aggregateByKey Issues

2015-09-14 Thread biyan900116
Hi Alexis: Thank you for your reply. My case is that each operation on one record needs to depend on a value set by the operation on the previous record. So your advice is that I can use "sortByKey". "sortByKey" will put all records with the same key in one partition. Need I take

why spark and kafka always crash

2015-09-14 Thread Joanne Contact
How to prevent it?

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
You can persist the transformed DataFrames, for example: val data: DF = ... val hashedData = hashingTF.transform(data) hashedData.cache() // to cache DataFrame in memory Future usage of hashedData reads from the in-memory cache now. You can also persist to disk, e.g.: hashedData.write.parquet(FileP

Re: Spark Streaming Suggestion

2015-09-14 Thread srungarapu vamsi
The batch approach I had implemented takes about 10 minutes to complete all the pre-computation tasks for one hour's worth of data. When I went through my code, I figured out that the most time-consuming tasks are the ones which read data from cassandra and the places where I perform sparkCon

[ANNOUNCE] Apache Gora 0.6.1 Release

2015-09-14 Thread lewis john mcgibbney
Hi All, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.6.1. What is Gora? Gora is a framework which provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and

How to speed up MLlib LDA?

2015-09-14 Thread Marko Asplund
Hi, I'm trying out MLlib LDA training with 100 topics, 105K vocabulary size and ~3.4M documents using EMLDAOptimizer. Training the model took ~2.5 hours with MLlib, whereas Vowpal Wabbit training with the same data on the same system took ~5 minutes. I realize that there are diffe
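One avenue worth trying is the online optimizer, which works on mini-batches instead of the full (doc, term) state EM keeps; a sketch (corpus is an assumed RDD[(Long, Vector)]):

    import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

    val lda = new LDA()
      .setK(100)
      .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
    val model = lda.run(corpus)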

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Akhil Das
As of now I think it's a no. Not sure if it's a naive approach, but yes you can have a separate program to keep an eye on the webui (possibly parsing the content) and make it trigger the kill task/job once it detects a lag. (Again you will have to figure out the correct numbers before killing any job
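A driver-side watchdog is another way to approximate a timeout without parsing the UI: run the work under a job group and cancel the group after a deadline (a sketch; the group name and deadline are arbitrary):

    import java.util.{Timer, TimerTask}

    // Actions submitted from this thread now belong to the group.
    sc.setJobGroup("hourly-batch", "batch with timeout", interruptOnCancel = true)

    val timer = new Timer(true)
    timer.schedule(new TimerTask {
      def run(): Unit = sc.cancelJobGroup("hourly-batch")
    }, 30 * 60 * 1000L) // 30-minute deadline

    rdd.count() // placeholder action
    timer.cancel()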

Re: why spark and kafka always crash

2015-09-14 Thread Akhil Das
Can you be more precise? Thanks Best Regards On Tue, Sep 15, 2015 at 11:28 AM, Joanne Contact wrote: > How to prevent it?