Re: DirectFileOutputCommiter

2016-02-29 Thread Takeshi Yamamuro
Hi, I think the essential culprit is that these committers are not idempotent; retry attempts will fail. See the code below for details: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L130 On Sat, Feb 27, 2016 at 7

Re: Spark Integration Patterns

2016-02-29 Thread moshir mikael
Well, I have a personal project where I want to build a *spreadsheet* on top of Spark. I have a version of my app running on PostgreSQL, which does not scale, and I would like to move the data processing to Spark. You can import data, explore data, analyze data, visualize data... You don't need to be a

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread Ashok Kumar
Thank you all for the valuable advice. Much appreciated. Best On Sunday, 28 February 2016, 21:48, Ashok Kumar wrote:   Hi Gurus, I'd appreciate it if you could recommend a good book on Spark, or documentation, for beginner to moderate knowledge. I would very much like to skill myself on transformation and act

Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir, I think you can use the REST API provided with Spark: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala Unfortunately, I haven't found any documentation, but it looks fine. Thanks, Alex. On Sun, Feb 28, 2016 at 3:25

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-29 Thread Abhishek Anand
Hi Ryan, I was able to resolve this issue. The /tmp location was mounted with the "noexec" option. Removing this noexec in the fstab resolved the issue. The snappy shared object file is created at the /tmp location, so either removing the noexec from the mount or changing the default temp location solved t

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Abhishek Anand
Hi Ryan, It's not working even after removing the reduceByKey. So, basically I am doing the following - reading from Kafka - flatMap inside transform - mapWithState - rdd.count on the output of mapWithState. But to my surprise I still don't see checkpointing taking place. Is there any restriction to the

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread charles li
Since Spark is under active development, any book you pick up will be somewhat outdated. I would suggest learning it in several ways, as below: - the official Spark documentation; trust me, you will go through it several times if you want to learn it well: http://spark.

[Error]: Spark 1.5.2 + HiveHbase Integration

2016-02-29 Thread Divya Gehlot
Hi, I am trying to access a Hive table which has been created using HbaseIntegration. I am able to access the data in the Hive CLI, but when I try to access the table using the HiveContext of Spark I get the following

Unresolved dep when building project with spark 1.6

2016-02-29 Thread Hao Ren
Hi, I am upgrading my project to Spark 1.6. It seems that the deps are broken. Deps used in sbt: val scalaVersion = "2.10" val sparkVersion = "1.6.0" val hadoopVersion = "2.7.1" // Libraries val scalaTest = "org.scalatest" %% "scalatest" % "2.2.4" % "test" val sparkSql = "org.apache.spark" %%
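For reference, a minimal sketch of how such an sbt dependency block can be completed for Spark 1.6 (the module names after the truncation above, and the "provided" scoping, are assumptions, not taken from the original message):

    val sparkVersion  = "1.6.0"
    val hadoopVersion = "2.7.1"

    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % sparkVersion  % "provided", // assumed module
      "org.apache.spark"  %% "spark-sql"     % sparkVersion  % "provided", // assumed module
      "org.apache.hadoop" %  "hadoop-client" % hadoopVersion,              // assumed module
      "org.scalatest"     %% "scalatest"     % "2.2.4"       % "test"
    )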

Re: DirectFileOutputCommiter

2016-02-29 Thread Steve Loughran
> On 26 Feb 2016, at 06:24, Takeshi Yamamuro wrote: > > Hi, > > Great work! > What is the concrete performance gain of the committer on S3? > I'd like to know. > > I think there is no direct committer for files because these kinds of > committers have risks > of losing data (See: SPARK-10063). >

[Help]: Steps to access hive table + Spark 1.5.2 + HbaseIntegration + Hive 1.2 + Hbase 1.1

2016-02-29 Thread Divya Gehlot
Hi, Can anybody help me by sharing steps/examples of how to connect to a Hive table (which is being created using HbaseIntegration) through HiveContext in Spark? I googled but couldn't find a single example/document. Would really a

What is the best approach to perform concurrent updates from different jobs to a in memory dataframe registered as a temp table?

2016-02-29 Thread Roger Marin
Hi all, I have multiple (>100) jobs running concurrently (sharing the same HiveContext), each appending new rows to the same DataFrame registered as a temp table. Currently I am using unionAll and registering the resulting DataFrame again as a temp table in each job: Given an existing dataframe
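For context, the append pattern being described looks roughly like this (a sketch; the table name and the shared HiveContext are assumed, and note that the logical plan grows with every union):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.hive.HiveContext

    // each concurrent job runs something like this against the shared context
    def appendRows(hc: HiveContext, newRows: DataFrame): Unit = {
      val combined = hc.table("shared_rows").unionAll(newRows) // "shared_rows" is hypothetical
      combined.registerTempTable("shared_rows")                // re-register the grown view
    }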

spark lda runs out of disk space

2016-02-29 Thread TheGeorge1918 .
Hi guys, I was running LDA with 2000 topics on 6GB of compressed data, roughly 1.2 million docs. I used three AWS r3.8xlarge machines as core nodes. It turned out the Spark application crashed after 3 or 4 iterations. Ganglia indicated the disk space was all consumed. I believe it's the shuffle data

Deadlock between UnifiedMemoryManager and BlockManager

2016-02-29 Thread Sea
Hi all, My Spark version is 1.6.0. I found a deadlock in a production environment; can anyone help? I created an issue in JIRA: https://issues.apache.org/jira/browse/SPARK-13566 === "block-manager-slave-async-thread-pool-1": at org.apach

Implementation of random walk algorithm in Spark

2016-02-29 Thread naveenkumarmarri
Hi, I'm new to Spark. I'm trying to compute similarity between users/products. I have a huge table which I can't self-join with the cluster I have, so I'm trying to implement the self join using a random walk methodology, which will give approximate results. The table is a bipartite graph with

Spark on Windows platform

2016-02-29 Thread gaurav pathak
Can someone guide me through the steps and information regarding installation of Spark on Windows 7/8.1/10, as well as on Windows Server? Also, it would be great to read about your experiences using Spark on the Windows platform. Thanks & Regards, Gaurav Pathak

Re: [Error]: Spark 1.5.2 + HiveHbase Integration

2016-02-29 Thread Ted Yu
Divya: Please try not to cross-post your question. In your case the hbase-common jar is needed. To find all the HBase jars needed, you can run 'mvn dependency:tree' and check its output. > On Feb 29, 2016, at 1:48 AM, Divya Gehlot wrote: > > Hi, > I am trying to access hive table which been crea

Re: Spark on Windows platform

2016-02-29 Thread Jörn Franke
I think Hortonworks has a Windows Spark distribution. Maybe Bigtop as well? > On 29 Feb 2016, at 14:27, gaurav pathak wrote: > > Can someone guide me the steps and information regarding, installation of > SPARK on Windows 7/8.1/10 , as well as on Windows Server. Also, it will be > great to re

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Lior Chaga
Hi Koret, Try spark.shuffle.reduceLocality.enabled=false This is an undocumented configuration. See: https://github.com/apache/spark/pull/8280 https://issues.apache.org/jira/browse/SPARK-10567 It solved the problem for me (both with and without memory legacy mode) On Sun, Feb 28, 2016 at 11:16 P
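For anyone trying this, the flag can be set like any other conf before the context is created (a sketch; it can equally be passed via --conf on spark-submit):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-app") // hypothetical name
      .set("spark.shuffle.reduceLocality.enabled", "false")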

Re: Spark on Windows platform

2016-02-29 Thread gaurav pathak
Thanks Jorn. Any guidance on how to get started with Spark on Windows is highly appreciated. Thanks & Regards Gaurav Pathak ~ sent from handheld device On Feb 29, 2016 5:34 AM, "Jörn Franke" wrote: > I think Hortonworks has a Windows Spark distribution. Maybe Bigtop as well? > > > On

Re: Spark Integration Patterns

2016-02-29 Thread moshir mikael
Hi Alex, thanks for the link. Will check it. Does someone know of a more streamlined approach? On Mon, 29 Feb 2016 at 10:28, Alex Dzhagriev wrote: > Hi Moshir, > > I think you can use the rest api provided with Spark: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/

Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir, Regarding streaming, you can take a look at Spark Streaming, the micro-batching framework. If it satisfies your needs, it has a bunch of integrations; the source for the jobs could be Kafka, Flume or Akka. Cheers, Alex. On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael wrot

Re: Spark Integration Patterns

2016-02-29 Thread moshir mikael
Thanks, will check too; however, I just want to use Spark core RDDs and standard data sources. On Mon, 29 Feb 2016 at 14:54, Alex Dzhagriev wrote: > Hi Moshir, > > Regarding the streaming, you can take a look at the spark streaming, the > micro-batching framework. If it satisfies your needs i

kafka + mysql filtering problem

2016-02-29 Thread franco barrientos
Hi all, I want to read some filtering rules from MySQL (JDBC MySQL driver); specifically, it's a char-type field containing a field name and value to apply to a Kafka streaming input. The main idea is to drive this from a web UI (Livy server). Any suggestions or guidelines? e.g., I have this: object St

Re: Spark on Windows platform

2016-02-29 Thread Gaurav Agarwal
> Hi > I am running Spark on Windows, but a standalone one. > > Use this code: > > SparkConf conf = new SparkConf().setMaster("local[1]").setAppName("spark").setSparkHome("c:/spark/bin/spark-submit.cmd"); > > where sparkHome is the path where you extracted your Spark binaries, up to bin/*.cmd > > You will

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Yin Yang
The default value for spark.shuffle.reduceLocality.enabled is true. To reduce surprise to users of 1.5 and earlier releases, should the default value be set to false? On Mon, Feb 29, 2016 at 5:38 AM, Lior Chaga wrote: > Hi Koret, > Try spark.shuffle.reduceLocality.enabled=false > This is an un

[MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Hello guys, I think the default loss algorithm is Squared Error for regression, but how do I change that to Absolute Error in Java? Could you please show me an example?

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
You can use the constructor that accepts a BoostingStrategy object, which will allow you to set the tree strategy (and other hyperparameters as well). *GradientBoostedTrees

Re: kafka + mysql filtering problem

2016-02-29 Thread Cody Koeninger
You're getting confused about what code is running on the driver vs what code is running on the executor. Read http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka On Mon, Feb 29, 2016 at 8:00 AM, franco barrientos < franco.barrien...@exalitica.
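A sketch of that split for this use case (the JDBC URL, table, and rule column are assumptions, not from the original code): read the small rule set on the driver, broadcast it, and reference only the broadcast value inside the closure that runs on the executors.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    def applyRules(sqlContext: SQLContext, stream: DStream[String]): DStream[String] = {
      // driver side: pull the rules out of MySQL once
      val rules = sqlContext.read.format("jdbc")
        .options(Map(
          "url"     -> "jdbc:mysql://host:3306/rulesdb", // hypothetical
          "dbtable" -> "filter_rules"))                  // hypothetical
        .load()
        .collect()
      val rulesBc = stream.context.sparkContext.broadcast(rules)
      // executor side: the closure captures only the broadcast handle
      stream.filter(record => rulesBc.value.exists(r => record.contains(r.getString(0))))
    }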

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Hi Kevin, Yes, I've set the boostingStrategy like that using the example. But I'm not sure how to create and pass the Loss object, e.g. boostingStrategy.setLoss(..); Not sure how to pass the selected Loss. How do I set Absolute Error in the setLoss() function? On 29 February 2016 at 15:

Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
Yes, I read that and there are not many details there. Is it true that we need to have Spark installed on each Mesos docker container (master and slave)? ... Ashish On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen wrote: > https://spark.apache.org/docs/latest/running-on-mesos.html should be the > best source, w

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
I believe that you can instantiate an instance of the AbsoluteError class for the *Loss* object, since that class implements the Loss interface. For example: val loss = new AbsoluteError() boostingStrategy.setLoss(loss) On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru wrote: > Hi Kevin, > > Ye

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
The AbsoluteError() constructor is undefined. I'm using Spark 1.3.0; maybe it is not ready for this version? On 29 February 2016 at 15:38, Kevin Mellott wrote: > I believe that you can instantiate an instance of the AbsoluteError class > for the *Loss* object, since that object implements the Los

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
Looks like it should be present in 1.3 at org.apache.spark.mllib.tree.loss.AbsoluteError spark.apache.org/docs/1.3.0/api/java/org/apache/spark/mllib/tree/loss/AbsoluteError.html On Mon, Feb 29, 2016 at 9:46 AM, diplomatic Guru wrote: > AbsoluteError() constructor is undefined. > > I'm using Spa

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
It's strange, as you are correct that the doc does state it. But it's complaining about the constructor. When I clicked on the org.apache.spark.mllib.tree.loss.AbsoluteError class, this is what I saw: @Since("1.2.0") @DeveloperApi object AbsoluteError extends Loss { /** * Method to calculate the

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
I found a helper class that I think should do the trick. Take a look at https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Losses.scala When passing the Loss, you should be able to do something like: Losses.fromString("leastSquaresError") On Mon, Fe
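Putting the thread together, a sketch of the resulting setup (per Losses.scala, "leastAbsoluteError" maps to the AbsoluteError singleton and "leastSquaresError" to SquaredError):

    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache.spark.mllib.tree.loss.Losses

    val boostingStrategy = BoostingStrategy.defaultParams("Regression")
    // fromString returns the singleton Loss objects, sidestepping the
    // "constructor is undefined" problem with AbsoluteError
    boostingStrategy.setLoss(Losses.fromString("leastAbsoluteError"))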

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread diplomatic Guru
Thank you very much Kevin. On 29 February 2016 at 16:20, Kevin Mellott wrote: > I found a helper class that I think should do the trick. Take a look at > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Losses.scala > > When passing the Loss, yo

Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
Fellow Sparkers, I'm trying to "flatten" my view of data within a DataFrame, and am having difficulties doing so. The DataFrame contains product information, which includes multiple levels of categories (primary, secondary, etc). *Example Data (Raw):* *NameLevelCat

Optimizing cartesian product using keys

2016-02-29 Thread eahlberg
Hello, To avoid computing all possible combinations, I'm trying to group values according to a certain key, and then compute the cartesian product of the values for each key, i.e.: Input [(k1, [v1]), (k1, [v2]), (k2, [v3])] Desired output: [(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)] Cur
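A sketch of the per-key expansion being asked about (assuming the input has first been flattened to an RDD[(K, V)]): group the values by key, then form the cross product locally within each group.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def perKeyCartesian[K: ClassTag, V: ClassTag](input: RDD[(K, V)]): RDD[(V, V)] =
      input.groupByKey().flatMap { case (_, vs) =>
        for (a <- vs; b <- vs) yield (a, b) // pairs only within one key's group
      }

Note this still materializes each group on a single executor, so one very hot key can dominate a partition.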

Re: Spark 1.5 on Mesos

2016-02-29 Thread Tim Chen
No, you don't have to run Mesos in docker containers to run Spark in docker containers. Once you have a Mesos cluster running, you can then specify the Spark configurations in your Spark job (i.e.: spark.mesos.executor.docker.image=mesosphere/spark:1.6) and Mesos will automatically launch docker contai

Re: Flattening Data within DataFrames

2016-02-29 Thread Michał Zieliński
Hi Kevin, This should help: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html On 29 February 2016 at 16:54, Kevin Mellott wrote: > Fellow Sparkers, > > I'm trying to "flatten" my view of data within a DataFrame, and am having > difficulties doing so. The DataFrame c

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Koert Kuipers
Setting spark.shuffle.reduceLocality.enabled=false worked for me, thanks. Is there any reference to the benefits of setting reduceLocality to true? I am tempted to disable it across the board. On Mon, Feb 29, 2016 at 9:51 AM, Yin Yang wrote: > The default value for spark.shuffle.reduceLocality.

Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
What is the best practice? I have everything running as docker containers on a single host (Mesos and Marathon also as docker containers) and everything comes up fine, but when I try to launch the spark shell I get the error below. SQL context available as sqlContext. scala> val data = sc.parallelize(

Re: Unresolved dep when building project with spark 1.6

2016-02-29 Thread Josh Rosen
Have you tried removing the leveldbjni files from your local ivy cache? My hunch is that this is a problem with some local cache state rather than the dependency simply being unavailable / not existing (note that the error message was "origin location must be absolute:[...]", not that the files cou

Re: Spark 1.5 on Mesos

2016-02-29 Thread Sathish Kumaran Vairavelu
Maybe the Mesos executor couldn't find the Spark image, or the constraints are not satisfied. Check your Mesos UI to see if the Spark application appears in the Frameworks tab. On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni wrote: > What is the Best practice , I have everything running as docker container > in sin

Re: LDA topic Modeling spark + python

2016-02-29 Thread Bryan Cutler
The input into LDA.train needs to be an RDD of a list whose first element is an integer (id) and whose second is a pyspark.mllib.Vector object containing real numbers (term counts), i.e. an RDD of [doc_id, vector_of_counts]. From your example, it looks like your corpus is a list with a zero-based id,
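The Scala MLlib equivalent of that shape, for comparison (a minimal sketch assuming a SparkContext sc): an RDD of (documentId, termCountVector) pairs.

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    // one (id, term-count vector) pair per document
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(1.0, 0.0, 3.0)),
      (1L, Vectors.dense(2.0, 1.0, 0.0))))
    val model = new LDA().setK(2).run(corpus)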

Spark for client

2016-02-29 Thread Mich Talebzadeh
Hi, Is there such a thing as Spark for clients, much like the RDBMS clients that are cut-down versions of their big brothers, useful for client connectivity but not usable as a server? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8

Re: Spark for client

2016-02-29 Thread Sabarish Sasidharan
Zeppelin? Regards Sab On 01-Mar-2016 12:27 am, "Mich Talebzadeh" wrote: > Hi, > > Is there such thing as Spark for client much like RDBMS client that have > cut down version of their big brother useful for client connectivity but > cannot be used as server. > > Thanks > > > Dr Mich Talebzadeh >

Re: Spark for client

2016-02-29 Thread Minudika Malshan
Hi, I think the Zeppelin Spark interpreter will give a solution to your problem. Regards, Minudika Minudika Malshan Undergraduate Department of Computer Science and Engineering University of Moratuwa. *Mobile : +94715659887* *LinkedIn* : https://lk.linkedin.com/in/minudika On Tue, Mar 1, 2016 at

Re: Spark for client

2016-02-29 Thread Minudika Malshan
+Adding resources https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html https://zeppelin.incubator.apache.org Minudika Malshan Undergraduate Department of Computer Science and Engineering University of Moratuwa. *Mobile : +94715659887* *LinkedIn* : https://lk.linkedin.com/in/min

RE: Spark Integration Patterns

2016-02-29 Thread skaarthik oss
Check out http://toree.incubator.apache.org/. It might help with your need. From: moshir mikael [mailto:moshir.mik...@gmail.com] Sent: Monday, February 29, 2016 5:58 AM To: Alex Dzhagriev Cc: user Subject: Re: Spark Integration Patterns Thanks, will check too, however : just want to use

Re: Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
Thanks Michal - this is exactly what I need. On Mon, Feb 29, 2016 at 11:40 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > Hi Kevin, > > This should help: > > https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html > > On 29 February 2016 at 16:54, Kevin Mellot

Re: Spark Integration Patterns

2016-02-29 Thread Alexander Pivovarov
There is spark-jobserver (SJS), which is a REST interface for Spark and Spark SQL. You can deploy your jar file with the job implementations to spark-jobserver and use the REST API to submit jobs in sync or async mode. In sync mode you need to poll SJS to get the job result; the job result might be actual data in JSON or a path

perl Kafka::Producer, “Kafka::Exception::Producer”, “code”, -1000, “message”, "Invalid argument

2016-02-29 Thread Vinti Maheshwari
Hi All, I wrote a Kafka producer using the Kafka Perl API, but I am getting an error when I pass a variable for sending the message, while if I hard-code the message data it doesn't give any error. Perl program, where I added the Kafka producer code: try { $kafka_c

RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
Hi Jules, folks, I have tried “hdfs://” as well as “file://”, and several variants. Every time, I get the same msg – NoClassDefFoundError. See below. Why do I get such a msg if the problem is simply that Spark cannot find the text file? Doesn't the error msg indicate some other source of the

Re: perl Kafka::Producer, “Kafka::Exception::Producer”, “code”, -1000, “message”, "Invalid argument

2016-02-29 Thread Cody Koeninger
Does this issue involve Spark at all? Otherwise you may have better luck on a perl or kafka related list. On Mon, Feb 29, 2016 at 3:26 PM, Vinti Maheshwari wrote: > Hi All, > > I wrote kafka producer using kafka perl api, But i am getting error when i > am passing variable for sending message w

Re: Spark on Windows platform

2016-02-29 Thread Steve Loughran
On 29 Feb 2016, at 13:40, gaurav pathak wrote: Thanks Jorn. Any guidance on how to get started with getting Spark on Windows is highly appreciated. Thanks & Regards Gaurav Pathak. You are at risk of seeing stack traces when you try to talk to the local

Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Yin Yang
RDDOperationScope is in spark-core_2.1x jar file. 7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class Can you check whether the spark-core jar is in classpath ? FYI On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C wrote: > Hi Jules, folks, > > > > I have tried “h

RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
Hi Yin, My CLASSPATH is set to: CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:. And there is indeed a spark-core jar in the ../jars subdirectory, though it is not named precisely “spark-core.jar”; it has a version number in its name, as

RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
I guess I should also point out that I do an export CLASSPATH in my .bash_profile file, so the CLASSPATH info should be usable by the PySpark shell that I invoke. Ron Ronald C. Taylor, Ph.D. Computational Biology & Bioinformatics Group Pacific Northwest National Laboratory (U.S. Dept of Energ

Re: Spark for client

2016-02-29 Thread Mich Talebzadeh
Thank you very much, both. Zeppelin looks promising. Basically, as I understand it, it runs an agent on a given port (I chose 21999) on the host where Spark is installed. I created a notebook and am running scripts through there. One thing for sure: the notebook just returns the results rather than all the other stuff that o

Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Sumona Routh
Hi there, I've been doing some performance tuning of our Spark application, which is using Spark 1.2.1 standalone. I have been using the spark metrics to graph out details as I run the jobs, as well as the UI to review the tasks and stages. I notice that after my application completes, or is near

Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Shixiong(Ryan) Zhu
Do you mean you cannot access Master UI after your application completes? Could you check the master log? On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh wrote: > Hi there, > I've been doing some performance tuning of our Spark application, which is > using Spark 1.2.1 standalone. I have been usin

DataSet Evidence

2016-02-29 Thread Steve Lewis
I have a relatively complex Java object that I would like to use in a Dataset. If I say Encoder evidence = Encoders.kryo(MyType.class); JavaRDD rddMyType = generateRDD(); // some code Dataset datasetMyType = sqlCtx.createDataset(rddMyType.rdd(), evidence); I get one column - the whole object
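That single column is the expected behavior of the kryo encoder, which packs the whole object into one binary blob. If MyType follows Java bean conventions, the bean encoder exposes its fields as separate columns instead; a sketch in Scala syntax against the same 1.6 API, reusing the names from the message:

    import org.apache.spark.sql.Encoders

    // one column per bean property, instead of one kryo-serialized blob
    val evidence = Encoders.bean(classOf[MyType])
    val datasetMyType = sqlCtx.createDataset(rddMyType.rdd())(evidence)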

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Shixiong(Ryan) Zhu
Could you post the screenshot of the Streaming DAG and also the driver log? It would be great if you have a simple producer for us to debug. On Mon, Feb 29, 2016 at 1:39 AM, Abhishek Anand wrote: > Hi Ryan, > > Its not working even after removing the reduceByKey. > > So, basically I am doing the

Error when trying to insert data to a Parquet data source in HiveQL

2016-02-29 Thread SRK
Hi, I seem to be getting the following error when I try to insert data into a parquet data source. Any idea as to why this is happening? org.apache.hadoop.hive.ql.metadata.HiveException: parquet.hadoop.MemoryManager$1: New Memory allocation 1045004 bytes is smaller than the minimum allocation size

Fwd: Mapper side join with DataFrames API

2016-02-29 Thread Deepak Gopalakrishnan
Hello All, I'm trying to join 2 DataFrames A and B with sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a=B.a"); Now, what I have done is register temp tables for A and B after loading these DataFrames from different sources. I need the join to be really fast and I was wondering i
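One common way to get a map-side join with the DataFrames API (assuming B is small enough to fit in executor memory) is the broadcast hint, available since Spark 1.5; a sketch, with dfA and dfB standing in for the registered tables:

    import org.apache.spark.sql.functions.broadcast

    // broadcasting B ships it whole to each executor, so no shuffle is needed
    val joined = dfA.join(broadcast(dfB), dfA("a") === dfB("a"))

The SQL-string route can also broadcast automatically when size statistics are available and below spark.sql.autoBroadcastJoinThreshold.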

Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Sea
Hi, Sumona: It's a bug in older Spark versions; in Spark 1.6.0 it is fixed. After the application completes, the Spark master loads the event log into memory, and this is synchronous because of the actor. If the event log is big, the Spark master will hang for a long time, and you can not submit any applications,

RE: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Mohammed Guller
I believe the OP is referring to the application UI on port 4040. The application UI on port 4040 is available only while the application is running. As per the documentation: to view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures Spark
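Concretely, that means something like this before the SparkContext is created (a sketch; the log directory is an example and must be readable by the history server):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events") // example location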

SaveMode, parquet and S3

2016-02-29 Thread Peter Halliday
I have a system where I’m saving parquet files to S3 via Spark. They are partitioned a couple of ways, first by date and then by a partition key. There are multiple parquet files per combination over a long period of time. So the structure is like this: s3://bucketname/date=2016-02-29/partionke

Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-29 Thread Zhun Shen
Hi, I checked the dependencies and fixed the bug. It works well on Spark but not on Spark Streaming, so I think I still need to find another way to do it. > On Feb 26, 2016, at 2:47 PM, Zhun Shen wrote: > > Hi, > > thanks for you advice. I tried your method, I use Gradle to manage my scala > code.

[ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Divya Gehlot
Hi, I am getting an error when I try to connect to a hive table (which is being created through HbaseIntegration) in Spark. Steps I followed: *Hive Table creation code *: CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Ted Yu
16/02/29 23:09:34 INFO ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) Is your cluster a secure cluster? bq. Trace : Was there any output after 'Trace :'? Was hbase-site.xml accessible to your Spark job

Re: perl Kafka::Producer, “Kafka::Exception::Producer”, “code”, -1000, “message”, "Invalid argument

2016-02-29 Thread Vinti Maheshwari
Hi Cody, Sorry, I realized afterwards that I should not have asked here. My actual program is Spark Streaming and I use Kafka for the input stream. Thanks, Vinti On Mon, Feb 29, 2016 at 1:46 PM, Cody Koeninger wrote: > Does this issue involve Spark at all? Otherwise you may have better luck > on a perl

RE: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-29 Thread Silvio Fiorito
I’ve used the code below with Spark SQL. I was using this with Spark 1.4, but it should still be good with 1.6. In this case I have a UDF to do the lookup, but for Streaming you’d just have a lambda to apply in a map function, so no UDF wrapper. import org.apache.spark.sql.functions._ import java.i
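Since the code in the archive is cut off, here is a sketch of that shape (the MaxMind GeoIP2 reader and the database path are assumptions): hold the non-serializable reader in an object so each executor JVM opens it once, and wrap the lookup in a UDF.

    import java.io.File
    import java.net.InetAddress
    import com.maxmind.geoip2.DatabaseReader
    import org.apache.spark.sql.functions.udf

    object GeoIP {
      // lazily opened once per JVM; DatabaseReader itself is not serializable
      lazy val reader =
        new DatabaseReader.Builder(new File("/path/GeoLite2-City.mmdb")).build()
      def city(ip: String): String =
        reader.city(InetAddress.getByName(ip)).getCity.getName
    }

    val cityOfIp = udf(GeoIP.city _)

For streaming, the same GeoIP.city call can go straight into a map over the DStream instead of a UDF.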

Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
I have created a JIRA for this feature; comments and feedback are welcome about how to improve it and whether it's valuable for users. https://issues.apache.org/jira/browse/SPARK-13587 Here's some background info and the status of this work. Currently, it's not easy for users to add third-party pyth

Update edge weight in graphx

2016-02-29 Thread naveen.marri
Hi, I'm trying to implement an algorithm using GraphX which involves updating edge weights during every iteration. The format is [Node]-[Node]--[Weight]. Ex: I checked the GraphX docs but didn't find any resources on changing the weight of an edge within the same RDD. I know RDDs a
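Since RDDs (and therefore GraphX graphs) are immutable, the usual approach is to derive a new Graph each iteration rather than mutate in place, e.g. with mapEdges; a sketch (assuming a SparkContext sc; the edge values and decay factor are made up):

    import org.apache.spark.graphx.{Edge, Graph}

    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0)))
    var graph = Graph.fromEdges(edges, defaultValue = 0.0)
    for (_ <- 1 to 10) {
      // each pass produces a new Graph with reweighted edge attributes
      graph = graph.mapEdges(e => e.attr * 0.9)
    }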

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Ted Yu
16/02/29 23:09:34 INFO ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=9 watcher=hconnection-0x26fa89a20x0, quorum=localhost:2181, baseZNode=/hbase Since baseZNode didn't match what you set in hbase-site.xml, the cause was likely that hbase-site.xml being i

Spark Streaming: java.lang.StackOverflowError

2016-02-29 Thread Vinti Maheshwari
Hi All, I am getting the error below in a spark-streaming application; I am using Kafka for the input stream. When I was doing this with a socket it was working fine, but when I changed to Kafka it gives the error. Does anyone have an idea why it's throwing the error? Do I need to change my batch time and checkpointing time?

Save DataFrame to Hive Table

2016-02-29 Thread Yogesh Vyas
Hi, I have created a DataFrame in Spark; now I want to save it directly into a Hive table. How do I do it? I have created the Hive table using the following HiveContext: HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc()); hiveContext.sql("CREATE TABLE IF NOT EXISTS

Re: Save DataFrame to Hive Table

2016-02-29 Thread Jeff Zhang
The following line does not execute the sql so the table is not created. Add .show() at the end to execute the sql. hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)") On Tue, Mar 1, 2016 at 2:22 PM, Yogesh Vyas wrote: > Hi, > > I have created a DataFrame in Spark, n
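As a follow-up, the DataFrame writer can also create or populate the Hive table directly, without a separate CREATE TABLE (Spark 1.4+; a sketch in Scala syntax, with df standing in for the DataFrame from the question):

    // creates the table from the DataFrame's schema, appending if it exists
    df.write.mode("append").saveAsTable("TableName")
    // or, into an already-created table with a matching schema:
    df.write.insertInto("TableName")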