Re: Is Spark right for us?

2016-03-07 Thread Jörn Franke
I think a relational database will be faster for ordinal data (e.g. where you answer from 1..x). For free-text fields I would recommend Solr or Elasticsearch, because they have a lot more text analytics capabilities that do not exist in a relational database or MongoDB and are not likely to be

Re: Understanding the Web_UI 4040

2016-03-07 Thread Sonal Goyal
Maybe check the worker logs to see what's going wrong with it? On Mar 7, 2016 9:10 AM, "Angel Angel" wrote: > Hello Sir/Madam, > > > I am running the spark-sql application on the cluster. > In my cluster there are 3 slaves and one Master. > > When i saw the progress of my application in web UI ha

Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Divya Gehlot
Hi, Could somebody help me by providing the steps / redirect me to a blog/documentation on how to run a Spark job written in Scala through Oozie? Would really appreciate the help. Thanks, Divya

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Deepak Sharma
There is a Spark action defined for Oozie workflows, though I am not sure whether it supports only Java Spark jobs or Scala jobs as well. https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html Thanks Deepak On Mon, Mar 7, 2016 at 2:44 PM, Divya Gehlot wrote: > Hi, > > Could somebody help me b

Re: Understanding the Web_UI 4040

2016-03-07 Thread Mark Hamstra
There's probably nothing wrong other than a glitch in the reporting of Executor state transitions to the UI -- one of those low-priority items I've been meaning to look at for a while. On Mon, Mar 7, 2016 at 12:15 AM, Sonal Goyal wrote: > Maybe check the worker logs to see what's going wrong w

PYSPARK_PYTHON doesn't work in spark worker

2016-03-07 Thread guoqing0...@yahoo.com.hk
Hi all, I have the following configuration in the Spark worker (spark-env.sh): export PYTHON_HOME=/opt/soft/anaconda2 export PYSPARK_PYTHON=$PYTHON_HOME/bin/python I'm trying to run a simple test script on pyspark --master yarn --queue spark --executor-cores 1 --num-executors 10 from pyspark import SparkCont

Re: PYSPARK_PYTHON doesn't work in spark worker

2016-03-07 Thread Gourav Sengupta
Hi, how are you running your Spark cluster (is it in local mode or distributed mode)? Do you have pyspark installed in Anaconda? Regards, Gourav Sengupta On Mon, Mar 7, 2016 at 9:28 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi all > I had following configuration in spa

Editing spark.ml package code in Pyspark

2016-03-07 Thread Khaled Ali
Is it possible to edit some code inside the spark.ml package in pyspark? E.g. I am calculating TF-IDF using pyspark ML, but I would like to make a small change to the inverse document frequency equation, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)). How can I do that?

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
Hi, So I managed to isolate the bug and I'm ready to try raising a JIRA issue. I joined the Apache JIRA project so I can create tickets. However, when I click Create from the Spark project home page on JIRA, it asks me to click on one of the following service desks: Kylin, Atlas, Ranger, Apache In

Re: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow

2016-03-07 Thread dmt
I have found why the exception is raised. I have defined a JSON schema, using org.apache.spark.sql.types.StructType, that expects this kind of record: { "request": { "user": { "id": 123 } } } There's a bad record in my dataset that defines field "user" as an array, instead of

Re: Spark Aggregations/Joins

2016-03-07 Thread Ricardo Paiva
Have you ever tried to use join? Both RDD and DataFrame have this method, and it does a join like a traditional relational database does. On Sat, Mar 5, 2016 at 3:17 AM, Agro [via Apache Spark User List] < ml-node+s1001560n26403...@n3.nabble.com> wrote: > So, initially, I have an RDD[Int] that I've
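
For illustration, a minimal sketch of both flavours of join, assuming a spark-shell session where sc and sqlContext are available (the keys and column names are made up):

    // Pair-RDD join: yields (key, (left, right)) for keys present on both sides
    val scores = sc.parallelize(Seq((1, 0.9), (2, 0.5)))
    val ages = sc.parallelize(Seq((1, 34), (2, 27)))
    val joinedRdd = scores.join(ages)            // RDD[(Int, (Double, Int))]

    // DataFrame join on a common column (Spark 1.x style)
    import sqlContext.implicits._
    val scoresDf = scores.toDF("id", "score")
    val agesDf = ages.toDF("id", "age")
    val joinedDf = scoresDf.join(agesDf, "id")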

Re: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow

2016-03-07 Thread dmt
Is there a workaround? My dataset contains billions of rows, and it would be nice to ignore/exclude the few lines that are badly formatted.
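
One possible workaround (my own sketch, not from the thread): pre-filter the raw JSON lines with a cheap textual test before applying the strict schema, assuming the offending records can be recognised that way. The path and the exact filter condition are illustrative; this assumes a spark-shell session where sc and sqlContext exist.

    import org.apache.spark.sql.types._

    // Schema matching the record shape described in the earlier message
    val schema = StructType(Seq(
      StructField("request", StructType(Seq(
        StructField("user", StructType(Seq(
          StructField("id", LongType)))))))))

    val raw = sc.textFile("hdfs:///path/to/dataset")                 // illustrative path
    // Drop lines where "user" appears as a JSON array instead of an object
    val cleaned = raw.filter(line => !line.contains("\"user\": ["))
    val df = sqlContext.read.schema(schema).json(cleaned)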

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread Ted Yu
Have you tried clicking on Create button from an existing Spark JIRA ? e.g. https://issues.apache.org/jira/browse/SPARK-4352 Once you're logged in, you should be able to select Spark as the Project. Cheers On Mon, Mar 7, 2016 at 2:54 AM, James Hammerton wrote: > Hi, > > So I managed to isolate

using MongoDB Tailable Cursor in Spark Streaming

2016-03-07 Thread Shams ul Haque
Hi, I want to implement streaming using a MongoDB tailable cursor. Please give me a hint on how I can do this. I think I have to extend some class and use its methods to do the stuff. Please give me a hint. Thanks and regards Shams ul Haque

Re: reading the parquet file in spark sql

2016-03-07 Thread Manoj Awasthi
From the parquet file content (dir content) it doesn't look like the parquet write was successful or complete. On Mon, Mar 7, 2016 at 11:17 AM, Angel Angel wrote: > Hello Sir/Madam, > > I am running one spark application having 3 slaves and one master. > > I am writing my information using t

[Streaming] Difference between windowed stream and stream with large batch size?

2016-03-07 Thread Hao Ren
I want to understand the advantage of using a windowed stream. For example, Stream 1: initial duration = 5 s, and then transformed into a stream windowed by (windowLength = 30s, slideInterval = 30s). Stream 2: duration = 30 s. Questions: 1. Is Stream 1 equivalent to Stream 2 in behavior? Do u

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Benjamin Kim
To comment… At my company, we have not gotten it to work in any other mode than local. If we try any of the yarn modes, it fails with a “file does not exist” error when trying to locate the executable jar. I mentioned this to the Hue users group, which we used for this, and they replied that th

Re: how to implement ALS with csv file? getting error while calling Rating class

2016-03-07 Thread Kevin Mellott
If you are using DataFrames, then as an alternative solution you can also specify the schema when loading. I've found Spark-CSV to be a very useful library when working with CSV data. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.
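
A hedged sketch of that approach with the spark-csv package (the column names, types, and file path are assumptions for illustration; sqlContext as in a spark-shell session):

    import org.apache.spark.sql.types._
    import org.apache.spark.mllib.recommendation.Rating

    val ratingSchema = StructType(Seq(
      StructField("userId", IntegerType, nullable = false),
      StructField("movieId", IntegerType, nullable = false),
      StructField("rating", DoubleType, nullable = false)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(ratingSchema)
      .load("ratings.csv")                                    // illustrative path

    // Map rows into the MLlib Rating class for ALS
    val ratings = df.map(r => Rating(r.getInt(0), r.getInt(1), r.getDouble(2)))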

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Neelesh Salian
Hi Divya, This link should have the details that you need to begin using the Spark Action on Oozie: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html Thanks. On Mon, Mar 7, 2016 at 7:52 AM, Benjamin Kim wrote: > To comment… > > At my company, we have not gotten it to work in any

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Chandeep Singh
As a workaround you could put your spark-submit statement in a shell script and then use Oozie’s SSH action to execute that script. > On Mar 7, 2016, at 3:58 PM, Neelesh Salian wrote: > > Hi Divya, > > This link should have the details that you need to begin using the Spark > Action on Oozie

Re: Is Spark right for us?

2016-03-07 Thread Guillaume Bilodeau
Hi everyone, First thanks for taking some time on your Sunday to reply. Some points in no particular order: - The feedback from everyone tells me that I have a lot of reading to do first. Thanks for all the pointers. - The data is currently stored in a row-oriented database (SQL Server 2012 to

Setting PYSPARK_PYTHON in spark-env.sh vs from driver program

2016-03-07 Thread Kostas Chalikias
All - would appreciate some insight regarding how to set PYSPARK_PYTHON correctly. I have created a virtual environment in the same place for all 3 of my cluster hosts, 2 of them running slaves and one running a master. I also run an RPC server on the master host to allow users from the office

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
Hi Ted, Thanks for getting back - I realised my mistake... I was clicking the little drop-down menu on the right-hand side of the Create button (it looks as if it's part of the button) - when I clicked directly on the word "Create" I got a form that made more sense and allowed me to choose the pro

Re: Is Spark right for us?

2016-03-07 Thread Mich Talebzadeh
Hi, Have you looked at SSAS and cubes? Your 100GB is nothing; even a Mickey Mouse SQL Server should handle that :) Remember also that you are moving from the Windows platform to Linux, which may involve additional training as well. Another attractive option (well, I do not know the nature of your queries) i

Saving Spark generated table into underlying Hive table using Functional programming

2016-03-07 Thread Mich Talebzadeh
Hi, I have done this in the Spark shell and in Hive itself, so it works. I am exploring whether I can do it programmatically. The problem I encountered was that I tried to register the DF as a temporary table. The problem is that when trying to insert from the temporary table into the Hive table, I was getting the followin

Re: Is Spark right for us?

2016-03-07 Thread Laumegui Deaulobi
Thanks for your input. That 1 hour per data point would actually be a problem, since sometimes we have reports with 100s of data points and we need to generate 100,000 reports. So we definitely need to distribute this, but I don't know where to start with this unfortunately. On Mon, Mar 7, 2016 at 2:42 P

Updating reference data once a day in Spark Streaming job

2016-03-07 Thread Karthikeyan Muthukumar
Hi, We have reference data pulled in from an RDBMS through a Sqoop job, this reference data is pulled into the Analytics platform once a day. We have a Spark Streaming job, where at job bootup we read the reference data, and then join this reference data with continuously flowing event data. When t

Re: Saving Spark generated table into underlying Hive table using Functional programming

2016-03-07 Thread Holden Karau
So what if you just start with a HiveContext, and create your DF using the HiveContext? On Monday, March 7, 2016, Mich Talebzadeh wrote: > Hi, > > I have done this Spark-shell and Hive itself so it works. > > I am exploring whether I can do it programmatically. The problem I > encounter w
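
A minimal sketch of that suggestion for a standalone program (the database/table/column names are assumptions, and the target Hive table is assumed to already exist):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("SaveToHive"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // DF created through the HiveContext, so the temp table is visible to it
    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
    df.registerTempTable("tmp_data")
    hiveContext.sql("INSERT INTO TABLE mydb.target_table SELECT id, value FROM tmp_data")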

Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
I recently tried to train a model using org.apache.spark.ml.classification.LogisticRegression on a data set where the feature vector size was around 20 million. It did *not* go well. It took around 10 hours to train on a substantial cluster. Additionally, it pulled a lot of data back to the driver - I even

Re: Saving Spark generated table into underlying Hive table using Functional programming

2016-03-07 Thread Mich Talebzadeh
This is the code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
// object ImportCS

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Devin Jones
Hi, Which data structure are you using to train the model? If you haven't tried yet, you should consider the SparseVector http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann wrote: > I recently tri
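
For reference, a small sketch of constructing such a vector with the MLlib linalg API (the size, indices, and values are made up):

    import org.apache.spark.mllib.linalg.Vectors

    val numFeatures = 20000000    // a ~20M-wide feature space
    // Only the non-zero indices and their values are stored
    val features = Vectors.sparse(numFeatures, Array(3, 1048576, 19999999), Array(1.0, 2.5, 0.5))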

streaming app performance when would increasing execution size or adding more cores

2016-03-07 Thread Andy Davidson
We just deployed our first streaming apps. The next step is running them so they run reliably. We have spent a lot of time reading the various programming guides and looking at the standalone cluster manager's app performance web pages. Looking at the streaming tab and the stages tab has been the most helpful

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
Yes, it is a SparseVector. Most rows only have a few features, and all the rows together only have tens of thousands of features, but the vector size is ~ 20 million because that is the largest feature. On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones wrote: > Hi, > > Which data structure are you usi

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Devin Jones
I could be wrong, but it's possible that toDF populates a DataFrame, which I understand does not support SparseVectors at the moment. If you use the MLlib logistic regression implementation (not ml) you can pass the RDD[LabeledPoint] data type directly to the learner. http://spark.apache.org/docs/late
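
A hedged sketch of that RDD-based route (the feature indices and values are made up; assumes sc from a spark-shell session):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val numFeatures = 20000000
    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(0, 42), Array(1.0, 1.0))),
      LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(7, 99), Array(1.0, 2.0)))))

    // RDD-based MLlib learner; takes LabeledPoints with sparse vectors directly
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)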

how to implement and deploy robust streaming apps

2016-03-07 Thread Andy Davidson
One of the challenges we need to prepare for with streaming apps is bursty data. Typically we need to estimate our worst-case data load and make sure we have enough capacity. It is not obvious what the best practices are with Spark Streaming. * we have implemented checkpointing as described in the prog
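
As a point of reference, a minimal sketch of the checkpoint-and-recover driver pattern from the programming guide (the checkpoint directory and batch interval are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // illustrative path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("resilient-streaming-app")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)
      // define input DStreams and transformations here
      ssc
    }

    // Rebuilds the context from the checkpoint after a driver restart,
    // or creates a fresh one on first run
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()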

Re: streaming app performance when would increasing execution size or adding more cores

2016-03-07 Thread Igor Berman
Maybe you are experiencing a problem with the FileOutputCommitter vs the direct committer while working with S3? Do you have HDFS so you can try it to verify? Committing to S3 will copy all partitions 1-by-1 to your final destination bucket from _temporary, so this stage might become a bottleneck (so reducing

streaming will I lose data if spark.streaming.backpressure.enabled=true

2016-03-07 Thread Andy Davidson
http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications gives a brief discussion about max rate and back pressure. It's not clear to me what will happen. I use an unreliable receiver. Let's say my app is running and process time is less than the window length. Happy

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Michał Zieliński
We're using SparseVector columns in a DataFrame, so they are definitely supported. But maybe for LR some implicit magic is happening inside. On 7 March 2016 at 23:04, Devin Jones wrote: > I could be wrong but its possible that toDF populates a dataframe which I > understand do not support sparse

OOM exception during Broadcast

2016-03-07 Thread Arash
Hello all, I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes but haven't been able to make it work so far. It looks like the executors start to run out of memory during deserialization. This behavior only shows itself when the number of partitions is above a few 10s, the bro

Re: OOM exception during Broadcast

2016-03-07 Thread Jeff Zhang
Any reason why you broadcast such a large variable? It doesn't make sense to me. On Tue, Mar 8, 2016 at 7:29 AM, Arash wrote: > Hello all, > > I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes > but haven't been able to make it work so far. > > It looks like the executors

Re: Saving Spark generated table into underlying Hive table using Functional programming

2016-03-07 Thread Mich Talebzadeh
Ok, I solved the problem. When one uses spark-shell it starts with a HiveContext, so things work. The caveat is that any Spark temp table created with registerTempTable("TABLE") has to be queried by sqlContext.sql, otherwise that table is NOT visible to HiveContext.sql. To make this work with project
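
A small illustration of that caveat (my own sketch, names are illustrative): the temp table is only visible to the SQLContext/HiveContext the DataFrame was created from, so register and query it through the same context.

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext.implicits._

    val df = sc.parallelize(Seq((1, "a"))).toDF("id", "value")   // tied to hiveContext
    df.registerTempTable("tmp")
    hiveContext.sql("SELECT * FROM tmp").show()                  // visible here
    // A different SQLContext instance would typically not see "tmp":
    // new org.apache.spark.sql.SQLContext(sc).sql("SELECT * FROM tmp")   // would fail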

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
Well, I'm trying to avoid a big shuffle/join. From what I could find online, my understanding was that a 1G broadcast should be doable; is that not accurate? On Mon, Mar 7, 2016 at 3:34 PM, Jeff Zhang wrote: > Any reason why do you broadcast such large variable ? It doesn't make > sense to me > >

Job Jar files located in s3, driver never starts the job

2016-03-07 Thread Scott Reynolds
Following the documentation on spark-submit, http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit - application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cl

Re: OOM exception during Broadcast

2016-03-07 Thread Ankur Srivastava
Hi, We have a use case where we broadcast ~4GB of data and we are on m3.2xlarge, so your object size is not an issue. Also, based on your explanation, it does not look like a broadcast issue, as it works when your partition size is small. Are you caching any other data? Because broadcast variables use th

Adding hive context gives error

2016-03-07 Thread Suniti Singh
Hi All, I am trying to create a HiveContext in a Scala program as follows in Eclipse. Note -- I have added the Maven dependencies for spark-core, hive, and sql. import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD.rddToPairRDDFunctions object D

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
Hi Ankur, For this specific test, I'm only running the few lines of code that are pasted. Nothing else is cached in the cluster. Thanks, Arash On Mon, Mar 7, 2016 at 4:07 PM, Ankur Srivastava wrote: > Hi, > > We have a use case where we broadcast ~4GB of data and we are on > m3.2xlarge so your

Re: Adding hive context gives error

2016-03-07 Thread Mich Talebzadeh
I tend to use SBT to build Spark programs. This works for me:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.ap

Re: Adding hive context gives error

2016-03-07 Thread Kabeer Ahmed
I use SBT and I have never included spark-sql. The simple 2 lines in SBT are as below:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  "org.apache.spark" %% "spark-hive" % "1.5.0"
)
However, I do note that you are using the spark-sql include and the Spark version

Re: Adding hive context gives error

2016-03-07 Thread Suniti Singh
Thanks Mich and Kabeer for the quick reply. @Kabeer - I removed the spark-sql dependency and all the errors are gone. But I am surprised to see this behaviour. Why is the spark-sql lib an issue when including the HiveContext? Regards, Suniti On Mon, Mar 7, 2016 at 4:34 PM, Kabeer Ahmed wrote: > I

Re: Adding hive context gives error

2016-03-07 Thread Mich Talebzadeh
Hi Kabeer, I have not used Eclipse for Spark/Scala although I have played with it. As a matter of interest, when you set up an Eclipse project do you add external JARs to Eclipse from $SPARK_HOME/lib only? Thanks Dr Mich Talebzadeh

Re: Adding hive context gives error

2016-03-07 Thread Mich Talebzadeh
Sorry, that should have been addressed to Suniti. Dr Mich Talebzadeh

Re: Adding hive context gives error

2016-03-07 Thread Suniti Singh
We do not need to add the external JARs to Eclipse if Maven is used as a build tool, since the Spark dependencies in the POM file will take care of it. On Mon, Mar 7, 2016 at 4:50 PM, Mich Talebzadeh wrote: > Hi Kabeer, > > I have not used eclipse for Spark/Scala although I have played with it. > > A

Re: Setting PYSPARK_PYTHON in spark-env.sh vs from driver program

2016-03-07 Thread Jeff Zhang
Hi Kostas, The executor's PYSPARK_PYTHON environment variable is not propagated from the worker; it comes from the driver. On Tue, Mar 8, 2016 at 1:04 AM, Kostas Chalikias wrote: > All - would appreciate some insight regarding how to set PYSPARK_PYTHON > correctly. > > I have created a virtual environment in

Re: Adding hive context gives error

2016-03-07 Thread Suniti Singh
Yeah, I realized it and changed the version to 1.6.0 as mentioned in http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.6.0 I added the spark-sql dependency back to the pom.xml and the Scala code works just fine. On Mon, Mar 7, 2016 at 5:00 PM, Tristan Nixon wrote: > Hi

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-07 Thread Andy Davidson
Hi Vinti, I use the standalone cluster. The mgmt console provides a link to an app UI. It has all sorts of performance info. There should be a 'stages' tab. You can use it to find bottlenecks. Note the links in the mgmt console do not seem to work. The app UI runs on the same machine as the driv

converting to map partitions

2016-03-07 Thread dizzy5112
Hi, I'm trying to see if I can get the following piece of code to perform a little better. Currently I use a collect to get a val (localCollection) and then loop through each of the cogroups in this array to do some work. Can I use a mapPartitions on this rather than a collect? If so, I'm completely stuck
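
For what it's worth, a hedged sketch of the mapPartitions shape with made-up data: the per-cogroup work runs on the executors and only the (smaller) results are collected.

    val left = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", 10)))
    val cogrouped = left.cogroup(right)    // RDD[(String, (Iterable[Int], Iterable[Int]))]

    // Process each partition's iterator on the executors instead of
    // collecting the whole cogrouped RDD to the driver and looping there
    val results = cogrouped.mapPartitions { iter =>
      iter.map { case (key, (lhs, rhs)) =>
        (key, lhs.sum + rhs.sum)           // placeholder for the real per-cogroup work
      }
    }
    results.collect()                      // only the small summary reaches the driver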

Re: OOM exception during Broadcast

2016-03-07 Thread Takeshi Yamamuro
Hi, I think the broadcast logic itself works fine regardless of input size (no idea about its efficiency). How about the memory size of your driver? When you broadcast some large variables, the driver eats a lot of memory splitting the variable into blocks and serializing them. Thanks, maropu On Tu

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
Hi Takeshi, we have the driver memory set to 40G. I don't think the driver is having memory issues. It looks like the executors start to fail due to memory issues and then the broadcast p2p algorithm starts to fail. On Mon, Mar 7, 2016 at 5:36 PM, Takeshi Yamamuro wrote: > Hi, > > I think a broadc

Re: OOM exception during Broadcast

2016-03-07 Thread Tristan Nixon
Hi Arash, is this static data? Have you considered including it in your jars and de-serializing it from jar on each worker node? It’s not pretty, but it’s a workaround for serialization troubles. > On Mar 7, 2016, at 5:29 PM, Arash wrote: > > Hello all, > > I'm trying to broadcast a variable

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
Hi Tristan, This is not static, I actually collect it from an RDD to the driver. On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon wrote: > Hi Arash, > > is this static data? Have you considered including it in your jars and > de-serializing it from jar on each worker node? > It’s not pretty, but

Re: OOM exception during Broadcast

2016-03-07 Thread Takeshi Yamamuro
Oh, how about increasing the broadcast block size via spark.broadcast.blockSize? The default size is `4m` and it is too small against ~1GB, I think. On Tue, Mar 8, 2016 at 10:44 AM, Arash wrote: > Hi Tristan, > > This is not static, I actually collect it from an RDD to the driver. > > On Mon, Mar 7
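
A quick sketch of where that setting would go, assuming it is set when the application's SparkConf is built (the value here is just an example):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("broadcast-test")
      .set("spark.broadcast.blockSize", "32m")   // default is 4m
    val sc = new SparkContext(conf)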

Re: OOM exception during Broadcast

2016-03-07 Thread Tristan Nixon
I’m not sure I understand - if it was already distributed over the cluster in an RDD, why would you want to collect and then re-send it as a broadcast variable? Why not simply use the RDD that is already distributed on the worker nodes? > On Mar 7, 2016, at 7:44 PM, Arash wrote: > > Hi Trista

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
So I just implemented the logic through a standard join (without collect and broadcast) and it's working great. The idea behind trying the broadcast was that since the other side of join is a much larger dataset, the process might be faster through collect and broadcast, since it avoids the shuffl

Re: OOM exception during Broadcast

2016-03-07 Thread Tristan Nixon
Yeah, the Spark engine is pretty clever and it's best not to prematurely optimize. It would be interesting to profile your join vs. the collect on the smaller dataset. I suspect that the join is faster (even before you broadcast it back out). I’m also curious about the broadcast OOM - did you t

Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Josh Rosen
Does anyone implement Spark's serializer interface (org.apache.spark.serializer.Serializer) in your own third-party code? If so, please let me know because I'd like to change this interface from a DeveloperAPI to private[spark] in Spark 2.0 in order to do some cleanup and refactoring. I think that

Re: Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Koert Kuipers
We are not, but it seems reasonable to me that a user should have the ability to implement their own serializer. Can you refactor and break compatibility, but not make it private? On Mon, Mar 7, 2016 at 9:57 PM, Josh Rosen wrote: > Does anyone implement Spark's serializer interface > (org.apache.spark.

Re: Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Ted Yu
Josh: SerializerInstance and SerializationStream would also become private[spark], right ? Thanks On Mon, Mar 7, 2016 at 6:57 PM, Josh Rosen wrote: > Does anyone implement Spark's serializer interface > (org.apache.spark.serializer.Serializer) in your own third-party code? If > so, please let m

Re: Spark streaming from Kafka best fit

2016-03-07 Thread pratik khadloya
Would using mapPartitions instead of map help here? ~Pratik On Tue, Mar 1, 2016 at 10:07 AM Cody Koeninger wrote: > You don't need an equal number of executor cores to partitions. An > executor can and will work on multiple partitions within a batch, one after > the other. The real issue is w

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
The driver memory is set at 40G and OOM seems to be happening on the executors. I might try a different broadcast block size (vs 4m) as Takeshi suggested to see if it makes a difference. On Mon, Mar 7, 2016 at 6:54 PM, Tristan Nixon wrote: > Yeah, the spark engine is pretty clever and its best n

overwriting a spark output using pyspark

2016-03-07 Thread Devesh Raj Singh
I am trying to overwrite a Spark DataFrame output using the following option but I am not successful: spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path) The mode='overwrite' option is not successful. -- Warm regards, Devesh.
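
For comparison, a hedged sketch of the usual DataFrameWriter pattern (shown in Scala to match the other examples here; the PySpark writer has the same shape): the save mode is set with .mode(...) on the writer rather than passed to option(...). The data and output path are made up, assuming a spark-shell style session.

    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .mode("overwrite")                     // overwrite any existing output
      .save("/tmp/output_csv")               // illustrative path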

How to compile Spark with private build of Hadoop

2016-03-07 Thread Lu, Yingqi
Hi All, I am new to Spark and I have a question regarding compiling Spark. I modified the trunk version of the Hadoop source code. How can I compile Spark (standalone mode) with my modified version of Hadoop (HDFS, Hadoop-common, etc.)? Thanks a lot for your help! Thanks, Lucy

Re: How to compile Spark with private build of Hadoop

2016-03-07 Thread fightf...@163.com
I think you can establish your own Maven repository and deploy your modified Hadoop binary jar with your modified version number. Then you can add your repository to the Spark pom.xml and use mvn -Dhadoop.version= fightf...@163.com From: Lu, Yingqi Date: 2016-03-08 15:09 To: user@spark.apache.

Spark Partitioner vs Spark Shuffle Manager

2016-03-07 Thread Prabhu Joseph
Hi All, What is the difference between the Spark Partitioner and the Spark Shuffle Manager? The Spark Partitioner is by default a hash partitioner, and the Spark shuffle manager is sort-based; the others are hash and Tungsten sort. Thanks, Prabhu Joseph
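
To make the distinction concrete, a small sketch (my own illustration, not from the thread): the partitioner decides which partition a given key lands in for a specific RDD, while the shuffle manager is an application-level setting that decides how shuffle data is written and fetched.

    import org.apache.spark.HashPartitioner

    // Partitioner: controls the key -> partition mapping for this RDD
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val repartitioned = pairs.partitionBy(new HashPartitioner(8))

    // Shuffle manager: a configuration choice, e.g. on the SparkConf
    // ("sort" is the default; "hash" and "tungsten-sort" are the alternatives)
    // conf.set("spark.shuffle.manager", "sort")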

Spark Twitter streaming

2016-03-07 Thread Soni spark
Hello friends, I need urgent help. I am using Spark Streaming to get tweets from Twitter and loading the data into HDFS. I want to find out the tweet source, whether it is from the web, mobile web, Facebook, etc. Could you please help me with the logic? Thanks Soniya