query on Spark Log directory

2017-01-05 Thread Divya Gehlot
Hi, I am using an EMR machine and I can see that the Spark log directory has grown to 4G. File name: spark-history-server.out. Need advice on how I can reduce the size of the above mentioned file. Is there a config property which can help me? Thanks, Divya
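One possible direction (an assumption, since the post does not show the EMR logging setup): the .out file collects whatever the daemon writes to stdout/stderr, so routing Spark's log4j output to a size-bounded rolling file instead of the console keeps it from growing without limit. A minimal log4j.properties sketch, with placeholder path and sizes:

    log4j.rootCategory=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.File=/var/log/spark/spark-history-server.log
    log4j.appender.rolling.MaxFileSize=100MB
    log4j.appender.rolling.MaxBackupIndex=5
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n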

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread neil90
Assuming you don't have your environment variables set up in your .bash_profile, you would do it like this:

    import os
    import sys
    spark_home = '/usr/local/spark'
    sys.path.insert(0, spark_home + "/python")
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.1-src.zip'))
    #os.environ['P
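A fuller, self-contained sketch of the same setup (the Spark home path and py4j zip name are assumptions to adjust for your install):

    import os
    import sys

    spark_home = '/usr/local/spark'  # assumed install location
    os.environ['SPARK_HOME'] = spark_home
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.1-src.zip'))

    from pyspark.sql import SparkSession

    # start a local session inside the notebook and run a trivial sanity check
    spark = SparkSession.builder.master('local[*]').appName('jupyter-test').getOrCreate()
    print(spark.range(5).count())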

Re: Setting Spark Properties on Dataframes

2017-01-05 Thread neil90
Can you be more specific on what you would want to change on the DF level?

Spark java with Google Store

2017-01-05 Thread Manohar753
Hi Team, Can someone please share any examples of Spark Java reading and writing files from Google Store? Thank you in advance.

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
From the stack it looks to be an error from the explicit call to hadoop.fs.FileSystem. Is the URL scheme for s3n registered? Does it work when you try to read from s3 from Spark? From: Ankur Srivastava <ankur.srivast...@gmail.com> Sent: Wednesday, January

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Palash Gupta
Hi Marco and respected members, I have done all the possible things suggested by the forum but I'm still having the same issue: 1. I will migrate my applications to a production environment where I will have more resources. Palash>> I migrated my application to production where I have more CPU cores, memory

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Marco Mistroni
Hi, if it only happens when you run 2 apps at the same time, could it be that these 2 apps somehow run on the same host? Kr On 5 Jan 2017 9:00 am, "Palash Gupta" wrote:

Spark Read from Google store and save in AWS s3

2017-01-05 Thread Manohar753
Hi All, Using Spark, is interoperability/communication between two clouds (Google, AWS) possible? In my use case I need to take Google store as input to Spark, do some processing, and finally store the result in S3; my Spark engine runs on an AWS cluster. Please let me know if there is any way for this

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Palash Gupta
Hi Marco, Yes, it was on the same host when the problem was found. Even when I tried to start with a different host, the problem is still there. Any hints or suggestions will be appreciated. Thanks & Best Regards, Palash Gupta From: Marco Mistroni To: Palash Gupta Cc: ayan guha ; User Sent:

ToLocalIterator vs collect

2017-01-05 Thread Rohit Verma
Hi all, I am aware that collect will return a list aggregated on the driver, and this will cause OOM when we have a too-big list. Is toLocalIterator safe to use with a very big list? I want to access all values one by one. Basically the goal is to compare two sorted rdds (A and B) to find the top k entries
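For illustration, a minimal sketch of the difference (the data here is made up): toLocalIterator pulls one partition at a time to the driver, so peak driver memory is roughly one partition rather than the whole dataset, at the cost of sequential access.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('tolocaliterator-sketch').getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1000), 10)

    # collect() materialises everything on the driver at once
    all_values = rdd.collect()

    # toLocalIterator() streams partition by partition; only one partition is resident at a time
    for value in rdd.toLocalIterator():
        pass  # process each value sequentially on the driver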

Re: ToLocalIterator vs collect

2017-01-05 Thread Richard Startin
Why not do that with Spark SQL to utilise the executors properly, rather than a sequential filter on the driver?

    Select * from A left join B on A.fk = B.fk where B.pk is NULL limit k

If you were sorting just so you could iterate in order, this might save you a couple of sorts too. https://rich
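A hedged DataFrame rendering of the same idea (table and column names are placeholders taken from the SQL above); a left anti join keeps the rows of A with no match in B and runs on the executors rather than on the driver:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('anti-join-sketch').getOrCreate()

    # placeholder data; in practice A and B are the two datasets from the question
    A = spark.createDataFrame([(1,), (2,), (3,), (4,)], ['fk'])
    B = spark.createDataFrame([(2,), (3,)], ['fk'])

    k = 2
    missing = A.join(B, on='fk', how='left_anti').limit(k)
    missing.show()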

unsubscribe

2017-01-05 Thread Nikola Z

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Marco Mistroni
If it is on the same host... it is expected. Afaik you cannot create >1 Spark context on the same host. All I can suggest is to run your apps outside the cluster and on 2 different hosts. If that fails you will need to put logs in your failing app to determine why it is failing. If you can send me a short snippet for the two

[Spark 2.1.0] Resource Scheduling Challenge in pyspark sparkSession

2017-01-05 Thread Palash Gupta
Hi User Team, I'm trying to schedule resources in Spark 2.1.0 using the below code, but still all the CPU cores are captured by a single Spark application and hence no other application is starting. Could you please help me out:

    sqlContext = SparkSession.builder.master("spark://172.26.7.192:7077").
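As a sketch (not a confirmed fix): on a standalone master an application grabs all available cores unless it is capped, so setting spark.cores.max (and optionally per-executor limits) when building the session leaves room for other applications; the numbers below are placeholders to adjust:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('spark://172.26.7.192:7077')
             .appName('app-1')
             .config('spark.cores.max', '4')        # total cores this app may take
             .config('spark.executor.cores', '2')   # cores per executor
             .config('spark.executor.memory', '4g')
             .getOrCreate())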

Re: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Steve Loughran
On 5 Jan 2017, at 09:58, Manohar753 <manohar.re...@happiestminds.com> wrote:

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Yes, it works to read the vertices and edges data from the S3 location and it is also able to write the checkpoint files to S3. It only fails when deleting the data, and that is because it tries to use the default file system. I tried looking up how to update the default file system but could not find anything

Re: Setting Spark Properties on Dataframes

2017-01-05 Thread neil90
This blog post (not mine) has some nice examples: https://hadoopist.wordpress.com/2016/08/19/how-to-create-compressed-output-files-in-spark-2-0/ From the blog:

    df.write.mode("overwrite").format("parquet").option("compression", "none").mode("overwrite").save("/tmp/file_no_compression_parq")
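A short PySpark sketch of the same options (output paths are placeholders and df is just a stand-in DataFrame):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('compression-sketch').getOrCreate()
    df = spark.range(100)  # placeholder DataFrame

    # codec names supported for Parquet output include none, snappy and gzip
    df.write.mode('overwrite').option('compression', 'snappy').parquet('/tmp/file_snappy_parq')
    df.write.mode('overwrite').option('compression', 'gzip').parquet('/tmp/file_gzip_parq')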

Re: Help in generating unique Id in spark row

2017-01-05 Thread Olivier Girardot
There is a way: you can use org.apache.spark.sql.functions.monotonicallyIncreasingId; it will give each row of your dataframe a unique id. On Tue, Oct 18, 2016 10:36 AM, ayan guha guha.a...@gmail.com wrote: Do you have any primary key or unique identifier in your data? Even if multiple column
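A small PySpark sketch of the same function (the snake_case name is the Python spelling); the ids are unique and increasing but not consecutive across partitions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import monotonically_increasing_id

    spark = SparkSession.builder.appName('unique-id-sketch').getOrCreate()
    df = spark.createDataFrame([('a',), ('b',), ('c',)], ['value'])

    # each row gets a 64-bit id encoding the partition and the row position within it
    df.withColumn('id', monotonically_increasing_id()).show()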

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Jon G
I don't use MapR but I use pyspark with jupyter, and this MapR blogpost looks similar to what I do for setup: https://community.mapr.com/docs/DOC-1874-how-to-use-jupyter-pyspark-on-mapr On Thu, Jan 5, 2017 at 3:05 AM, neil90 wrote:

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Marco Mistroni
Hi, might be off topic, but Databricks has a web application in which you can use Spark with Jupyter. Have a look at https://community.cloud.databricks.com kr On Thu, Jan 5, 2017 at 7:53 PM, Jon G wrote:

Re: Spark SQL - Applying transformation on a struct inside an array

2017-01-05 Thread Olivier Girardot
So, it seems the only way I found for now is recursive handling of the Row instances directly, but to do that I have to go back to RDDs. I've put together a simple test case demonstrating the problem:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.scalatest.{FlatSpec, Matchers}
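Not the poster's test case, but one commonly used workaround for the same problem in PySpark is to rebuild the array of structs through a UDF; the schema and field names below are invented purely for illustration:

    from pyspark.sql import SparkSession, Row
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    spark = SparkSession.builder.appName('struct-in-array-sketch').getOrCreate()

    item = StructType([StructField('name', StringType())])
    df = spark.createDataFrame(
        [Row(items=[Row(name='a'), Row(name='b')])],
        StructType([StructField('items', ArrayType(item))]))

    # rewrite every struct inside the array; each returned tuple maps back onto the struct type
    upper_names = udf(lambda items: [(x.name.upper(),) for x in items], ArrayType(item))
    df.withColumn('items', upper_names('items')).show(truncate=False)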

RE: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Manohar Reddy
Hi Steve, Thanks for the reply; below is the follow-up help needed from you. Do you mean we can set up two native file systems on a single SparkContext, so that based on the URL prefixes (gs://bucket/path and dest s3a://bucket-on-s3/path2) it will identify and write/read to/from the appropriate cloud? Is that my understanding correct?
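Roughly, yes: the scheme of each path (gs:// vs s3a://) selects the file system, so one application can read from one cloud and write to the other, provided both connectors are configured. A loose sketch, assuming the GCS connector and hadoop-aws JARs are on the classpath; buckets, key file path and credentials are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName('gcs-to-s3-sketch')
             # GCS service-account key file (placeholder path)
             .config('spark.hadoop.google.cloud.auth.service.account.json.keyfile', '/path/to/key.json')
             # AWS credentials for s3a (placeholders)
             .config('spark.hadoop.fs.s3a.access.key', 'YOUR_ACCESS_KEY')
             .config('spark.hadoop.fs.s3a.secret.key', 'YOUR_SECRET_KEY')
             .getOrCreate())

    # the URL scheme picks the file system on each side
    df = spark.read.text('gs://bucket/path')
    df.write.text('s3a://bucket-on-s3/path2')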

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
Right, I'd agree, it seems to be only with delete. Could you by chance run just the delete to see if it fails?

    FileSystem.get(sc.hadoopConfiguration)
      .delete(new Path(somepath), true)

From: Ankur Srivastava Sent: Thursday, January 5, 2017 10:05:03 AM To: Felix Che

Writing Parquet from Avro objects - cannot write null value for numeric fields

2017-01-05 Thread Sunita Arvind
Hello Experts, I am trying to allow null values in numeric fields. Here are the details of the issue I have: http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields I also tried making all columns nullable by using the below function (from one of the
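The post's own function isn't shown, but one common way to force every column nullable (a sketch, not necessarily what Sunita used) is to rebuild the schema with nullable fields and re-apply it:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField

    spark = SparkSession.builder.appName('nullable-sketch').getOrCreate()

    def with_nullable_columns(df):
        # copy each field with nullable=True and re-apply the schema via the RDD
        schema = StructType([StructField(f.name, f.dataType, True) for f in df.schema.fields])
        return spark.createDataFrame(df.rdd, schema)

    df = with_nullable_columns(spark.range(3))  # placeholder DataFrame
    df.printSchema()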

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Yes, I did try it out and it chooses the local file system as my checkpoint location starts with s3n://. I am not sure how I can make it load the S3FileSystem. On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung wrote:

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Ankur Srivastava
Adding the DEV mailing list to see if this is a defect with ConnectedComponents or if they can recommend any solution. Thanks Ankur On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava wrote:

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
This is likely a factor of your hadoop config and Spark rather than anything specific to GraphFrames. You might have better luck getting assistance if you could isolate the code to a simple case that manifests the problem (without GraphFrames), and repost.

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
Would it be more robust to use the Path when creating the FileSystem? https://github.com/graphframes/graphframes/issues/160
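In that spirit, a rough PySpark-side sketch (via the JVM gateway, with a placeholder path) of resolving the FileSystem from the Path itself, so an s3n:// location is deleted with the matching file system rather than the default one:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('fs-delete-sketch').getOrCreate()
    sc = spark.sparkContext

    hadoop_conf = sc._jsc.hadoopConfiguration()
    path = sc._jvm.org.apache.hadoop.fs.Path('s3n://some-bucket/checkpoints')  # placeholder path

    # Path.getFileSystem resolves the implementation from the URL scheme
    fs = path.getFileSystem(hadoop_conf)
    fs.delete(path, True)  # recursive delete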

unsubscribe

2017-01-05 Thread bobwang

newAPIHadoopFile bad performance

2017-01-05 Thread Mudasar
Hi, I am using newAPIHadoopFile to process a large number of S3 files (around 20 thousand) by passing the URLs as a comma-separated String. It takes around *7 minutes* to start the job. I am running the job on EMR 5.2.0 with Spark 2.0.2. Here is the code:

    Configuration conf = new Configuration();
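For reference, a rough PySpark rendering of the call pattern described (the post uses the Java API; TextInputFormat and the paths below are placeholders, since the actual InputFormat is not shown in the snippet):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('newapihadoopfile-sketch').getOrCreate()
    sc = spark.sparkContext

    # comma-separated list of input paths, as in the post
    paths = 's3a://bucket/file1,s3a://bucket/file2'

    rdd = sc.newAPIHadoopFile(
        paths,
        'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
        'org.apache.hadoop.io.LongWritable',
        'org.apache.hadoop.io.Text')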