Spark big rdd problem

2015-12-15 Thread Eran Witkon
When running val data = sc.wholeTextFiles("someDir/*") followed by data.count() I get numerous warnings from YARN until I get an Akka association exception. Can someone explain what happens when Spark loads this RDD and can't fit it all in memory? Based on the exception it looks like the server is disconnecting from
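A minimal sketch of the direction the thread takes (the overhead value below is an assumption, not from the thread): if YARN is killing containers, raise the off-heap overhead and/or executor memory rather than only the driver-side settings.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: values must be tuned to the actual cluster.
    val conf = new SparkConf()
      .setAppName("whole-text-count")
      .set("spark.executor.memory", "5g")                  // figure mentioned later in the thread
      .set("spark.yarn.executor.memoryOverhead", "1024")   // assumed value; raised when YARN kills containers
    val sc = new SparkContext(conf)

    val data = sc.wholeTextFiles("someDir/*")
    println(data.count())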

Re: Spark big rdd problem

2015-12-15 Thread Eran Witkon
e find the cause. > > Thanks. > > Zhan Zhang > > On Dec 15, 2015, at 11:50 AM, Eran Witkon wrote: > > > When running > > val data = sc.wholeTextFile("someDir/*") data.count() > > > > I get numerous warning from yarn till I get aka association ex

Re: Spark big rdd problem

2015-12-15 Thread Eran Witkon
> Thanks. > > Zhan Zhang > > On Dec 15, 2015, at 9:58 PM, Eran Witkon wrote: > > If the problem is containers trying to use more memory than they are allowed, > how do I limit them? I already have executor-memory 5G > Eran > On Tue, 15 Dec 2015 at 23:10 Zhan Zhang wrot

Re: Spark big rdd problem

2015-12-16 Thread Eran Witkon
ec 2015 at 08:27 Eran Witkon wrote: > But what if I don't have more memory? > On Wed, 16 Dec 2015 at 08:13 Zhan Zhang wrote: > >> There are two cases here. If the container is killed by yarn, you can >> increase jvm overhead. Otherwise, you have to increase the executor-

WholeTextFile for 8000~ files - problem

2015-12-16 Thread Eran Witkon
Hi, I have about 8K files in about 10 directories on HDFS and I need to add a column to all files with the file name (e.g. file1.txt adds a column with "file1.txt", file2.txt with "file2.txt", etc.). The current approach was to read all files using sc.wholeTextFiles("myPath") and have the file name as k
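A sketch of one way to carry the file name along (path and partition count are assumptions based on the question): wholeTextFiles already yields (path, content) pairs, so the name can be derived from the key.

    // (path, content) pairs; the second argument is a minimum number of partitions.
    val files = sc.wholeTextFiles("hdfs:///myPath/*", 100)
    val withName = files.map { case (path, content) =>
      val fileName = path.split("/").last   // e.g. "file1.txt"
      (fileName, content)
    }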

Re: WholeTextFile for 8000~ files - problem

2015-12-16 Thread Eran Witkon
As a follow-up to all of this, when I try increasing the number of partitions (sc.wholeTextFiles("/MySource/dir1/*", 8)) I get the out-of-memory error much faster. Eran On Wed, Dec 16, 2015 at 5:23 PM Eran Witkon wrote: > Hi, > I have about 8K files on about 10 directories on hd

Using Spark to process JSON with gzip filed

2015-12-16 Thread Eran Witkon
Hi, I have a few JSON files in which one of the fields is a binary field - this field is the output of running GZIP on a JSON stream and storing the compressed result in the binary field. Now I want to decompress the field and get the output JSON. I was thinking of running a map operation and passing a function to
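A possible helper for that map function, sketched under the assumption that the binary field arrives as a Base64 string holding GZIP-compressed bytes (the thread itself later uses an Inflater-based variant):

    import java.io.ByteArrayInputStream
    import java.util.Base64
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    // Decode the Base64 text, then stream the bytes through a GZIP decompressor back to a JSON string.
    def gunzipField(encoded: String): String = {
      val bytes = Base64.getDecoder.decode(encoded)
      val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
      try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
    }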

Can't run spark on yarn

2015-12-17 Thread Eran Witkon
Hi, I am trying to install Spark 1.5.2 on Apache Hadoop 2.6 with Hive and YARN. spark-env.sh export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop bash_profile #HADOOP VARIABLES START export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTAL

Re: Using Spark to process JSON with gzip filed

2015-12-19 Thread Eran Witkon
ount) > while (count > 0) { > count = inflater.inflate(decompressedData) > finalData = finalData ++ decompressedData.take(count) > } > new String(finalData) > }) > > > > > Thanks > Best Regards > > On Wed, Dec 16, 2015 at 10:02 PM, Er

spark 1.5.2 memory leak? reading JSON

2015-12-19 Thread Eran Witkon
Hi, I tried the following code in spark-shell on Spark 1.5.2: val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json") df.count() 15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 3 15/12/19 23:49:40 ERROR Executor: Exception in

combining multiple JSON files to one DataFrame

2015-12-19 Thread Eran Witkon
Hi, Can I combine multiple JSON files into one DataFrame? I tried val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*") but I get an empty DF Eran

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Eran Witkon
> Just point loader to the folder. You do not need * > On Dec 19, 2015 11:21 PM, "Eran Witkon" wrote: > >> Hi, >> Can I combine multiple JSON files to one DataFrame? >> >> I tried >> val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*") >> but I get an empty DF >> Eran >> >
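A sketch of the suggested fix, using the same path minus the wildcard:

    // Point the JSON reader at the directory; it reads every file inside it.
    val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample")
    df.count()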

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Eran Witkon
Disregard my last question - my mistake. I accessed it as a column, not as a row: jsonData.first.getAs[String]("cty") Eran On Sun, Dec 20, 2015 at 11:42 AM Eran Witkon wrote: > Thanks, that works. > One other thing - > I have the following code: > > val jsonData

DataFrame operations

2015-12-20 Thread Eran Witkon
Hi, I am a bit confused with DataFrame operations. I have a function which takes a string and returns a string. I want to apply this function to all rows of a single column in my DataFrame. I was thinking of the following: jsonData.withColumn("computedField", computeString(jsonData("hse"))) BUT js
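The usual way to apply a plain String => String function column-wise is to wrap it as a UDF first; a sketch using the names from the question (computeString and the "hse" column):

    import org.apache.spark.sql.functions.udf

    // Wrap the existing Scala function so Spark can apply it to every row of the column.
    val computeStringUdf = udf((s: String) => computeString(s))
    val result = jsonData.withColumn("computedField", computeStringUdf(jsonData("hse")))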

error: not found: value StructType on 1.5.2

2015-12-20 Thread Eran Witkon
Hi, I am using spark-shell with version 1.5.2. scala> sc.version res17: String = 1.5.2 but when trying to use StructType I am getting an error: val struct = StructType( StructField("a", IntegerType, true) :: StructField("b", LongType, false) :: StructField("c", BooleanType, false) :: Ni

Re: error: not found: value StructType on 1.5.2

2015-12-20 Thread Eran Witkon
Yes, this works... Thanks On Sun, Dec 20, 2015 at 3:57 PM Peter Zhang wrote: > Hi Eran, > > Missing import package. > > import org.apache.spark.sql.types._ > > will work. please try. > > Peter Zhang > -- > Google > Sent with Airmail > > On December
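With that import in place, the definition from the original question compiles; reconstructed here for completeness:

    import org.apache.spark.sql.types._

    val struct = StructType(
      StructField("a", IntegerType, true) ::
      StructField("b", LongType, false) ::
      StructField("c", BooleanType, false) :: Nil)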

How to convert an RDD to a DF?

2015-12-20 Thread Eran Witkon
Hi, I have an RDD jsonGzip res3: org.apache.spark.rdd.RDD[(String, String, String, String)] = MapPartitionsRDD[8] at map at :65 which I want to convert to a DataFrame with a schema, so I created a schema: val schema = StructType( StructField("cty", StringType, false) :: StructField("hse"

Re: How to convert an RDD to a DF?

2015-12-20 Thread Eran Witkon
/people.txt").map( >* _.split(",")).map(p => Row(p(0), p(1).trim.toInt)) >* val dataFrame = sqlContext.createDataFrame(people, schema) >* dataFrame.printSchema >* // root >* // |-- name: string (nullable = false) >* // |-- age: integ

Re: How to convert an RDD to a DF?

2015-12-20 Thread Eran Witkon
Got it to work, thanks On Sun, 20 Dec 2015 at 17:01 Eran Witkon wrote: > I might be missing your point but I don't get it. > My understanding is that I need an RDD containing Rows but how do I get it? > > I started with a DataFrame > ran a map on it and got the RDD [string,stri

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Eran Witkon
r XML, btw. there's lots of wonky workarounds > for this that use MapPartitions and all kinds of craziness. the best > option, in my opinion, is to just ETL/flatten the data to make the > DataFrame reader happy. > > On Dec 19, 2015, at 4:55 PM, Eran Witkon

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Eran Witkon
that it cannot parse. > Instead, it will put the entire record to the column of "_corrupt_record". > > Thanks, > > Yin > > On Sun, Dec 20, 2015 at 9:37 AM, Eran Witkon wrote: > >> Thanks for this! >> This was the problem... >> >> On Sun, 20 De
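A small sketch of how to inspect such records (path reused from the earlier message; note the _corrupt_record column only appears when the reader actually hit unparseable lines):

    val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")
    // Show a few of the records the JSON reader could not parse.
    if (df.columns.contains("_corrupt_record")) {
      df.select("_corrupt_record").where(df("_corrupt_record").isNotNull).show(5, false)
    }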

Re: DataFrame operations

2015-12-20 Thread Eran Witkon
Problem resolved, syntax issue )-: On Mon, 21 Dec 2015 at 06:09 Jeff Zhang wrote: > If it does not return a column you expect, then what does this return ? Do > you will have 2 columns with the same column name ? > > On Sun, Dec 20, 2015 at 7:40 PM, Eran Witkon wrote: > >>

fishing for help!

2015-12-21 Thread Eran Witkon
Hi, I know it is a broad question, but can you think of reasons why a pyspark job which runs from server 1 using user 1 will run faster than the same job when running on server 2 with user 1? Eran

Using IntelliJ for spark development

2015-12-21 Thread Eran Witkon
Any pointers on how to use IntelliJ for Spark development? Any way to use a Scala worksheet to run like spark-shell?
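One common setup (project name and dependency list are assumptions, not from the thread) is a small sbt project that IntelliJ's Scala plugin can import directly; a worksheet in that project then compiles against the same Spark libraries:

    // build.sbt sketch for local development against Spark 1.5.2
    name := "spark-sandbox"
    scalaVersion := "2.10.5"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.2",
      "org.apache.spark" %% "spark-sql"  % "1.5.2"
    )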

Re: fishing for help!

2015-12-21 Thread Eran Witkon
is assuming that you're > running the driver in non-cluster deploy mode (so the driver process runs > on the machine which submitted the job). > > On Mon, Dec 21, 2015 at 1:30 PM, Igor Berman > wrote: > >> look for differences: packages versions, cpu/network/

Re: Using IntelliJ for spark development

2015-12-23 Thread Eran Witkon
On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon > wrote: > >> Any pointers how to use InteliJ for spark development? >> Any way to use scala worksheet run like spark- shell? >> > >

Re: Using IntelliJ for spark development

2015-12-23 Thread Eran Witkon
rstanding > http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits > > Thanks > Best Regards > > On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon wrote: > >> Thanks, all of these examples shows how to link to spark source and build >> it as part of

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Eran Witkon
Did you get a solution for this? On Tue, 22 Dec 2015 at 20:24 raja kbv wrote: > Hi, > > I am new to spark. > > I have a text file with below structure. > > > (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, > Description, Duriation, Role}]}) > Eg: > (123456, Employee1, {“

Extract compressed JSON within JSON

2015-12-24 Thread Eran Witkon
Hi, I have a JSON file with the following row format: {"cty":"United Kingdom","gzip":"H4sIAKtWystVslJQcs4rLVHSUUouqQTxQvMyS1JTFLwz89JT8nOB4hnFqSBxj/zS4lSF/DQFl9S83MSibKBMZVExSMbQwNBM19DA2FSpFgDvJUGVUw==","nm":"Edmund lronside","yrs":"1016"} The gzip field is a compressed JSON by itsel

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-24 Thread Eran Witkon
JSON string representation of the whole line and you have a nested JSON schema which SparkSQL can read. Eran On Thu, Dec 24, 2015 at 10:26 AM Eran Witkon wrote: > I don't have the exact answer for you but I would look for something using > explode method on DataFrame > > On Thu

Re: Extract compressed JSON within JSON

2015-12-24 Thread Eran Witkon
t;nm": \"$nm\" , "yrs": \"$yrs\"}"""}) See this link for source http://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes Eran On Thu, Dec 24, 2015 at 11:42 AM Eran Witkon
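Putting the pieces together, a sketch (column names from the sample row; the "payload" field name and the gunzipField helper sketched earlier are assumptions): decompress the gzip field per row, rebuild a single JSON string, and let read.json infer the nested schema.

    // df is the DataFrame read from the original file; in Spark 1.5, df.map yields an RDD[Row].
    val jsonLines = df.map { r =>
      val nm   = r.getAs[String]("nm")
      val yrs  = r.getAs[String]("yrs")
      val body = gunzipField(r.getAs[String]("gzip"))
      s"""{"nm": "$nm", "yrs": "$yrs", "payload": $body}"""
    }
    val flat = sqlContext.read.json(jsonLines)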

Re: How to ignore case in dataframe groupby?

2015-12-24 Thread Eran Witkon
Use df.withColumn("upper-code", upper(df("countrycode"))) or just run a map function that does the same On Thu, Dec 24, 2015 at 2:05 PM Bharathi Raja wrote: > Hi, > Values in a dataframe column named countrycode are in different cases. Eg: > (US, us). groupBy & count gives two rows but the requ
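A sketch of the normalize-then-group version, with column names taken from the thread:

    import org.apache.spark.sql.functions.upper

    // Group on an upper-cased copy of the column so "US" and "us" count as one key.
    val counts = df.withColumn("upper-code", upper(df("countrycode")))
      .groupBy("upper-code")
      .count()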

Re: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread Eran Witkon
If you drop the other columns (or map to a new DF with only that column) and call collect, I think you will get what you want. On Fri, 25 Dec 2015 at 10:26 fightf...@163.com wrote: > Emm...I think you can do a df.map and store each column value to your list. > > -- > fightf
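A sketch of that idea (the column name "someCol" is a placeholder): keep only the column of interest, then collect it to a local array.

    // select narrows to one column; map extracts the value; collect brings it to the driver.
    val values: Array[String] = df.select("someCol").map(_.getString(0)).collect()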

Re: How to ignore case in dataframe groupby?

2015-12-30 Thread Eran Witkon
lumn("upper-code",upper(df("countrycode"))). > > This creates a new column "upper-code". Is there a way to update the > column or create a new df with update column? > > Thanks, > Raja > > On Thursday, 24 December 2015 6:17 PM, Eran Witkon > wrote:

Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Eran Witkon
Can you name the features that make Databricks better than Zeppelin? Eran On Fri, 29 Jan 2016 at 01:37 Michal Klos wrote: > We use both databricks and emr. We use databricks for our exploratory / > adhoc use cases because their notebook is pretty badass and better than > Zeppelin IMHO. > > We use