Re: Reading TB of JSON file

2020-06-18 Thread nihed mbarek
Hi, What is the size of one JSON document? There is also the scan of your JSON to define the schema; the overhead can be huge. Two solutions: define a schema and use it directly during the load, or ask Spark to analyse a small part of the JSON file (I don't remember how to do it). Regards, On Thu, Ju
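
Both suggestions can be sketched roughly as follows (a minimal sketch, assuming Spark 2.x with a SparkSession named spark and a hypothetical path /data/events.json; the schema fields are illustrative, and samplingRatio is the "analyse a small part" option):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("json-schema-example").master("local[*]").getOrCreate()

    // Option 1: supply the schema up front, skipping the full inference scan.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("payload", StringType)
    ))
    val withSchema = spark.read.schema(schema).json("/data/events.json")

    // Option 2: let Spark infer the schema from a small sample of the records.
    val sampled = spark.read.option("samplingRatio", "0.01").json("/data/events.json")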

Re: Spark 2.4.4 with Hadoop 3.2.0

2019-11-25 Thread nihed mbarek
Hi, Spark 2.x is already part of Cloudera CDH6, which is based on Hadoop 3.x, so they officially support Spark 2 + Hadoop 3. So for sure, there has been testing and development done on that side. On the other hand, I don't know the status for Hadoop 3.2. Regards, On Tue, Nov 26, 2019 at 1:46 AM Alfredo Marquez wr

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-14 Thread nihed mbarek
Hi, I have already seen a project with the same idea: https://github.com/cloudera-labs/envelope Regards, On Wed, 14 Jun 2017 at 04:32, bo yang wrote: > Thanks Benjamin and Ayan for the feedback! You kind of represent two groups > of people who need such a script tool or not. Personally I find the script

Re: Concatenate the columns in dataframe to create new collumns using Java

2016-07-18 Thread nihed mbarek
give C0 and C1 columns, I am looking to write a generic > function that concatenates the columns depending on input columns. > > like if I have something > String str = "C0,C1,C2" > > Then it should work as > > DataFrame training = orgdf.withColumn("I1", &g

Re: Concatenate the columns in dataframe to create new collumns using Java

2016-07-18 Thread nihed mbarek
sons = new ArrayList<>(); persons.add(new Person("nihed", "mbarek", "nihed.com")); persons.add(new Person("mark", "zuckerberg", "facebook.com")); DataFrame df = sqlContext.createDataFrame(persons, Person.cla
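
The generic helper asked about in this thread could be sketched along these lines (a rough sketch, assuming a DataFrame orgdf that already contains the columns listed in the input string; functions.concat does the actual concatenation):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, concat}

    // Build the column list from the comma-separated string, then concatenate.
    def concatColumns(df: DataFrame, newCol: String, columns: String): DataFrame = {
      val cols = columns.split(",").map(name => col(name.trim))
      df.withColumn(newCol, concat(cols: _*))
    }

    // Usage, mirroring the example from the question:
    // val training = concatColumns(orgdf, "I1", "C0,C1,C2")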

Re: spark.executor.cores

2016-07-15 Thread nihed mbarek
Can you try with: SparkConf conf = new SparkConf().setAppName("NC Eatery app").set( "spark.executor.memory", "4g") .setMaster("spark://10.0.100.120:7077"); if (restId == 0) { conf = conf.set("spark.executor.cores", "22"); } else { conf = conf.set("spark.executor.cores", "2"); } JavaSparkContext ja
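
Roughly the same idea in Scala, as a hedged sketch (the restId flag, master URL, and core/memory values come from the snippet above and are illustrative only):

    import org.apache.spark.{SparkConf, SparkContext}

    val restId = 0  // hypothetical flag from the original thread

    val base = new SparkConf()
      .setAppName("NC Eatery app")
      .set("spark.executor.memory", "4g")
      .setMaster("spark://10.0.100.120:7077")

    // Pick the executor core count depending on the flag, then build the context.
    val conf =
      if (restId == 0) base.set("spark.executor.cores", "22")
      else base.set("spark.executor.cores", "2")

    val sc = new SparkContext(conf)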

Re: remove row from data frame

2016-07-05 Thread nihed mbarek
Hi, you can do multiple filters to keep the data that you need. Regards, On Tue, Jul 5, 2016 at 5:38 PM, pseudo oduesp wrote: > Hi, > how can I remove rows from a data frame based on some condition on some > columns? > Thanks > -- M'BAREK Med Nihed, Fedora Ambassador, TUNISIA, Northern Africa htt
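
A minimal sketch of the chained-filter approach, assuming Spark 2.x with a SparkSession named spark; the column names and values are made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("filter-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 30, "TN"), ("bob", 15, "FR"), ("carol", 40, "unknown"))
      .toDF("name", "age", "country")

    // Keep only the rows that satisfy the conditions; everything else is dropped.
    val kept = df
      .filter($"age" >= 18)
      .filter($"country" =!= "unknown")

    kept.show()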

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread nihed mbarek
Hi, Are you using a new version of Kafka? If yes, since 0.9 the auto.offset.reset parameter takes: - earliest: automatically reset the offset to the earliest offset - latest: automatically reset the offset to the latest offset - none: throw an exception to the consumer if no previous offset is
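
For a batch read, the same choice surfaces as the startingOffsets option of the Kafka source. A sketch, assuming a SparkSession named spark and the spark-sql-kafka-0-10 package on Spark 2.0+ (newer than the versions discussed in this thread), with made-up broker and topic names:

    // Bounded (batch) read of a Kafka topic, from the earliest to the latest offset.
    val kafkaDf = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")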

Re: removing header from csv file

2016-04-26 Thread nihed mbarek
You can add a filter with a string that you are sure is available only in the header. On Wednesday, 27 April 2016, Divya Gehlot wrote: > yes, you can remove the header by removing the first row > > you can use first() or head() to do that > > > Thanks, > Divya > > On 27 April 2016 at 13:24, Ashutosh Kumar >
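
A minimal sketch combining both suggestions, assuming a SparkContext named sc and a hypothetical file path; it assumes the header is the first line of the file:

    val lines = sc.textFile("/data/input.csv")

    // head()/first() gives the header row; filter drops every line equal to it.
    val header = lines.first()
    val data = lines.filter(line => line != header)

    // Alternatively, filter on a string you know appears only in the header:
    // val data = lines.filter(line => !line.startsWith("col1,col2"))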

Best practices repartition key

2016-04-22 Thread nihed mbarek
Hi, I'm looking for documentation or best practices about choosing a key or keys for repartitioning a dataframe or RDD. Thank you, MBAREK nihed -- M'BAREK Med Nihed, Fedora Ambassador, TUNISIA, Northern Africa http://www.nihed.com
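
For illustration, repartitioning by an explicit key column looks like this (a small sketch, assuming a DataFrame df with a hypothetical customer_id column; the usual guidance is to pick a key whose values are evenly distributed so that no partition ends up disproportionately large):

    import org.apache.spark.sql.functions.col

    // Hash-partition the data into 200 partitions by the chosen key column.
    val byCustomer = df.repartition(200, col("customer_id"))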

Spark 1.6.1 already maximum pages

2016-04-21 Thread nihed mbarek
Hi, I just got an issue with my execution on Spark 1.6.1. I'm trying to join two dataframes, one with 5 partitions and the second, small, with 2 partitions. Spark SQL shuffle partitions is equal to 256000. Any idea? java.lang.IllegalStateException: Have already allocated a maximum of 8192 pages
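
Not a confirmed fix, but for reference this is where the value mentioned above is set; spark.sql.shuffle.partitions controls the number of post-shuffle partitions, and 256000 is far above the default of 200:

    // On Spark 1.6 it can be set on the SQLContext (or with --conf at submit time).
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")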

Re: prefix column Spark

2016-04-19 Thread nihed mbarek
: DataFrame = { > val colNames = dataFrame.columns > colNames.foldLeft(dataFrame){ > (df, colName) => { > df.withColumnRenamed(colName, s"${prefix}_${colName}") > } > } > } > > cheers, > Ardo > > > On Tue, Apr 19, 2016

prefix column Spark

2016-04-19 Thread nihed mbarek
Hi, I want to add a prefix to the columns of a set of dataframes and I tried two solutions: * a for loop calling withColumnRenamed based on columns() * transforming my DataFrame to an RDD, updating the old schema, and recreating the dataframe. Both are working for me; the second one is faster with tables that contain 800
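
A minimal sketch of the second (schema-rewrite) approach, assuming a SQLContext named sqlContext, an input DataFrame df, and a prefix string:

    import org.apache.spark.sql.types.StructType

    // Copy the schema with every field renamed, then rebuild the DataFrame
    // from the underlying RDD[Row] so only one projection is needed.
    val prefixedSchema = StructType(df.schema.fields.map(f => f.copy(name = s"${prefix}_${f.name}")))
    val renamed = sqlContext.createDataFrame(df.rdd, prefixedSchema)

    // A one-liner alternative that renames all columns in a single projection:
    // val renamed = df.toDF(df.columns.map(c => s"${prefix}_${c}"): _*)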

Spark Yarn closing sparkContext

2016-04-14 Thread nihed mbarek
Hi, I have an issue with closing my application context: the process takes a long time and fails at the end. On the other hand, my result was generated in the output folder and the _SUCCESS file was created. I'm using Spark 1.6 with YARN. Any idea? Regards, -- MBAREK Med Nihed, Fedora Ambassador, TUNIS

Re: How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread nihed mbarek
the Hadoop config it should do it. > > > From: nihed mbarek > Date: Friday, April 8, 2016 at 12:01 PM > To: User@spark.apache.org > Subject: How to configure parquet.block.size on Spark 1.6 > > Hi >
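
A sketch of the Hadoop-configuration route suggested above, assuming a SparkContext named sc and a DataFrame df to write; the 128 MB value and output path are only examples:

    // Parquet picks this up from the Hadoop configuration of the write job.
    sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
    df.write.parquet("/tmp/output")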

How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread nihed mbarek
Hi, how do I configure parquet.block.size on Spark 1.6? Thank you, Nihed MBAREK -- M'BAREK Med Nihed, Fedora Ambassador, TUNISIA, Northern Africa http://www.nihed.com <http://tn.linkedin.com/in/nihed>

Join FetchFailedException

2016-04-01 Thread nihed mbarek
Hi, I have a big dataframe (100 GB) that I need to join with 3 other dataframes. For the first join, it's OK. For the second, it's OK. But for the third, just after the big shuffle, before the execution of the stage, I get an exception: org.apache.spark.shuffle.FetchFailedException: java.io.FileNot