Hi Aseem, Thank you very much for your help.
Please allow me to be more specific about my case (to some extent I already do what you suggested).

Let us imagine that I have two CSV datasets, d1 and d2. I generate the Dataset<Row> for each as follows:

== Reading d1:

    sparkSession=spark;
    options = new HashMap();
    options.put("header", "true");
    options.put("delimiter", delimiter);
    options.put("nullValue", nullValue);
    DataFrameReader d1_DFR = spark.read().options(options);
    this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);

== Reading d2:

    sparkSession=spark;
    options = new HashMap();
    options.put("header", "true");
    options.put("delimiter", delimiter);
    options.put("nullValue", nullValue);
    DataFrameReader d2_DFR = spark.read().options(options);
    this.dataset2 = d2_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);

So far, I have the header set to true.

Now, let us imagine that we need to do a join between the two datasets:

    Dataset<Row> dataset1_Join_dataset2 = dataset1.join(dataset2, "some condition");

All of the processing below (Step 1, Step 2 and Step 3) starts from dataset1_Join_dataset2. In particular, I realised that after these steps

== Step 1: transform the Dataset<Row> into JavaRDD<Row>

    JavaRDD<Row> dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)

    Row header = dataPointsWithHeader.first();

the header is not the first().

So my question still is: Is there an efficient way to access the header and eliminate it?

Many thanks in advance for your support.

Best Regards,
Carlo

On 3 Aug 2016, at 18:13, Aseem Bansal <asmbans...@gmail.com> wrote:

Hi

Depending on how you are reading the data in the first place, can you simply use the header as header instead of a row?
http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq)

See the header option

On Wed, Aug 3, 2016 at 10:14 PM, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:

Hi All,

I would like to apply a regression to my data. One step of the workflow is to prepare my data as a JavaRDD<LabeledPoint>, starting from a Dataset<Row> with its header. So, what I did was the following:

== Step 1: transform the Dataset<Row> into JavaRDD<Row>

    JavaRDD<Row> dataPointsWithHeader = modelDS.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)

    Row header = dataPointsWithHeader.first();

== Step 3: eliminate the header row by

    JavaRDD<Row> dataPointsWithoutHeader = dataPointsWithHeader.filter((Row row) -> {
        return !row.equals(header);
    });

The issues with the above approach are:
a) the result of Step 2 is not the header row;
b) Step 3 is very inefficient if there is a way to access the header directly.

My question is: Is there an efficient way to access the header and eliminate it?

Many thanks in advance for your help and suggestions.

Regards,
Carlo
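
The suggestion to "use the header as header instead of a row" can be illustrated with a minimal plain-Java sketch that has no Spark dependency. The class HeaderSketch, its Table holder, the readWithHeader method and the sample data are all hypothetical and only mimic what a header-aware reader is assumed to do: the first line becomes the column names and never appears among the data rows, so first() on the rows returns data and there is nothing left to filter out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Spark code.
final class HeaderSketch {

    /** Keeps the column names separate from the data rows. */
    static final class Table {
        final String[] columns;     // taken from the header line
        final List<String[]> rows;  // data only, header excluded

        Table(String[] columns, List<String[]> rows) {
            this.columns = columns;
            this.rows = rows;
        }
    }

    /** Reads delimited lines, consuming the first line as column names. */
    static Table readWithHeader(List<String> lines, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return new Table(new String[0], rows);
        }
        // Quote the delimiter so characters such as "|" are taken literally.
        String regex = Pattern.quote(delimiter);
        String[] columns = lines.get(0).split(regex, -1);
        for (String line : lines.subList(1, lines.size())) {
            rows.add(line.split(regex, -1));
        }
        return new Table(columns, rows);
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<String> lines = Arrays.asList("category,rank", "books,1", "music,2");
        Table t = readWithHeader(lines, ",");
        System.out.println(Arrays.toString(t.columns));     // [category, rank]
        System.out.println(Arrays.toString(t.rows.get(0))); // [books, 1]
    }
}
```

In Spark itself, the counterpart of t.columns would be Dataset.columns() (or the field names of the schema). With the header option set to true, the rows obtained from toJavaRDD() should therefore already be free of the header, which would make Step 3 unnecessary.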