Re: Dataset and JavaRDD: how to eliminate the header.

Carlo . Allocca Wed, 03 Aug 2016 10:33:09 -0700

Hi Aseem,

Thank you very much for your help.


Please allow me to be more specific about my case (to some extent I already do 
what you suggested):

Let us imagine that I have two CSV datasets, d1 and d2. I generate each 
Dataset<Row> as follows:

== Reading d1:

sparkSession = spark;

        options = new HashMap<String, String>();
        options.put("header", "true");
        options.put("delimiter", delimiter);
        options.put("nullValue", nullValue);
        DataFrameReader d1_DFR = spark.read().options(options);
        this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);

== Reading d2

sparkSession = spark;

        options = new HashMap<String, String>();
        options.put("header", "true");
        options.put("delimiter", delimiter);
        options.put("nullValue", nullValue);
        DataFrameReader d2_DFR = spark.read().options(options);
        this.dataset2 = d2_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);


So far, I have the header set to true.

Now, let us imagine that we need to join the two datasets:

Dataset<Row> dataset1_Join_dataset2 = dataset1.join(dataset2, "some condition");

All of the steps below (Step 1, Step 2 and Step 3) start from 
dataset1_Join_dataset2. In particular, I realised that after these steps:

== Step 1: transform the Dataset<Row> into a JavaRDD<Row>
        JavaRDD<Row> dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)
        Row header = dataPointsWithHeader.first();

the row returned by first() is not the header.
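
This observation is consistent with the "header" option: when it is "true", the reader uses the first line of the file for the column names, so that line never becomes a row. A minimal Spark-free sketch of that behaviour in plain Java follows. The class HeaderAwareCsv and its parse method are hypothetical names used only for illustration; they are not part of Spark's API, and the parsing deliberately ignores quoting and escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; this is NOT Spark's CSV reader.
class HeaderAwareCsv {
    final List<String> columns;   // column names taken from the header line (empty if header == false)
    final List<String[]> rows;    // data rows only

    private HeaderAwareCsv(List<String> columns, List<String[]> rows) {
        this.columns = columns;
        this.rows = rows;
    }

    // Splits simple delimiter-separated lines; no quoting or escaping is handled.
    static HeaderAwareCsv parse(List<String> lines, String delimiter, boolean header) {
        List<String> columns = new ArrayList<>();
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String[] fields = lines.get(i).split(Pattern.quote(delimiter), -1);
            if (i == 0 && header) {
                // The header line is consumed as column names and never stored as a row.
                columns.addAll(Arrays.asList(fields));
            } else {
                rows.add(fields);
            }
        }
        return new HeaderAwareCsv(columns, rows);
    }

    public static void main(String[] args) {
        // Made-up sample data for the sketch.
        List<String> lines = Arrays.asList("category,rank", "books,1", "music,2");
        HeaderAwareCsv parsed = parse(lines, ",", true);
        System.out.println(parsed.columns);
        System.out.println(Arrays.toString(parsed.rows.get(0)));
    }
}
```

With header set to true, rows.get(0) is the first data record and the names are held separately in columns. In Spark, the analogous accessor for the names is Dataset.columns().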

So my question still is:

Is there an efficient way to access the header and eliminate it?

Many thanks in advance for your support.

Best Regards,
Carlo




On 3 Aug 2016, at 18:13, Aseem Bansal <asmbans...@gmail.com> wrote:

Hi

Depending on how you are reading the data in the first place, can you simply 
use the header as a header instead of as a row?

http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq)

See the header option

On Wed, Aug 3, 2016 at 10:14 PM, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
Hi All,

I would like to apply a regression to my data. One step of the workflow is to 
prepare my data as a JavaRDD<LabeledPoint>, starting from a Dataset<Row> with 
its header. So, what I did was the following:

== Step 1: transform the Dataset<Row> into a JavaRDD<Row>
        JavaRDD<Row> dataPointsWithHeader = modelDS.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)
        Row header = dataPointsWithHeader.first();

== Step 3: eliminate the header row by
        JavaRDD<Row> dataPointsWithoutHeader = dataPointsWithHeader.filter((Row row) -> {
                return !row.equals(header);
            });

The issues with the above approach are:

a) the result of Step 2 is not the header row;
b) Step 3 is very inefficient if there is a direct way to access the header.
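
For data that still contains a header line, one commonly used alternative to comparing every row against the header is to drop only the first element of the first partition. The sketch below models partitions as plain Java lists, so it is not Spark code; the class HeaderSkipper and its method are hypothetical names for illustration. It assumes the header really is the first element of the first partition, which holds for a freshly read text file but not necessarily after a join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper modelling RDD partitions as plain lists; not Spark code.
class HeaderSkipper {
    // Returns a copy of the partitions with the first element of partition 0 removed.
    // Every other element is kept, including data rows that happen to equal the header.
    static <T> List<List<T>> dropFirstOfFirstPartition(List<List<T>> partitions) {
        List<List<T>> result = new ArrayList<>();
        for (int index = 0; index < partitions.size(); index++) {
            List<T> partition = partitions.get(index);
            if (index == 0 && !partition.isEmpty()) {
                // Skip position 0 only; no per-row equality check is needed.
                result.add(new ArrayList<>(partition.subList(1, partition.size())));
            } else {
                result.add(new ArrayList<>(partition));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: the header sits at the start of the first partition.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("category,rank", "books,1"),
                Arrays.asList("music,2", "films,3"));
        System.out.println(dropFirstOfFirstPartition(partitions));
    }
}
```

In Spark, the same shape is usually expressed with JavaRDD.mapPartitionsWithIndex, dropping the first element when the partition index is 0. Unlike the equality filter in Step 3, this does not remove data rows that happen to equal the header.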

My question is:

Is there an efficient way to access the header and eliminate it?

Many thanks in advance for your help and suggestions.

Regards,
Carlo