Re: Dataset and JavaRDD: how to eliminate the header.

Aseem Bansal Wed, 03 Aug 2016 10:14:21 -0700

Hi

Depending on how you are reading the data in the first place, can you
simply use the header as a header instead of a row?


http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq)

See the header option
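
A minimal sketch of what I mean, assuming your data starts out as a CSV file and that you are on Spark 2.x. The class name HeaderOptionSketch, the helper readWithHeader, and the temporary file with its column names are hypothetical, made up only for illustration. With the header option set, the first line of the file becomes the column names and never shows up as a Row, so there is nothing to filter out afterwards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HeaderOptionSketch {

    // Hypothetical helper: reads a CSV file and lets Spark consume the
    // first line as column names instead of returning it as a data row.
    static Dataset<Row> readWithHeader(SparkSession spark, String path) {
        return spark.read()
                .option("header", "true")       // first line holds the column names
                .option("inferSchema", "true")  // optional: infer numeric column types
                .csv(path);
    }

    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder()
                .appName("header-option-sketch")
                .master("local[*]")             // assumption: local run, for illustration only
                .getOrCreate();

        // Hypothetical sample data: one header line followed by two data lines.
        Path csv = Files.createTempFile("points", ".csv");
        Files.write(csv, Arrays.asList("label,x1,x2", "1.0,0.5,2.0", "0.0,1.5,3.0"));

        Dataset<Row> modelDS = readWithHeader(spark, csv.toString());

        // The header is already gone, so no first()/filter() step is needed.
        JavaRDD<Row> dataPoints = modelDS.toJavaRDD();

        System.out.println(Arrays.toString(modelDS.columns()));
        System.out.println(dataPoints.count());

        spark.stop();
    }
}
```

From there you can map each Row of dataPoints to a LabeledPoint as before. If the Dataset is not built from a CSV file, this option does not apply and the header would have to be handled wherever the Dataset is created.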

On Wed, Aug 3, 2016 at 10:14 PM, Carlo.Allocca <carlo.allo...@open.ac.uk>
wrote:

> Hi All,
>
> I would like to apply a regression to my data. One step of the workflow is to
> prepare my data as a JavaRDD<LabeledPoint>, starting from a Dataset<Row>
> with its header. So, what I did was the following:
>
> == Step 1: transform the Dataset<Row> into a JavaRDD<Row>
> JavaRDD<Row> dataPointsWithHeader = modelDS.toJavaRDD();
>
> == Step 2: take the first row (I was thinking that it was the header)
> Row header = dataPointsWithHeader.first();
>
> == Step 3: eliminate the header row by
> JavaRDD<Row> dataPointsWithoutHeader =
>         dataPointsWithHeader.filter((Row row) -> !row.equals(header));
>
> The issues with the above approach are:
>
> a) the result of Step 2 is not the header row;
> b) Step 3 is very inefficient if there is a way to access the header
> directly.
>
> My question is:
>
> Is there an efficient way to access the header and eliminate it?
>
> Many thanks in advance for your help and suggestions.
>
> Regards,
> Carlo
