Re: Dataset and JavaRDD: how to eliminate the header.

Mich Talebzadeh Wed, 03 Aug 2016 10:45:46 -0700

Do you know the headers?

Can you use filter to get rid of the header from both CSV files before
joining them?
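
To make the idea concrete, here is a minimal sketch of "drop the known header line from each input, then join". It is plain Java over in-memory lists rather than Spark, so it runs on its own. The class name, method names and sample rows are all hypothetical. With Spark, the same predicate could be passed to JavaRDD.filter on the raw lines. Note also that, as far as I know, reading with the header option set to "true" already makes the reader consume the header line, so in that case the Dataset should contain no header row at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java stand-in for the Spark pipeline; all names here are hypothetical.
final class HeaderFilterSketch {

    // Keep every line except those equal to the known header text.
    static List<String> dropHeader(List<String> lines, String header) {
        return lines.stream()
                .filter(line -> !line.equals(header))
                .collect(Collectors.toList());
    }

    // Inner-join two header-free inputs on their first comma-separated field.
    static List<String> joinOnFirstField(List<String> left, List<String> right) {
        Map<String, String> rightByKey = new HashMap<>();
        for (String line : right) {
            String[] parts = line.split(",", 2);
            rightByKey.put(parts[0], parts.length > 1 ? parts[1] : "");
        }
        List<String> joined = new ArrayList<>();
        for (String line : left) {
            String key = line.split(",", 2)[0];
            if (rightByKey.containsKey(key)) {
                joined.add(line + "," + rightByKey.get(key));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample data: the first line of each list is its header.
        List<String> d1 = Arrays.asList("category,rank", "books,1", "music,2");
        List<String> d2 = Arrays.asList("category,score", "books,0.9", "music,0.4");

        // Remove the headers first, so they can never show up in the join.
        List<String> joined = joinOnFirstField(
                dropHeader(d1, "category,rank"),
                dropHeader(d2, "category,score"));
        joined.forEach(System.out::println);
    }
}
```

Filtering before the join matters because row order after a join is not something to rely on, so first() on the joined data need not return anything resembling a header.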




Dr Mich Talebzadeh






http://talebzadehmich.wordpress.com





On 3 August 2016 at 18:32, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:

> Hi Aseem,
>
> Thank you very much for your help.
>
> Please allow me to be more specific about my case (to some extent I already
> do what you suggested):
>
> Let us imagine that I have two CSV datasets, d1 and d2. I generate each
> Dataset<Row> as follows:
>
> == Reading d1:
>
> sparkSession=spark;
>
>         options = new HashMap();
>         options.put("header", "true");
>         options.put("delimiter", delimiter);
>         options.put("nullValue", nullValue);
>         DataFrameReader d1_DFR = spark.read().options(options);
>         this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);
>
> == Reading d2
>
> sparkSession=spark;
>
>         options = new HashMap();
>         options.put("header", "true");
>         options.put("delimiter", delimiter);
>         options.put("nullValue", nullValue);
>         DataFrameReader d2_DFR = spark.read().options(options);
>         this.dataset2 = d2_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);
>
>
> So far, I have the header set to true.
>
> Now, let us imagine that we need to join the two datasets:
>
> Dataset<Row> dataset1_Join_dataset2 = dataset1.join(dataset2, "some condition");
>
> All the steps below (Step 1, Step 2 and Step 3) start from
> dataset1_Join_dataset2. In particular, this is what I found for these steps:
>
> == Step 1: transform the Dataset<Row> into JavaRDD<Row>
>>         JavaRDD<Row> dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD();
>
>
> == Step 2: take the first row (I was thinking that it was the header)
>> Row header= dataPointsWithHeader.first();
>>
>
> However, the row returned by first() is not the header.
>
>  So my question still is:
>
>
>> Is there an efficient way to access the header and eliminate it?
>>
>
> Many Thanks in advance for your support.
>
> Best Regards,
> Carlo
>
>
>
>
> On 3 Aug 2016, at 18:13, Aseem Bansal <asmbans...@gmail.com> wrote:
>
> Hi
>
> Depending on how you are reading the data in the first place, can you
> simply treat the header as a header instead of as a row?
>
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq)
>
> See the header option
>
> On Wed, Aug 3, 2016 at 10:14 PM, Carlo.Allocca <carlo.allo...@open.ac.uk>
> wrote:
>
>> Hi All,
>>
>> I would like to apply a regression to my data. One step of the workflow is
>> to prepare my data as a JavaRDD<LabeledPoint>, starting from a
>> Dataset<Row> with its header. So, what I did was the following:
>>
>> == Step 1: transform the Dataset<Row>  into JavaRDD<Row>
>>         JavaRDD<Row> dataPointsWithHeader =modelDS.toJavaRDD();
>>
>>
>> == Step 2: take the first row (I was thinking that it was the header)
>> Row header= dataPointsWithHeader.first();
>>
>> == Step 3: eliminate the row header by
>> JavaRDD<Row> dataPointsWithoutHeader = dataPointsWithHeader.filter((Row
>> row) -> {
>>                 return !row.equals(header);
>>             });
>>
>> The issues with the above approach are:
>>
>> a) the result of Step 2 is not the header row;
>> b) Step 3 is very inefficient if there is a direct way to access the
>> header.
>>
>> My question is:
>>
>> Is there an efficient way to access the header and eliminate it?
>>
>> Many Thanks in advance for your help and suggestion.
>>
>> Regards,
>> Carlo
>>
>>
>
>
