Re: pyspark 1.4 udf change date values

Luis Guerra Fri, 17 Jul 2015 03:30:07 -0700

Sure, I have created JIRA SPARK-9131 - UDF change data values
<https://issues.apache.org/jira/browse/SPARK-9131>


On Thu, Jul 16, 2015 at 7:09 PM, Davies Liu <[email protected]> wrote:

> Thanks for reporting this. Could you file a JIRA for it?
>
> On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra <[email protected]> wrote:
> > Hi all,
> >
> > I am having some trouble using a custom UDF on dataframes with
> > pyspark 1.4.
> >
> > I have rewritten the UDF to simplify the problem, and it gets even
> > weirder. The UDFs I am using do absolutely nothing: they just receive
> > a value and return the same value in the same format.
> >
> > My code is below:
> >
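> > from pyspark.sql.functions import UserDefinedFunction
> > from pyspark.sql.types import DateType
> >
> > c = a.join(b, a['ID'] == b['ID_new'], 'inner')
> >
> > c.filter(c['ID'] == 'XX').show()
> >
> > # Identity UDFs: each returns its input unchanged, declared as DateType.
> > udf_A = UserDefinedFunction(lambda x: x, DateType())
> > udf_B = UserDefinedFunction(lambda x: x, DateType())
> > udf_C = UserDefinedFunction(lambda x: x, DateType())
> >
> > # The original post referenced 'vinc_muestra' here, which appears to be
> > # the pre-simplification name of the joined dataframe c.
> > d = c.select(c['ID'], c['t1'].alias('ta'),
> >              udf_A(c['t2']).alias('tb'),
> >              udf_B(c['t1']).alias('tc'),
> >              udf_C(c['t2']).alias('td'))
> >
> > d.filter(d['ID'] == 'XX').show()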
> >
> > Here are the outputs of the two show() calls:
> >
> > +----------------+----------------+----------+----------+
> > |              ID|          ID_new|        t1|        t2|
> > +----------------+----------------+----------+----------+
> > |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> > |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> > |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> > |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> > |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> > |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> > |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> > |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> > +----------------+----------------+----------+----------+
> >
> > +----------------+----------+--------+--------+--------+
> > |              ID|        ta|      tb|      tc|      td|
> > +----------------+----------+--------+--------+--------+
> > |6000000002698917|2012-02-28|20070305|20030305|20140228|
> > |6000000002698917|2012-02-20|20070215|20020215|20130220|
> > |6000000002698917|2012-02-28|20070310|20050310|20140228|
> > |6000000002698917|2012-02-20|20070305|20030305|20130220|
> > |6000000002698917|2012-02-20|20130802|20130102|20130220|
> > |6000000002698917|2012-02-28|20070215|20020215|20140228|
> > |6000000002698917|2012-02-28|20070215|20020215|20140228|
> > |6000000002698917|2012-02-20|20140102|20130102|20130220|
> > +----------------+----------+--------+--------+--------+
> >
> > My problem is that the values in columns 'tb', 'tc' and 'td' of
> > dataframe 'd' are completely different from the values of 't1' and
> > 't2' in dataframe 'c', even though my UDFs do nothing. It looks as if
> > the values were somehow taken from other records (or just invented).
> > The results differ between executions (apparently at random).
> >
> > Any insight on this?
> >
> > Thanks in advance
> >
>
