Hi all,

I am having trouble using a custom UDF on DataFrames with
PySpark 1.4.

I have rewritten the UDF to simplify the problem, and it gets even weirder.
The UDFs I am using do absolutely nothing: they just receive a value and
return the same value in the same format.

My code is below:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

# Inner join of a and b on matching IDs
c = a.join(b, a['ID'] == b['ID_new'], 'inner')

c.filter(c['ID'] == 'XX').show()

# Identity UDFs: each returns its input unchanged, declared as DateType
udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'],
             c['t1'].alias('ta'),
             udf_A(c['t2']).alias('tb'),
             udf_B(c['t1']).alias('tc'),
             udf_C(c['t2']).alias('td'))

d.filter(d['ID'] == 'XX').show()

Here are the outputs of the two show() calls:

+----------------+----------------+----------+----------+
|              ID|          ID_new|        t1|        t2|
+----------------+----------------+----------+----------+
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
+----------------+----------------+----------+----------+

+----------------+----------+--------+--------+--------+
|              ID|        ta|      tb|      tc|      td|
+----------------+----------+--------+--------+--------+
|6000000002698917|2012-02-28|20070305|20030305|20140228|
|6000000002698917|2012-02-20|20070215|20020215|20130220|
|6000000002698917|2012-02-28|20070310|20050310|20140228|
|6000000002698917|2012-02-20|20070305|20030305|20130220|
|6000000002698917|2012-02-20|20130802|20130102|20130220|
|6000000002698917|2012-02-28|20070215|20020215|20140228|
|6000000002698917|2012-02-28|20070215|20020215|20140228|
|6000000002698917|2012-02-20|20140102|20130102|20130220|
+----------------+----------+--------+--------+--------+

My problem is that the values in columns 'tb', 'tc' and 'td' of dataframe
'd' are completely different from the values of 't1' and 't2' in dataframe
'c', even though my UDFs do nothing. It looks as if the values were somehow
taken from other records (or just made up). The results also differ between
executions, apparently at random.
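
A minimal sketch of how the mismatch could be checked programmatically
rather than by eye, assuming the rows have already been collected to the
driver as dicts. The helper name find_mismatches and the sample values are
hypothetical and only for illustration; the column names follow the aliases
above, where 'ta' is the raw t1 and 'tc' is the identity UDF applied to t1.

```python
from datetime import date


def find_mismatches(rows, pairs):
    """Return (source, output, source_value, output_value) for every row
    where an identity-UDF output column differs from its source column.

    rows  -- iterable of dict-like records
    pairs -- list of (source_column, udf_output_column) names
    """
    bad = []
    for row in rows:
        for src, out in pairs:
            if row[src] != row[out]:
                bad.append((src, out, row[src], row[out]))
    return bad


# Hypothetical sample: the first row is consistent, the second is not.
sample = [
    {"ta": date(2012, 2, 28), "tc": date(2012, 2, 28)},
    {"ta": date(2012, 2, 20), "tc": date(2003, 3, 5)},
]

for src, out, expected, actual in find_mismatches(sample, [("ta", "tc")]):
    print("%s=%s but %s=%s" % (src, expected, out, actual))
```

With PySpark, the rows could presumably be obtained with something like
[r.asDict() for r in d.filter(d['ID'] == 'XX').collect()], which is only
reasonable for a small filtered sample, not for a large DataFrame.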

Any insight on this?

Thanks in advance
