Thanks Davies. HiveContext seems neat to use :)

On Thu, Aug 20, 2015 at 3:02 PM, Davies Liu <dav...@databricks.com> wrote:
> As Aram said, there are two options in Spark 1.4:
>
> 1) Use the HiveContext; then you get datediff from Hive:
>    df.selectExpr("datediff(d2, d1)")
> 2) Use a Python UDF:
> ```
> >>> from datetime import date
> >>> from pyspark.sql.functions import udf
> >>> from pyspark.sql.types import IntegerType
> >>> df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))], ['d1', 'd2'])
> >>> diff = udf(lambda a, b: (a - b).days, IntegerType())
> >>> df.select(diff(df.d1, df.d2)).show()
> +-------------------------+
> |PythonUDF#<lambda>(d1,d2)|
> +-------------------------+
> |                      -39|
> +-------------------------+
> ```
>
> On Thu, Aug 20, 2015 at 7:45 AM, Aram Mkrtchyan
> <aram.mkrtchyan...@gmail.com> wrote:
> > Hi,
> >
> > hope this will help you
> >
> > import org.apache.spark.sql.functions._
> > import org.joda.time.{DateTime, Days}  // needed for Days.daysBetween below
> > import sqlContext.implicits._
> > import java.sql.Timestamp
> >
> > val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")
> >
> > val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
> >   Days.daysBetween(new DateTime(value2.getTime),
> >     new DateTime(value1.getTime)).getDays)
> > df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()
> >
> > or you can write a SQL query using HiveQL's datediff function:
> > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> >
> > On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel <dhaval1...@gmail.com> wrote:
> >>
> >> More update on this question... I am using Spark 1.4.1.
> >>
> >> I was just reading the documentation for Spark 1.5 (still in development)
> >> and I think there will be a new function *datediff* that will solve the
> >> issue. So please let me know if there is any work-around until Spark 1.5
> >> is out :).
> >>
> >> pyspark.sql.functions.datediff(end, start)[source]
> >>
> >> Returns the number of days from start to end.
> >>
> >> >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
> >> >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
> >> [Row(diff=32)]
> >>
> >> New in version 1.5.
> >>
> >>
> >> On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel <dhaval1...@gmail.com>
> >> wrote:
> >>>
> >>> Apologies, sent too early accidentally. Actual message is below
> >>> ========================================================
> >>>
> >>> A dataframe has 2 date columns (datetime type) and I would like to add
> >>> another column with the difference between these two dates. A dataframe
> >>> snippet is below.
> >>>
> >>> new_df.show(5)
> >>> +-----------+----------+--------------+
> >>> |      PATID|   SVCDATE|next_diag_date|
> >>> +-----------+----------+--------------+
> >>> |12345655545|2012-02-13|    2012-02-13|
> >>> |12345655545|2012-02-13|    2012-02-13|
> >>> |12345655545|2012-02-13|    2012-02-27|
> >>> +-----------+----------+--------------+
> >>>
> >>> Here is what I have tried so far:
> >>>
> >>> -> new_df.withColumn('SVCDATE2',
> >>>        (new_df.next_diag_date - new_df.SVCDATE)).show()
> >>>    Error: DateType does not support numeric operations
> >>>
> >>> -> new_df.withColumn('SVCDATE2',
> >>>        (new_df.next_diag_date - new_df.SVCDATE).days).show()
> >>>    Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
> >>>
> >>> However, this simple Python code works fine with PySpark:
> >>>
> >>> from datetime import date
> >>> d0 = date(2008, 8, 18)
> >>> d1 = date(2008, 9, 26)
> >>> delta = d0 - d1
> >>> print (d0 - d1).days
> >>> # -39
> >>>
> >>> Any suggestions would be appreciated! Also, is there a way to add a new
> >>> column to a dataframe without using a column expression (e.g. like in
> >>> pandas or R: df$new_col = 'new col value')?
> >>>
> >>> Thanks,
> >>> Dhaval
> >>>
> >>>
> >>> On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel <dhaval1...@gmail.com>
> >>> wrote:
> >>>>
> >>>> new_df.withColumn('SVCDATE2',
> >>>>     (new_df.next_diag_date - new_df.SVCDATE).days).show()
> >>>>
> >>>> +-----------+----------+--------------+
> >>>> |      PATID|   SVCDATE|next_diag_date|
> >>>> +-----------+----------+--------------+
> >>>> |12345655545|2012-02-13|    2012-02-13|
> >>>> |12345655545|2012-02-13|    2012-02-13|
> >>>> |12345655545|2012-02-13|    2012-02-27|
> >>>> +-----------+----------+--------------+
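
[Editor's note] For anyone puzzling over the two different numbers in this thread (-39 vs. 32), all the UDF approaches above reduce to plain Python date subtraction. A quick standalone check with the stdlib only, no Spark needed; the `datediff` function here is a hypothetical stand-in that mirrors the argument order documented for Spark 1.5's `datediff(end, start)`:

```python
from datetime import date

def datediff(end, start):
    # Days from start to end, matching Spark 1.5's datediff(end, start).
    return (end - start).days

# The Spark 1.5 doc example: datediff(d2, d1) with
# d1 = 2015-04-08 and d2 = 2015-05-10
print(datediff(date(2015, 5, 10), date(2015, 4, 8)))   # 32

# Davies' UDF computes (d1 - d2).days, i.e. end and start swapped,
# hence the negative result in his output
print(datediff(date(2008, 8, 18), date(2008, 9, 26)))  # -39
```

So the sign of the result depends only on which column you pass first; the Scala example in the thread has the same property via the order of the `DateTime` arguments to `Days.daysBetween`.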