Hi Stuti,

The features should be standardized before training the model. Currently
AFTSurvivalRegression does not support standardization. Here is a
workaround in the meantime; I will send a PR to fix this soon.

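    import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
    import org.apache.spark.ml.regression.AFTSurvivalRegression
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DoubleType

    val ovarian = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load("......")
      .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

    // Assemble the four covariates into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
      .setOutputCol("features")

    // Keep only the columns AFT needs, with censor and label cast to double.
    val ovarian2 = assembler.transform(ovarian)
      .select(col("censor").cast(DoubleType),
        col("label").cast(DoubleType), col("features"))

    // Standardize the features before fitting.
    val standardScaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("standardized_features")
    val ssModel = standardScaler.fit(ovarian2)
    val ovarian3 = ssModel.transform(ovarian2)

    val aft = new AFTSurvivalRegression()
      .setFeaturesCol("standardized_features")

    val model = aft.fit(ovarian3)

    // Divide each coefficient by its feature's standard deviation to get
    // coefficients on the original (unscaled) feature scale.
    val newCoefficients =
      model.coefficients.toArray.zip(ssModel.std.toArray).map { x =>
        x._1 / x._2
      }
    println(newCoefficients.toSeq.mkString(","))
    println(model.intercept)
    println(model.scale)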
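
To see why dividing by the standard deviation gives coefficients on the
original feature scale, here is a minimal plain-Scala sketch (no Spark
needed). It assumes StandardScaler's default behaviour of scaling by the
standard deviation without centering on the mean, so the intercept needs no
adjustment. The std and coefficient values are made up for illustration and
are not results from fitting the ovarian data; RescaleCheck, dot and
toOriginalScale are hypothetical names.

```scala
object RescaleCheck {
  // Dot product of two equal-length arrays.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Convert coefficients fitted on x / std into coefficients for raw x.
  def toOriginalScale(coef: Array[Double], std: Array[Double]): Array[Double] =
    coef.zip(std).map { case (c, s) => c / s }

  def main(args: Array[String]): Unit = {
    // One raw feature row (age, resid_ds, rx, ecog_ps) from the sample data.
    val x = Array(72.3315, 2.0, 1.0, 1.0)
    // Assumed per-feature standard deviations (illustrative only).
    val std = Array(10.0, 0.5, 0.5, 0.5)
    // Assumed coefficients fitted on the standardized features (illustrative only).
    val scaledCoef = Array(-0.8, 0.3, 0.25, 0.05)

    val xScaled = x.zip(std).map { case (v, s) => v / s }
    val origCoef = toOriginalScale(scaledCoef, std)

    // Both linear predictors agree up to floating-point rounding.
    val onScaled = dot(scaledCoef, xScaled)
    val onRaw = dot(origCoef, x)
    assert(math.abs(onScaled - onRaw) < 1e-9)
    println(s"scaled: $onScaled, raw: $onRaw")
  }
}
```

Since the linear predictor is the same either way, the intercept and scale
reported by the model can be used unchanged with the rescaled coefficients.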

Yanbo

2016-02-15 16:07 GMT+08:00 Yanbo Liang <yblia...@gmail.com>:

> Hi Stuti,
>
> This is a bug in AFTSurvivalRegression: we did not handle "lossSum ==
> infinity" properly.
> I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track
> this issue and will send a PR.
> Thanks for reporting this issue.
>
> Yanbo
>
> 2016-02-12 15:03 GMT+08:00 Stuti Awasthi <stutiawas...@hcl.com>:
>
>> Hi All,
>>
>> I wanted to try Survival Analysis on Spark 1.6. I am able to run the
>> provided AFT example successfully. I then tried to train the model on the
>> Ovarian data, a standard dataset that comes with the Survival library in R.
>>
>> Default column names: Futime,fustat,age,resid_ds,rx,ecog_ps
>>
>> Here are the steps I followed:
>>
>> - Loaded the data from CSV into a DataFrame, renaming the columns:
>>
>>     val ovarian_data = sqlContext.read
>>       .format("com.databricks.spark.csv")
>>       .option("header", "true") // Use first line of all files as header
>>       .option("inferSchema", "true") // Automatically infer data types
>>       .load("Ovarian.csv")
>>       .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")
>>
>> - Used VectorAssembler to create features from "age", "resid_ds", "rx",
>>   "ecog_ps":
>>
>>     val assembler = new VectorAssembler()
>>       .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
>>       .setOutputCol("features")
>>
>> - Created a new DataFrame with only 3 columns:
>>
>>     val training = finalDf.select("label", "censor", "features")
>>
>> - Finally, passed it to AFT:
>>
>>     val model = aft.fit(training)
>>
>> I am getting the following error:
>>
>> java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
>>        at scala.Predef$.assert(Predef.scala:179)
>>        at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
>>        at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
>>        at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
>>        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>>        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>>        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>
>> I have tried printing the schema:
>>
>> root
>> |-- label: double (nullable = true)
>> |-- censor: double (nullable = true)
>> |-- features: vector (nullable = true)
>>
>> Sample rows of the training data look like:
>>
>> [59.0,1.0,[72.3315,2.0,1.0,1.0]]
>> [115.0,1.0,[74.4932,2.0,1.0,1.0]]
>> [156.0,1.0,[66.4658,2.0,1.0,2.0]]
>> [421.0,0.0,[53.3644,2.0,2.0,1.0]]
>> [431.0,1.0,[50.3397,2.0,1.0,1.0]]
>>
>> I cannot make sense of the error: if I use the same data and create the
>> DenseVector as in the AFT sample example, the code works completely fine.
>> But I would like to read the data from a CSV file and then proceed.
>>
>> Please suggest.
>>
>> Thanks & Regards,
>> Stuti Awasthi
>
>
