Hi Stuti,

The features should be standardized before training the model.
AFTSurvivalRegression does not currently support standardization.
Here is a workaround, and I will send a PR to fix this soon.

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.regression.AFTSurvivalRegression
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val ovarian = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("......")
  .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

// Assemble the four covariates into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
  .setOutputCol("features")

val ovarian2 = assembler.transform(ovarian)
  .select(col("censor").cast(DoubleType), col("label").cast(DoubleType), col("features"))

// Scale each feature to unit standard deviation before fitting.
val standardScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("standardized_features")

val ssModel = standardScaler.fit(ovarian2)
val ovarian3 = ssModel.transform(ovarian2)

val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features")
val model = aft.fit(ovarian3)

// Map the coefficients back to the scale of the original features.
val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x =>
  x._1 / x._2
}
println(newCoefficients.toSeq.mkString(","))
println(model.intercept)
println(model.scale)

Yanbo

2016-02-15 16:07 GMT+08:00 Yanbo Liang <yblia...@gmail.com>:

> Hi Stuti,
>
> This is a bug in AFTSurvivalRegression: we did not handle "lossSum ==
> infinity" properly.
> I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track
> this issue and will send a PR.
> Thanks for reporting this issue.
>
> Yanbo
>
> 2016-02-12 15:03 GMT+08:00 Stuti Awasthi <stutiawas...@hcl.com>:
>
>> Hi All,
>>
>> I wanted to try Survival Analysis on Spark 1.6. I was able to run the
>> AFT example provided successfully. I then tried to train the model with
>> the Ovarian data, a standard dataset that comes with the survival
>> library in R.
>>
>> Default column names: Futime,fustat,age,resid_ds,rx,ecog_ps
>>
>> Here are the steps I have done:
>>
>> · Loaded the data from CSV into a dataframe, renaming the columns:
>>
>> val ovarian_data = sqlContext.read
>>   .format("com.databricks.spark.csv")
>>   .option("header", "true") // Use first line of all files as header
>>   .option("inferSchema", "true") // Automatically infer data types
>>   .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds",
>>     "rx", "ecog_ps")
>>
>> · Used VectorAssembler() to create features from "age",
>> "resid_ds", "rx", "ecog_ps":
>>
>> val assembler = new VectorAssembler()
>>   .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
>>   .setOutputCol("features")
>>
>> · Then I created a new dataframe with only 3 columns:
>>
>> val training = finalDf.select("label", "censor", "features")
>>
>> · Finally I pass it to AFT:
>>
>> val model = aft.fit(training)
>>
>> I am getting this error:
>>
>> java.lang.AssertionError: assertion failed: AFTAggregator loss sum is
>> infinity. Error for unknown reason.
>>   at scala.Predef$.assert(Predef.scala:179)
>>   at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
>>   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
>>   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
>>   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>>   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>
>> I have tried to print the schema:
>>
>> root
>>  |-- label: double (nullable = true)
>>  |-- censor: double (nullable = true)
>>  |-- features: vector (nullable = true)
>>
>> Sample training data looks like:
>>
>> [59.0,1.0,[72.3315,2.0,1.0,1.0]]
>> [115.0,1.0,[74.4932,2.0,1.0,1.0]]
>> [156.0,1.0,[66.4658,2.0,1.0,2.0]]
>> [421.0,0.0,[53.3644,2.0,2.0,1.0]]
>> [431.0,1.0,[50.3397,2.0,1.0,1.0]]
>>
>> I am not able to understand the error: if I use the same data and
>> create the dense vectors as in the AFT sample example, the code works
>> completely fine. But I would like to read the data from a CSV file and
>> then proceed.
>>
>> Please suggest.
>>
>> Thanks & Regards
>> Stuti Awasthi
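
A note on the coefficient rescaling step in the workaround at the top of this message, offered as a sketch rather than a statement about Spark internals. It assumes StandardScaler is left at its defaults (withStd = true, withMean = false), which the workaround code does not override, so each feature is only divided by its standard deviation. The symbols x_j, sigma_j, beta_j and beta'_j are notation introduced here for illustration: beta'_j is the j-th entry of model.coefficients (fitted on the standardized features) and sigma_j is the j-th entry of ssModel.std.

```latex
x'_j = \frac{x_j}{\sigma_j}
\quad\Longrightarrow\quad
\sum_j \beta'_j \, x'_j = \sum_j \frac{\beta'_j}{\sigma_j} \, x_j
\quad\Longrightarrow\quad
\beta_j = \frac{\beta'_j}{\sigma_j}
```

Under that assumption no mean is subtracted from the features, so the linear predictor is unchanged by the rewrite and the intercept and scale need no adjustment. This matches the workaround, which divides only the coefficients by ssModel.std and prints model.intercept and model.scale as they are.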