[
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15719863#comment-15719863
]
Takeshi Yamamuro commented on SPARK-18699:
------------------------------------------
you mean a query below and you'd like to load the second line only?
{code}
>> test.csv <<
1 0,2014-xx-xx
2 1,2014-01-01
scala> import org.apache.spark.sql.types._
scala> val schema = new StructType().add("a", IntegerType).add("b", DateType)
scala> spark.read.format("csv").schema(schema).load("test.csv").show
16/12/04 21:21:56 ERROR Executor: Exception in task 0.0 in stage 32.0 (TID 32)
java.lang.NumberFormatException: For input string: "xx"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at java.sql.Date.valueOf(Date.java:134)
at
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$9.apply$mcI$sp(CSVInferSchema.scala:290)
at
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$8.apply(CSVInferSchema.scala:290)
at
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$8.apply(CSVInferSchema.scala:290)
at scala.util.Try.getOrElse(Try.scala:79)
at
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:121)
at
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:90)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:173)
at
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:172)
{code}
> Spark CSV parsing types other than String throws exception when malformed
> -------------------------------------------------------------------------
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Jakub Nowacki
>
> If CSV is read and the schema contains any other type than String, exception
> is thrown when the string value in CSV is malformed; e.g. if the timestamp
> does not match the defined one, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
> at java.sql.Date.valueOf(Date.java:143)
> at
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
> at scala.util.Try.getOrElse(Try.scala:79)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
> at
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
> ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the
> value or drop the line, but instead they kill the job.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]