I'm trying to read in some JSON, infer a schema, and write it out again as parquet to S3 (s3a). For some reason, about a third of the way through the writing portion of the run, Spark always errors out with the error included below. I can't find any obvious cause: the job isn't running out of memory, there are no long GC pauses, and there don't seem to be any additional error messages in the logs of the individual executors.
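For context, the write that blows up is just the standard DataFrame parquet write to an s3a:// path. Stripped of the schema-inference steps shown further down, the round trip is essentially the following sketch (the bucket and paths here are illustrative, not my real ones):

// Minimal sketch of the failing round trip (illustrative s3a paths).
// The real job does the schema inference shown below before writing.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

// read newline-delimited JSON from s3a ...
val df = spark.read.json("s3a://some-bucket/input/")

// ... and write it back out as parquet; this is the stage where the task commit fails
df.write.parquet("s3a://some-bucket/output/")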
The script runs fine on another set of data I have, which is of a very similar structure but several orders of magnitude smaller. I am running Spark 2.0.1 (pre-built for Hadoop 2.7) and am using the FileOutputCommitter; the algorithm version doesn't seem to matter. This does not appear to be a problem with badly formed JSON or corrupted files: I have unzipped and read in each file individually with no error.

Here's a simplified version of the script:

object Foo {

  def parseJson(json: String): Option[Map[String, Any]] = {
    if (json == null) Some(Map())
    else parseOpt(json).map((j: JValue) => j.values.asInstanceOf[Map[String, Any]])
  }
}

// read in as text and parse json using json4s
val jsonRDD: RDD[String] = sc.textFile(inputPath)
  .map(row => Foo.parseJson(row))

// infer a schema that will encapsulate the most rows in a sample of size sampleRowNum
val schema: StructType = Infer.getMostCommonSchema(sc, jsonRDD, sampleRowNum)

// check each document's compatibility with the schema
val jsonWithCompatibilityRDD: RDD[(String, Boolean)] = jsonRDD
  .map(js => (js, Infer.getSchemaCompatibility(schema, Infer.inferSchema(js)).toBoolean))
  .repartition(partitions)

val jsonCompatibleRDD: RDD[String] = jsonWithCompatibilityRDD
  .filter { case (js: String, compatible: Boolean) => compatible }
  .map { case (js: String, _: Boolean) => js }

// create a dataframe from documents with a compatible schema
val dataFrame: DataFrame = spark.read.schema(schema).json(jsonCompatibleRDD)

It completes the earlier schema-inference steps successfully. The error itself occurs on the last line, but I suppose that could encompass at least the immediately preceding statement, if not earlier:

org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to commit task
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$commitTask$1(WriterContainer.scala:275)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:257)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    ... 8 more
    Suppressed: java.lang.NullPointerException
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
        at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetFileFormat.scala:569)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$abortTask$1(WriterContainer.scala:282)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$2.apply$mcV$sp(WriterContainer.scala:258)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1354)
        ... 9 more
Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
    at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
    at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1147)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1136)
    at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:400)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:117)
    at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetFileFormat.scala:569)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$commitTask$1(WriterContainer.scala:267)
    ... 13 more
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
    at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
    at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
    at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
    at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
    at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
    at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
    at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
    ... 29 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2; XML document structures must start and end within the same entity.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
    at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
    ... 35 more

Here's my conf:

spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory  16G
spark.executor.uri  https://s3.amazonaws.com/foo/spark-2.0.1-bin-hadoop2.7.tgz
spark.hadoop.fs.s3a.impl  org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
spark.hadoop.fs.s3a.connection.timeout  500000
spark.hadoop.fs.s3n.multipart.uploads.enabled  true
spark.hadoop.parquet.block.size  2147483648
spark.hadoop.parquet.enable.summary-metadata  false
spark.jars.packages  com.databricks:spark-avro_2.11:3.0.1
spark.local.dir  /raid0/spark
spark.mesos.coarse  false
spark.mesos.constraints  priority:1
spark.network.timeout  600
spark.rpc.message.maxSize  500
spark.speculation  false
spark.sql.parquet.mergeSchema  false
spark.sql.planner.externalSort  true
spark.submit.deployMode  client
spark.task.cpus  1
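Those properties are all set in spark-defaults.conf. In case it helps anyone reproduce this, the programmatic equivalent of the s3a/parquet-related settings would look roughly like the sketch below (the keys and values just mirror the conf above; the builder call is not my actual setup, which runs through spark-submit in client mode):

// Sketch: the s3a/parquet-related properties above applied on the session
// builder instead of spark-defaults.conf (values copied from my conf).
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.buffer.dir", "/raid0/spark")
  .config("spark.hadoop.fs.s3a.connection.timeout", "500000")
  .config("spark.hadoop.parquet.block.size", "2147483648")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("spark.sql.parquet.mergeSchema", "false")
  .getOrCreate()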