Hello!
I've been testing out Iceberg with Spark by writing some basic Java test 
classes, but I've hit an issue with one of the methods from PartitionSpec. 
In summary, I've got a small set of data that looks like this:

{"title":"Gone", "author": "Michael Grant", "published": 1541776051}
{"title":"Carry On", "author": "Rainbow Rowell", "published": 1536505651}
{"title":"Wayward Son", "author": "Rainbow Rowell", "published": 1504969651}

I would like to be able to partition this small set of data with:

PartitionSpec spec = PartitionSpec.builderFor(SCHEMA)
        .identity("author")
        .build();

However, when I try to run this simple test, I get the error:

java.lang.IllegalStateException: Already closed files for partition: author=Rainbow+Rowell
        at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:505)
        at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:476)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run


With partitioning on `author`, I was expecting the bottom two records to go 
into one folder and the first record to go into another. Is there anything 
here that I'm doing wrong? I'm really stumped as to why I'm getting an error 
when trying to partition on that column.

For reference, here’s the rest of what’s in the test class.

    Schema SCHEMA = new Schema(
        optional(1, "title", Types.StringType.get()),
        optional(2, "author", Types.StringType.get()),
        optional(3, "published", Types.TimestampType.withZone())
    );

    HadoopTables tables = new HadoopTables(CONF);
    PartitionSpec spec = PartitionSpec.builderFor(SCHEMA)
        .identity("author")
        .build();

    Table table = tables.create(SCHEMA, spec, location.toString());

    Dataset<Row> df = spark.read().json("src/test/resources/books.json");

    df.select(df.col("title"), df.col("author"),
            df.col("published").cast(DataTypes.TimestampType))
        .write()
        .format("iceberg")
        .mode("append")
        .save(location.toString());

    table.refresh();

I'm using the spark-runtime-jar generated from commit: 
57b10995aade1362e582cef68b20014023556501 (from Oct 3) and I'm using Spark 
version 2.4.4.

Thank you for your time!!

Christine Mathiesen
Software Engineer Intern
BDP – Hotels.com
Expedia Group
