I agree that we should follow up and address this. I'll open a PR with an
update for docs and we can explore other options as well.

I think that the main issue here is how tables are loaded when not using a
catalog. When using a catalog, tables are cached so that writes update the
in-memory reference in the session. The bad behavior happens when keeping
an independent reference to a non-catalog table that won't get updated (as
in Dongjoon's reported case) or when writing with an independent reference
and reading using a catalog table -- in which case `REFRESH TABLE` solves
the problem.
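A rough sketch of that second case, assuming a session-catalog table `db.t1` whose underlying location is also written through an independent path-based reference (table name and path are illustrative):

```scala
// Write through an independent, path-based reference; the catalog's
// cached table instance for db.t1 does not see this commit.
Seq(("a", 1), ("b", 2)).toDF("a", "b")
  .write.format("iceberg").mode("overwrite").save("/tmp/warehouse/db/t1")

spark.sql("SELECT * FROM db.t1").show  // may return stale results
spark.sql("REFRESH TABLE db.t1")       // reloads the cached table instance
spark.sql("SELECT * FROM db.t1").show  // reflects the overwrite
```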

Our [docs for Spark 3](
https://github.com/apache/iceberg/blob/master/site/docs/spark.md#querying-with-dataframes)
recommend the catalog-based methods when using the DataFrame readers and
writers, which avoids this problem. But people may still be using `load` and
`save` since those were the only options available in Spark 2.4, so if we can
convert those to use catalogs, that would be ideal.
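For reference, the catalog-based calls the docs recommend look roughly like this in Spark 3 (catalog and table names are placeholders, with `hive_prod` configured as an Iceberg catalog):

```scala
// Read through a catalog: Spark resolves the table via the catalog and
// uses the session's cached table instance.
val df = spark.table("hive_prod.db.t1")
// equivalently: spark.read.format("iceberg").load("hive_prod.db.t1")

// Write through the same catalog; the cached instance stays up to date.
Seq(("c", 3)).toDF("a", "b").writeTo("hive_prod.db.t1").append()
```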

On Mon, Jul 13, 2020 at 4:28 PM Anton Okolnychyi <aokolnyc...@apple.com>
wrote:

> +1 (binding)
>
> I think the issue that was brought up by Dongjoon is valid and we should
> document the current caching behavior.
> The problem is also more generic and does not apply only to views, as
> operations that happen through the source directly may not be propagated
> to the catalog.
>
> I think it is worth exploring SupportsCatalogOptions but that would be a
> substantial change. I’m in favor of releasing this RC so that people can
> start testing Iceberg with Spark 3.
>
> - Anton
>
>
> On 13 Jul 2020, at 11:55, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> One more thing: a work-around is to redefine the view. That discards the
> original logical plan and table and returns the expected result in Spark 3.
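>
> As a sketch, using the names from Dongjoon's example below:
>
> ```scala
> // Re-registering the temp view loads a fresh Table instance, so the new
> // logical plan sees data written after the original view was defined.
> spark.read.format("iceberg").load("/tmp/t1").createOrReplaceTempView("t2")
> sql("select * from t2").show  // now returns the updated rows
> ```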
>
> On Mon, Jul 13, 2020 at 11:53 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Dongjoon,
>>
>> Thanks for raising this issue. I did some digging, and the problem is
>> that in Spark 3.0, the logical plan saves a Table instance with the state
>> it had when it was loaded -- when the `createOrReplaceTempView` call
>> happened. That never gets refreshed, which is why you get stale data. In
>> 2.4, there is no "table" in Spark, so it gets reloaded every time.
>>
>> In the SQL path, this problem is avoided by table caching: the same
>> table instance is returned to Spark each time the table is loaded from
>> the catalog. The write will use that same table instance and keep it up
>> to date, and a call to `REFRESH TABLE` would also update it.
>>
>> I'm not sure of the right way to avoid this behavior. We purposely don't
>> refresh the table each time Spark plans a scan because we want the version
>> of a table that is used to be consistent. A query that performs a
>> self-join, for example, should always use the same version of the table. On
>> the other hand, when a table reference is held in a logical plan like this,
>> there is no mechanism to update it. Maybe each table instance should keep a
>> timer and refresh after some interval?
>>
>> Either way, I agree that this is something we should add to
>> documentation. I don't think that this should fail the release, but we
>> should fix it by the next one. Does that sound reasonable to you?
>>
>> On Sun, Jul 12, 2020 at 6:16 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> I verified the hash/sign/build/UT and manual testing with Apache Spark
>>> 2.4.6 (hadoop-2.7) and 3.0.0 (hadoop-3.2) on Apache Hive 2.3.7 metastore.
>>> (BTW, for spark-3.0.0-bin-hadoop3.2, I used Ryan's example command with
>>> `spark.sql.warehouse.dir` instead of `spark.warehouse.path`)
>>>
>>> 1. Iceberg 0.9 + Spark 2.4.6 works as expected.
>>> 2. Iceberg 0.9 + Spark 3.0.0 works as expected mostly, but views work
>>> differently from Iceberg 0.9 + Spark 2.4.6.
>>>
>>> The following is the example.
>>>
>>> ...
>>> scala> spark.read.format("iceberg").load("/tmp/t1").createOrReplaceTempView("t2")
>>>
>>> scala> sql("select * from t2").show
>>> +---+---+
>>> |  a|  b|
>>> +---+---+
>>> +---+---+
>>>
>>> scala> Seq(("a", 1), ("b", 2)).toDF("a", "b").write.format("iceberg").mode("overwrite").save("/tmp/t1")
>>>
>>> scala> sql("select * from t2").show // Iceberg 0.9 with Spark 2.4.6 shows the updated result correctly here.
>>> +---+---+
>>> |  a|  b|
>>> +---+---+
>>> +---+---+
>>>
>>> scala> spark.read.format("iceberg").load("/tmp/t1").show
>>> +---+---+
>>> |  a|  b|
>>> +---+---+
>>> |  a|  1|
>>> |  b|  2|
>>> +---+---+
>>>
>>> So far, I'm not sure which part (Iceberg or Spark) introduces this
>>> difference, but it would be nice to have some documentation about the
>>> difference between versions if this is by design.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> On Sun, Jul 12, 2020 at 12:32 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> Verified sigs/sums/license/build/test
>>>>
>>>> I did have an issue with the test metastore for the spark3 tests on the
>>>> first run, but couldn't replicate it in subsequent tests.
>>>>
>>>> -Dan
>>>>
>>>> On Fri, Jul 10, 2020 at 10:42 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> Verified checksums, ran tests, staged convenience binaries.
>>>>>
>>>>> I also ran a few tests using Spark 3.0.0 and Spark 2.4.5 and the
>>>>> runtime Jars. For anyone that would like to use spark-sql or spark-shell,
>>>>> here are the commands that I used:
>>>>>
>>>>> ~/Apps/spark-3.0.0-bin-hadoop2.7/bin/spark-sql \
>>>>>     --repositories https://repository.apache.org/content/repositories/orgapacheiceberg-1008 \
>>>>>     --packages org.apache.iceberg:iceberg-spark3-runtime:0.9.0 \
>>>>>     --conf spark.warehouse.path=$PWD/spark-warehouse \
>>>>>     --conf spark.hadoop.hive.metastore.uris=thrift://localhost:42745 \
>>>>>     --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
>>>>>     --conf spark.sql.catalog.spark_catalog.type=hive \
>>>>>     --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_prod.type=hive \
>>>>>     --conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hadoop_prod.type=hadoop \
>>>>>     --conf spark.sql.catalog.hadoop_prod.warehouse=$PWD/hadoop-warehouse
>>>>>
>>>>> ~/Apps/spark-2.4.5-bin-hadoop2.7/bin/spark-shell \
>>>>>     --repositories https://repository.apache.org/content/repositories/orgapacheiceberg-1008 \
>>>>>     --packages org.apache.iceberg:iceberg-spark-runtime:0.9.0 \
>>>>>     --conf spark.hadoop.hive.metastore.uris=thrift://localhost:42745 \
>>>>>     --conf spark.warehouse.path=$PWD/spark-warehouse
>>>>>
>>>>> The Spark 3 command sets up 3 catalogs:
>>>>>
>>>>>    - A wrapper around the built-in catalog, spark_catalog, that adds
>>>>>    support for Iceberg tables
>>>>>    - A hive_prod Iceberg catalog that uses the same metastore as the
>>>>>    session catalog
>>>>>    - A hadoop_prod Iceberg catalog that stores tables in a
>>>>>    hadoop-warehouse folder
>>>>>
>>>>> Everything worked great, except for a minor issue with CTAS, #1194
>>>>> <https://github.com/apache/iceberg/pull/1194>. I’m okay to release
>>>>> with that issue, but we can always build a new RC if anyone thinks it is a
>>>>> blocker.
>>>>>
>>>>> If you’d like to run tests in a downstream project, you can use the
>>>>> staged binary artifacts by adding this to your gradle build:
>>>>>
>>>>>   repositories {
>>>>>     maven {
>>>>>       name 'stagedIceberg'
>>>>>       url 'https://repository.apache.org/content/repositories/orgapacheiceberg-1008/'
>>>>>     }
>>>>>   }
>>>>>
>>>>>   ext {
>>>>>     icebergVersion = '0.9.0'
>>>>>   }
>>>>>
>>>>>
>>>>> On Fri, Jul 10, 2020 at 9:20 AM Ryan Murray <rym...@dremio.com> wrote:
>>>>>
>>>>>> 1. Verify the signature: OK
>>>>>> 2. Verify the checksum: OK
>>>>>> 3. Untar the archive tarball: OK
>>>>>> 4. Run RAT checks to validate license headers: RAT checks passed
>>>>>> 5. Build and test the project: all unit tests passed.
>>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> I did see that my build took >12 minutes and used 100% of all 8
>>>>>> cores & 32GB of memory (openjdk-8, Ubuntu 18.04), which I hadn't
>>>>>> noticed before.
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 4:37 AM OpenInx <open...@gmail.com> wrote:
>>>>>>
>>>>>>> I followed the verify guide here (
>>>>>>> https://lists.apache.org/thread.html/rd5e6b1656ac80252a9a7d473b36b6227da91d07d86d4ba4bee10df66%40%3Cdev.iceberg.apache.org%3E)
>>>>>>> :
>>>>>>>
>>>>>>> 1. Verify the signature: OK
>>>>>>> 2. Verify the checksum: OK
>>>>>>> 3. Untar the archive tarball: OK
>>>>>>> 4. Run RAT checks to validate license headers: RAT checks passed
>>>>>>> 5. Build and test the project: all unit tests passed.
>>>>>>>
>>>>>>> +1 (non-binding).
>>>>>>>
>>>>>>> On Fri, Jul 10, 2020 at 9:46 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I propose the following RC to be released as the official Apache
>>>>>>>> Iceberg 0.9.0 release.
>>>>>>>>
>>>>>>>> The commit id is 4e66b4c10603e762129bc398146e02d21689e6dd
>>>>>>>> * This corresponds to the tag: apache-iceberg-0.9.0-rc5
>>>>>>>> *
>>>>>>>> https://github.com/apache/iceberg/commits/apache-iceberg-0.9.0-rc5
>>>>>>>> * https://github.com/apache/iceberg/tree/4e66b4c1
>>>>>>>>
>>>>>>>> The release tarball, signature, and checksums are here:
>>>>>>>> *
>>>>>>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/
>>>>>>>>
>>>>>>>> You can find the KEYS file here:
>>>>>>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>>>>>>
>>>>>>>> Convenience binary artifacts are staged in Nexus. The Maven
>>>>>>>> repository URL is:
>>>>>>>> *
>>>>>>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1008/
>>>>>>>>
>>>>>>>> This release includes support for Spark 3 and vectorized reads for
>>>>>>>> flat schemas in Spark.
>>>>>>>>
>>>>>>>> Please download, verify, and test.
>>>>>>>>
>>>>>>>> Please vote in the next 72 hours.
>>>>>>>>
>>>>>>>> [ ] +1 Release this as Apache Iceberg 0.9.0
>>>>>>>> [ ] +0
>>>>>>>> [ ] -1 Do not release this because...
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
