Re: [DISCUSS] Hive Support

2025-01-06 Thread Wing Yew Poon
FYI --
It looks like the built-in Hive version in the master branch of Apache
Spark is 2.3.10 (https://issues.apache.org/jira/browse/SPARK-47018), and
https://issues.apache.org/jira/browse/SPARK-44114 (upgrade built-in Hive to
3+) is an open issue.


On Mon, Jan 6, 2025 at 1:07 PM Wing Yew Poon  wrote:

> Hi Peter,
> In Spark, you can specify the Hive version of the metastore that you want
> to use. There is a configuration, spark.sql.hive.metastore.version, which
> currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this
> default version are shipped with Spark as built-in. You can specify a
> different version and then specify spark.sql.hive.metastore.jars=path (the
> default is built-in) and spark.sql.hive.metastore.jars.path to point to
> jars for the Hive metastore version you want to use. What
> https://issues.apache.org/jira/browse/SPARK-45265 does is to allow 4.0.x
> to be supported as a spark.sql.hive.metastore.version. I haven't been
> following Spark 4, but I suspect that the built-in version is not changing
> to Hive 4.0. The built-in version is also used for other things that Spark
> may use from Hive (aside from interaction with HMS), such as Hive SerDes.
> See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> .
> - Wing Yew
>
>
> On Mon, Jan 6, 2025 at 2:04 AM Péter Váry 
> wrote:
>
>> Hi Manu,
>>
>> > Spark has only added hive 4.0 metastore support recently for Spark
>> 4.0[1] and there will be conflicts
>>
>> Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use
>> Hive 2 when it is present on the classpath, and fall back to the embedded
>> Hive 4 code only when older Hive versions are not on the classpath?
>>
>> > Firstly, upgrading from Hive 2 to Hive 4 is a huge change
>>
>> Is this a huge change even after we remove the Hive runtime module?
>>
>> After removing the Hive runtime module, we have 2 remaining Hive
>> dependencies:
>>
>>- HMS Client
>>   - The Thrift API should not change between Hive versions, so unless
>>   we start to use Hive 4-specific features we should be fine here -
>>   whatever version of Hive we use should work
>>   - Java API changes. We found that in Hive 2 and Hive 3 the HMSClient
>>   classes used different constructors, so we ended up using DynMethods
>>   to pick the appropriate constructor - if we pin a single Hive version
>>   here, then we won't need the DynMethods anymore
>>   - Based on our experience, even though Hive 3 itself doesn't support
>>   Java 11, the HMS Client for Hive 3 doesn't have any issues when used
>>   with Java 11
>>- Testing infrastructure
>>   - TestHiveMetastore creates and starts an HMS instance. This could be
>>   highly dependent on the version of Hive we are using. Since this is
>>   only test code, I expect that only our tests interact with it
>>
>> *@Manu*: You know more of the details here. Do we have HMSClient issues
>> when we use Hive 4 code? If I miss something in the listing above, please
>> correct me.
>>
>> Based on this, in an ideal world:
>>
>>- Hive would provide an HMS client jar which contains only the Java code
>>needed to connect to and communicate with an HMS instance over Thrift
>>(no internal HMS server code etc). We could use it as a dependency for our
>>iceberg-hive-metastore module, either setting a minimal version or using a
>>shaded embedded version. *@Hive* folks - is this a valid option? What
>>are the reasons that there is no metastore-client jar provided currently?
>>Would it be possible to generate one in a future Hive release? It seems
>>like a worthy feature to me.
>>- We would create our own version-dependent HMS infrastructure if we want
>>to support Spark versions that support older Hive versions.
>>
>> As a result of this, we could have:
>>
>>- Clean definition of which Hive version is supported
>>- Testing for the supported Hive versions
>>- Java 11 support
>>
>> As an alternative, we can create a testing matrix where some tests are run
>> with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for
>> older Spark versions that do not support Hive 4)
>>
>> Thanks Manu for driving this!
>> Peter
>>
>> Manu Zhang  ezt írta (időpont: 2025. jan. 5.,
>> V, 5:18):
>>
>>> This basically means that we need to support every exact Hive version
>>> used by Spark, and we need to exclude our own Hive version from the
>>> Spark runtime.
>>>
>>>
>>> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect
>>> compatibility to be much better once Iceberg and Spark are both on Hive 4.
>>>
>>> Secondly, the coupling can be loosened if we are moving toward the REST
>>> catalog.
>>>
>>> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry 
>>> wrote:
>>>
 That sounds really interesting in a bad way :) :(

 This basically means that we need to support 

Re: [DISCUSS] Hive Support

2025-01-06 Thread Wing Yew Poon
Hi Peter,
In Spark, you can specify the Hive version of the metastore that you want
to use. There is a configuration, spark.sql.hive.metastore.version, which
currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this
default version are shipped with Spark as built-in. You can specify a
different version and then specify spark.sql.hive.metastore.jars=path (the
default is built-in) and spark.sql.hive.metastore.jars.path to point to
jars for the Hive metastore version you want to use. What
https://issues.apache.org/jira/browse/SPARK-45265 does is to allow 4.0.x to
be supported as a spark.sql.hive.metastore.version. I haven't been
following Spark 4, but I suspect that the built-in version is not changing
to Hive 4.0. The built-in version is also used for other things that Spark
may use from Hive (aside from interaction with HMS), such as Hive SerDes.
See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html.
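Concretely, a Spark session pointed at a non-built-in metastore version can be
configured roughly as below; this is only a sketch, and the version number and
jar path are placeholder values for illustration, not recommendations.

```java
import org.apache.spark.sql.SparkSession;

public class HiveMetastoreVersionExample {
  public static void main(String[] args) {
    // Sketch only: point Spark at an external Hive metastore version instead of
    // the built-in one. "3.1.3" and the jar path are placeholder values.
    SparkSession spark = SparkSession.builder()
        .appName("hms-version-example")
        .config("spark.sql.hive.metastore.version", "3.1.3")
        .config("spark.sql.hive.metastore.jars", "path")
        .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")
        .enableHiveSupport()
        .getOrCreate();

    spark.sql("SHOW DATABASES").show();
    spark.stop();
  }
}
```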
- Wing Yew


On Mon, Jan 6, 2025 at 2:04 AM Péter Váry 
wrote:

> Hi Manu,
>
> > Spark has only added hive 4.0 metastore support recently for Spark
> 4.0[1] and there will be conflicts
>
> Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use
> Hive 2 when it is present on the classpath, and fall back to the embedded
> Hive 4 code only when older Hive versions are not on the classpath?
>
> > Firstly, upgrading from Hive 2 to Hive 4 is a huge change
>
> Is this a huge change even after we remove the Hive runtime module?
>
> After removing the Hive runtime module, we have 2 remaining Hive
> dependencies:
>
>- HMS Client
>   - The Thrift API should not change between Hive versions, so unless
>   we start to use Hive 4-specific features we should be fine here -
>   whatever version of Hive we use should work
>   - Java API changes. We found that in Hive 2 and Hive 3 the HMSClient
>   classes used different constructors, so we ended up using DynMethods
>   to pick the appropriate constructor - if we pin a single Hive version
>   here, then we won't need the DynMethods anymore
>   - Based on our experience, even though Hive 3 itself doesn't support
>   Java 11, the HMS Client for Hive 3 doesn't have any issues when used
>   with Java 11
>- Testing infrastructure
>   - TestHiveMetastore creates and starts an HMS instance. This could be
>   highly dependent on the version of Hive we are using. Since this is
>   only test code, I expect that only our tests interact with it
>
> *@Manu*: You know more of the details here. Do we have HMSClient issues
> when we use Hive 4 code? If I miss something in the listing above, please
> correct me.
>
> Based on this, in an ideal world:
>
>- Hive would provide an HMS client jar which contains only the Java code
>needed to connect to and communicate with an HMS instance over Thrift
>(no internal HMS server code etc). We could use it as a dependency for our
>iceberg-hive-metastore module, either setting a minimal version or using a
>shaded embedded version. *@Hive* folks - is this a valid option? What
>are the reasons that there is no metastore-client jar provided currently?
>Would it be possible to generate one in a future Hive release? It seems
>like a worthy feature to me.
>- We would create our own version-dependent HMS infrastructure if we want
>to support Spark versions that support older Hive versions.
>
> As a result of this, we could have:
>
>- Clean definition of which Hive version is supported
>- Testing for the supported Hive versions
>- Java 11 support
>
> As an alternative, we can create a testing matrix where some tests are run
> with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for
> older Spark versions that do not support Hive 4)
>
> Thanks Manu for driving this!
> Peter
>
> Manu Zhang  ezt írta (időpont: 2025. jan. 5., V,
> 5:18):
>
>>> This basically means that we need to support every exact Hive version
>>> used by Spark, and we need to exclude our own Hive version from the
>>> Spark runtime.
>>
>>
>> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect
>> compatibility to be much better once Iceberg and Spark are both on Hive 4.
>>
>> Secondly, the coupling can be loosened if we are moving toward the REST
>> catalog.
>>
>> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry 
>> wrote:
>>
>>> That sounds really interesting in a bad way :) :(
>>>
>>> This basically means that we need to support every exact Hive version
>>> used by Spark, and we need to exclude our own Hive version from the
>>> Spark runtime.
>>>
>>> On Thu, Dec 19, 2024, 04:00 Manu Zhang  wrote:
>>>
 Hi Peter,

> I think we should make sure that the Iceberg Hive version is
> independent from the version used by Spark

  I'm afraid that is not how it works currently. When Spark is deployed
 with hive libraries (I suppose thi

[DISCUSS] Support keeping at most N snapshots

2025-01-06 Thread Manu Zhang
Hi all,

While maintaining Iceberg tables for our customers, I find it's difficult
to set a default snapshot expiration time
(`history.expire.max-snapshot-age-ms`) for different workloads. The default
value of 5 days looks good for daily batch jobs but is too long for
frequently-updated jobs.

I'm thinking about adding another option like
`history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
snapshot will be removed when either its age is larger than
`history.expire.max-snapshot-age-ms` or it's the oldest in
`history.expire.max-snapshots-to-keep + 1` snapshots. I've created a draft
PR to demo the idea[1].
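As a rough illustration of the proposed rule (this is not the code in the PR;
the snapshot ordering and the helper name are made up for the example):

```java
import java.util.List;
import org.apache.iceberg.Snapshot;

class MaxSnapshotsSketch {
  // A snapshot is expired if it is older than max-snapshot-age-ms OR if it is
  // not among the newest max-snapshots-to-keep snapshots. `newestFirst` is
  // assumed to hold the branch's snapshots ordered newest to oldest.
  static boolean shouldExpire(
      Snapshot snapshot, List<Snapshot> newestFirst, long maxAgeMs, int maxToKeep, long nowMs) {
    boolean tooOld = nowMs - snapshot.timestampMillis() > maxAgeMs;
    boolean beyondMaxCount = newestFirst.indexOf(snapshot) >= maxToKeep;
    return tooOld || beyondMaxCount;
  }
}
```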

If you agree this is a valid feature request, we also need to update
SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a
compatibility issue or too much cost to maintain compatibility? My
experiment shows many parsers need to be updated.

I'd like to hear your thoughts on this.

1. https://github.com/apache/iceberg/pull/11879
2. https://iceberg.apache.org/spec/#snapshot-references

Happy New Year!
Manu


Re: [DISCUSS] Support keeping at most N snapshots

2025-01-06 Thread Ajantha Bhat
Hi Manu,

We already have `retain_last` and `history.expire.min-snapshots-to-keep` to
retain snapshots based on count. Can you please elaborate on why we can't
use the same?
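For reference, the existing count-based knob shows up on the Java API roughly
as below (the numbers are arbitrary; note that `retainLast` is a lower bound,
i.e. keep at least N snapshots):

```java
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;

class ExpireSnapshotsExample {
  // Existing count-based retention: expire snapshots older than one day, but
  // always retain at least the 10 most recent ones. The table is assumed to be
  // an already-loaded org.apache.iceberg.Table.
  static void expire(Table table) {
    long cutoffMs = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1);
    table.expireSnapshots()
        .expireOlderThan(cutoffMs)
        .retainLast(10)
        .commit();
  }
}
```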

- Ajantha

On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa 
wrote:

> Thanks Manu for starting this discussion. That is definitely a valid
> feature. I have always found that maintaining snapshots by day makes it
> harder to provide different types of guarantees/contracts, especially when
> table change rates are diverse or irregular. Maintaining by snapshot count
> makes a lot of sense and prevents table sizes from growing excessively
> when the change rate is high.
>
> Thanks,
> Walaa.
>
>
> On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang  wrote:
>
>> Hi all,
>>
>> While maintaining Iceberg tables for our customers, I find it's difficult
>> to set a default snapshot expiration time
>> (`history.expire.max-snapshot-age-ms`) for different workloads. The default
>> value of 5 days looks good for daily batch jobs but is too long for
>> frequently-updated jobs.
>>
>> I'm thinking about adding another option like
>> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
>> snapshot will be removed when either its age is larger than
>> `history.expire.max-snapshot-age-ms` or it's the oldest in
>> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a draft
>> PR to demo the idea[1].
>>
>> If you agree this is a valid feature request, we also need to update
>> SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a
>> compatibility issue or too much cost to maintain compatibility? My
>> experiment shows many parsers need to be updated.
>>
>> I'd like to hear your thoughts on this.
>>
>> 1. https://github.com/apache/iceberg/pull/11879
>> 2. https://iceberg.apache.org/spec/#snapshot-references
>>
>> Happy New Year!
>> Manu
>>
>


Re: [DISCUSS] Support keeping at most N snapshots

2025-01-06 Thread Manu Zhang
Hi Ajantha,

`history.expire.min-snapshots-to-keep` is the *minimum number of snapshots*
to keep. I'm proposing to also bound the *maximum number of snapshots* to
keep, i.e., to expire by count rather than only by age.

Thanks,
Manu

On Tue, Jan 7, 2025 at 2:18 PM Ajantha Bhat  wrote:

> Hi Manu,
>
> We already have `retain_last` and `history.expire.min-snapshots-to-keep`
> to retain snapshots based on count. Can you please elaborate on why we
> can't use the same?
>
> - Ajantha
>
> On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Thanks Manu for starting this discussion. That is definitely a valid
>> feature. I have always found that maintaining snapshots by day makes it
>> harder to provide different types of guarantees/contracts, especially when
>> table change rates are diverse or irregular. Maintaining by snapshot count
>> makes a lot of sense and prevents table sizes from growing excessively
>> when the change rate is high.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang 
>> wrote:
>>
>>> Hi all,
>>>
>>> While maintaining Iceberg tables for our customers, I find it's
>>> difficult to set a default snapshot expiration time
>>> (`history.expire.max-snapshot-age-ms`) for different workloads. The default
>>> value of 5 days looks good for daily batch jobs but is too long for
>>> frequently-updated jobs.
>>>
>>> I'm thinking about adding another option like
>>> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
>>> snapshot will be removed when either its age is larger than
>>> `history.expire.max-snapshot-age-ms` or it's the oldest in
>>> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a draft
>>> PR to demo the idea[1].
>>>
>>> If you agree this is a valid feature request, we also need to update
>>> SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a
>>> compatibility issue or too much cost to maintain compatibility? My
>>> experiment shows many parsers need to be updated.
>>>
>>> I'd like to hear your thoughts on this.
>>>
>>> 1. https://github.com/apache/iceberg/pull/11879
>>> 2. https://iceberg.apache.org/spec/#snapshot-references
>>>
>>> Happy New Year!
>>> Manu
>>>
>>


Re: [DISCUSS] Support keeping at most N snapshots

2025-01-06 Thread Walaa Eldin Moustafa
Thanks Manu for starting this discussion. That is definitely a valid
feature. I have always found that maintaining snapshots by day makes it
harder to provide different types of guarantees/contracts, especially when
table change rates are diverse or irregular. Maintaining by snapshot count
makes a lot of sense and prevents table sizes from growing excessively
when the change rate is high.

Thanks,
Walaa.


On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang  wrote:

> Hi all,
>
> While maintaining Iceberg tables for our customers, I find it's difficult
> to set a default snapshot expiration time
> (`history.expire.max-snapshot-age-ms`) for different workloads. The default
> value of 5 days looks good for daily batch jobs but is too long for
> frequently-updated jobs.
>
> I'm thinking about adding another option like
> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
> snapshot will be removed when either its age is larger than
> `history.expire.max-snapshot-age-ms` or it's the oldest in
> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a draft
> PR to demo the idea[1].
>
> If you agree this is a valid feature request, we also need to update
> SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a
> compatibility issue or too much cost to maintain compatibility? My
> experiment shows many parsers need to be updated.
>
> I'd like to hear your thoughts on this.
>
> 1. https://github.com/apache/iceberg/pull/11879
> 2. https://iceberg.apache.org/spec/#snapshot-references
>
> Happy New Year!
> Manu
>


Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

2025-01-06 Thread Jean-Baptiste Onofré
Hi Gabor

I did a new pass on the proposal and it looks good to me. Great work!

I volunteer to work with you on the spec PR, following the doc.

Thoughts?

Regards
JB

On Thu, Dec 19, 2024 at 11:09 AM Gabor Kaszab  wrote:
>
> Hi All,
>
> Just an update that the proposal went through some iterations based on the 
> comments from Daniel Weeks. Thanks for taking a look, Daniel!
>
> In a nutshell, this is what changed compared to the original proposal:
> - The Catalog API will remain intact; there is no proposed new API function
> now. With this, the freshness-aware functionality, and the ETags in
> particular, will not be exposed to the clients of the API.
> - Instead of storing the ETags in TableMetadata, we propose to store them in
> RESTTableOperations, since the proposal only focuses on the REST catalog. The
> very same changes can be done on other TableOperations implementations if
> there is going to be a need to have this for other catalogs too.
> - A SoftReference cache of (TableIdentifier -> Table object) is introduced on
> the RESTSessionCatalog level. This can be used for providing previous ETags
> to the HTTPClient and also to answer Catalog API calls with the latest table
> metadata if the REST server returns a '304 Not Modified' (a rough sketch of
> this flow follows below).
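Purely to illustrate the '304 Not Modified' flow from the last bullet above; the
endpoint path and the exact header handling here are assumptions, not the
proposed spec wording:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustration of the HTTP mechanics only; the endpoint path and header usage
// are assumptions based on the proposal, not the final REST spec wording.
class EtagLoadSketch {
  static void load(String previouslySeenEtag) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://rest-catalog.example/v1/namespaces/db/tables/t"))
        .header("If-None-Match", previouslySeenEtag) // ETag remembered from an earlier load
        .GET()
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() == 304) {
      // Not Modified: answer the loadTable() call from the cached Table object.
    } else {
      // 200 OK: parse the new metadata and remember response.headers().firstValue("ETag").
    }
  }
}
```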
>
> The doc is updated with the above now:
> https://docs.google.com/document/d/1rnVSP_iv2I47giwfAe-Z3DYhKkKwWCVvCkC9rEvtaLA
>
> While I keep the discussion still open, I think I'll move on to take care of 
> the changes required for the REST spec. Will send a PR for this soon.
>
> Regards,
> Gabor
>
>
> On Thu, Dec 12, 2024 at 4:07 PM Jean-Baptiste Onofré  
> wrote:
>>
>> Hi Gabor
>>
>> Thanks for the update ! I will take a look.
>>
>> Regards
>> JB
>>
>> On Thu, Dec 12, 2024 at 2:52 PM Gabor Kaszab  wrote:
>> >
>> > Hi Iceberg Community,
>> >
>> > It took me a while but I finally managed to upload the proposal for this 
>> > as an official 'Iceberg improvement proposal'. Thanks for the feedback so 
>> > far!
>> >
>> > https://github.com/apache/iceberg/issues/11766
>> >
>> > Regards,
>> > Gabor
>> >
>> >
>> > On Fri, Nov 22, 2024 at 4:51 PM Taeyun Kim  
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Since ETags are opaque values to the client, attributing any semantic 
>> >> meaning to them in the interaction between the client and server would, 
>> >> in my opinion, constitute a misuse/abuse of the HTTP specification.
>> >> On the other hand, the server can generate the ETag value as any string, 
>> >> as long as it conforms to the grammar defined in 
>> >> https://httpwg.org/specs/rfc9110.html#field.etag . Using the metadata 
>> >> location is likely the simplest option. For reference, based on the 
>> >> grammar, ETag values cannot include spaces. Therefore, if the metadata 
>> >> location contains spaces, it may need to be encoded. The same goes for 
>> >> double quotation marks. (I just found this out after looking it up.)
>> >> Anyway, in my opinion, the client must ignore any semantic meaning 
>> >> associated with the value.
>> >>
>> >> Thank you.
>> >>
>> >> -Original Message-
>> >> From:  "Zoltán Borók-Nagy" 
>> >> To:  ;
>> >> Cc:
>> >> Sent:  2024-11-22 (금) 19:57:08 (UTC+09:00)
>> >> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the 
>> >> latest
>> >>
>> >> Hi,
>> >>
>> >> Separate version information forces the clients to manage a Table ->
>> >> VersionIdentifier mapping which adds unnecessary complexity and can be
>> >> error-prone.
>> >>
>> >> If the VersionIdentifier is embedded in the Table object then the
>> >> application logic is much simpler, and the Catalog interface is not
>> >> only simpler, but also hard to use incorrectly.
>> >> Though this approach slightly increases the size of the Table objects.
>> >> And touching the Table interface might encounter some resistance, even
>> >> if it is only an extension.
>> >>
>> >> Yeah, VersionIdentifier doesn't need to be a String, it could be an
>> >> Object, or an empty interface, and the Catalog implementation could
>> >> cast it to some catalog-specific VersionIdentifierImpl.
>> >>
>> >> loadTableIfChanged() throwing UnsupportedOperationException is
>> >> reasonable, as clients can easily fall back to loadTable. In my mind I
>> >> had a use case where we cache tables without any refresh checks for a
>> >> configured TTL, and after expiration we invoke reloadTable() anyway.
>> >> But this use case can also be implemented even if loadTableIfChanged()
>> >> throws exceptions, making this approach more flexible.
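A minimal sketch of that fallback pattern, assuming the proposal's
loadTableIfChanged (not an existing Catalog method; the version argument type
and the wrapper interface are invented for the example):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Sketch of the fallback pattern discussed above. loadTableIfChanged(...) is the
// proposal's hypothetical method, not an existing Catalog API; VersionAwareCatalog
// is an invented interface used only to make the example self-contained.
interface VersionAwareCatalog extends Catalog {
  Table loadTableIfChanged(TableIdentifier id, Object knownVersion);
}

class FallbackLoad {
  static Table load(VersionAwareCatalog catalog, TableIdentifier id, Object knownVersion) {
    try {
      return catalog.loadTableIfChanged(id, knownVersion);
    } catch (UnsupportedOperationException e) {
      return catalog.loadTable(id); // standard Catalog API fallback
    }
  }
}
```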
>> >>
>> >> About metadata_location as ETag: I don't have a strong opinion here,
>> >> not sure what could go wrong if we do this. If we start with this
>> >> approach we don't even need a VersionIdentifier for Tables, making the
>> >> whole proposal more lightweight.
>> >>
>> >> Thanks Gabor for driving this and putting together a proposal!
>> >>
>> >> Cheers,
>> >> Zoltan
>> >>
>> >> On Fri, Nov 22, 2024 at 11:42 AM Gabor Kaszab  
>> >> wrote:
>> >> >
>> >> > Hi T

Re: [DISCUSS] Hive Support

2025-01-06 Thread Péter Váry
Hi Manu,

> Spark has only added hive 4.0 metastore support recently for Spark 4.0[1]
and there will be conflicts

Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use
Hive 2 when it is present on the classpath, and fall back to the embedded
Hive 4 code only when older Hive versions are not on the classpath?

> Firstly, upgrading from Hive 2 to Hive 4 is a huge change

Is this a huge change even after we remove the Hive runtime module?

After removing the Hive runtime module, we have 2 remaining Hive
dependencies:

   - HMS Client
      - The Thrift API should not change between Hive versions, so unless we
      start to use Hive 4-specific features we should be fine here - whatever
      version of Hive we use should work
      - Java API changes. We found that in Hive 2 and Hive 3 the HMSClient
      classes used different constructors, so we ended up using DynMethods to
      pick the appropriate constructor - if we pin a single Hive version here,
      then we won't need the DynMethods anymore (see the reflection sketch
      after this list)
      - Based on our experience, even though Hive 3 itself doesn't support
      Java 11, the HMS Client for Hive 3 doesn't have any issues when used
      with Java 11
   - Testing infrastructure
      - TestHiveMetastore creates and starts an HMS instance. This could be
      highly dependent on the version of Hive we are using. Since this is only
      test code, I expect that only our tests interact with it
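To make the constructor issue concrete, here is a plain-reflection sketch of
the kind of lookup DynMethods makes convenient; this is not Iceberg's actual
code, and the single-argument probe is a simplification:

```java
import java.lang.reflect.Constructor;

// Simplified illustration of why DynMethods-style constructor lookup is needed:
// the HMS client exposes different constructors across Hive versions, so the
// caller probes for whichever one is present. Plain reflection is used here,
// and the argument handling is simplified.
class HmsClientFactory {
  static Object newClient(Object conf) throws Exception {
    Class<?> clientClass = Class.forName("org.apache.hadoop.hive.metastore.HiveMetaStoreClient");
    for (Constructor<?> ctor : clientClass.getConstructors()) {
      Class<?>[] params = ctor.getParameterTypes();
      // Pick a single-argument constructor that accepts the configuration object.
      if (params.length == 1 && params[0].isInstance(conf)) {
        return ctor.newInstance(conf);
      }
    }
    throw new IllegalStateException("No suitable HiveMetaStoreClient constructor found");
  }
}
```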

*@Manu*: You know more of the details here. Do we have HMSClient issues
when we use Hive 4 code? If I miss something in the listing above, please
correct me.

Based on this, in an ideal world:

   - Hive would provide an HMS client jar which contains only the Java code
   needed to connect to and communicate with an HMS instance over Thrift
   (no internal HMS server code etc). We could use it as a dependency for our
   iceberg-hive-metastore module, either setting a minimal version or using a
   shaded embedded version. *@Hive* folks - is this a valid option? What
   are the reasons that there is no metastore-client jar provided currently?
   Would it be possible to generate one in a future Hive release? It seems
   like a worthy feature to me.
   - We would create our own version-dependent HMS infrastructure if we want
   to support Spark versions that support older Hive versions.

As a result of this, we could have:

   - Clean definition of which Hive version is supported
   - Testing for the supported Hive versions
   - Java 11 support

As an alternative, we can create a testing matrix where some tests are run
with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for
older Spark versions that do not support Hive 4)

Thanks Manu for driving this!
Peter

Manu Zhang  ezt írta (időpont: 2025. jan. 5., V,
5:18):

>> This basically means that we need to support every exact Hive version
>> used by Spark, and we need to exclude our own Hive version from the
>> Spark runtime.
>
>
> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect
> compatibility to be much better once Iceberg and Spark are both on Hive 4.
>
> Secondly, the coupling can be loosened if we are moving toward the REST
> catalog.
>
> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry 
> wrote:
>
>> That sounds really interesting in a bad way :) :(
>>
>> This basically means that we need to support every exact Hive version
>> used by Spark, and we need to exclude our own Hive version from the
>> Spark runtime.
>>
>> On Thu, Dec 19, 2024, 04:00 Manu Zhang  wrote:
>>
>>> Hi Peter,
>>>
 I think we should make sure that the Iceberg Hive version is
 independent from the version used by Spark
>>>
>>>  I'm afraid that is not how it works currently. When Spark is deployed
>>> with hive libraries (I suppose this is common), iceberg-spark runtime must
>>> be compatible with them.
>>> Otherwise, we need to ask users to exclude hive libraries from Spark and
>>> ship iceberg-spark runtime with Iceberg's hive dependencies.
>>>
>>> Regards,
>>> Manu
>>>
>>> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry 
>>> wrote:
>>>
 @Manu: What will be the end result? Do we have to use the same Hive
 version in Iceberg as it is defined by Spark? I think we should make sure
 that the Iceberg Hive version is independent from the version used by Spark

 On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com  wrote:

> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>
> We can at least separate the concerns. We can remove the runtime
> modules, which are the main issue. If we compile against an older version of
> the Hive metastore module (leaving it unchanged), that at least gives us a
> dramatically reduced surface area for Java version issues. As long as the
> API is compatible (and we haven't heard complaints that it is not), then I
> think users can override the version in their environments.
>
> Ryan
>
> On Sun