Gaurav,

Is your data partitioned by date? If so, you can compact subsets of
partitions at a time. To do this using the Spark procedure, you pass a
where clause:

spark.sql("CALL catalog_name.system.rewrite_data_files(table => '...',
where => '...')")
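
For example, if the table were partitioned by day on an event_date column
(made-up table and column names), the CALL could look something like this,
restricting the rewrite to one week of partitions:

CALL catalog_name.system.rewrite_data_files(
  table => 'db.events',
  where => 'event_date >= "2023-01-01" AND event_date < "2023-01-08"')

Spark SQL treats the double quotes inside the where string as string
literals by default, which avoids having to escape nested single quotes.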

If you use the RewriteDataFilesSparkAction, you call filter(Expression),
but then you have to pass in your where clause as an Iceberg Expression.
You can use SparkExpressionConverter
(https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/spark/v3.3/spark/src/main/scala/org/apache/spark/sql/execution/datasources/SparkExpressionConverter.scala)
to do that conversion, as shown in RewriteDataFilesProcedure
(https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteDataFilesProcedure.java#L133-L135).
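
A rough sketch of that with the Java action (made-up table and column names,
following the same wiring the procedure uses at the lines linked above):

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.execution.datasources.SparkExpressionConverter;

// Compact only the files matching a date-range predicate.
RewriteDataFiles.Result compactDateRange(SparkSession spark, Table table, String tableName)
    throws AnalysisException {
  String where = "event_date >= '2023-01-01' AND event_date < '2023-01-08'";

  // Resolve the SQL predicate against the table, then convert it to an Iceberg Expression.
  org.apache.spark.sql.catalyst.expressions.Expression sparkExpr =
      SparkExpressionConverter.collectResolvedSparkExpression(spark, tableName, where);
  org.apache.iceberg.expressions.Expression filter =
      SparkExpressionConverter.convertToIcebergExpression(sparkExpr);

  return SparkActions.get(spark)
      .rewriteDataFiles(table)  // the Iceberg Table loaded from the catalog
      .filter(filter)
      .execute();
}

For a simple date range you could also skip the converter and build the
filter directly with org.apache.iceberg.expressions.Expressions, e.g.
Expressions.and(Expressions.greaterThanOrEqual("event_date", "2023-01-01"),
Expressions.lessThan("event_date", "2023-01-08")).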

- Wing Yew


On Tue, May 23, 2023 at 10:13 PM Gaurav Agarwal <gaurav130...@gmail.com>
wrote:

>
> On Wed, May 24, 2023, 10:41 AM Gaurav Agarwal <gaurav130...@gmail.com>
> wrote:
>
>> I have one more query. We are trying to compact files, and it is taking
>> a long time because we have never compacted until now; this is the first
>> time we are performing compaction after 5 months of continuously loading
>> data. We also changed the format version of the table from 1 to 2 in
>> between.
>> The issue is that we are using the RewriteDataFilesSparkAction Java API
>> to perform the compaction, but it is taking 24 hours to complete the job.
>> Is there a way in that API to pass a date range? There are options, but
>> what parameters should I pass to restrict it to a date range?
>>
>> Thanks
>>
>
