I actually wonder if file formats should be an extension API so someone
can implement a file format without any changes in Iceberg core (I don't
think this is possible today). Let's say one wanted to create a
proprietary format but use Iceberg semantics (not me). Could we make it
such that one could do so by building an extension and leveraging
off-the-shelf Iceberg? That seems like the best option for something
like RCFile. People are surely going to want to add new formats, given
the pain of rewriting large datasets, but I'd hate to see lots of
partially implemented file formats in Iceberg proper. Better for people
to build against an extension API and have it serve the purposes they
need. Maybe go so far as to have the extension API only allow reads, not
writes, so that people don't do crazy things...
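
Roughly, I'm imagining something with this shape; every name below is
hypothetical, just a sketch of what a read-only plugin contract could
look like (CloseableIterable, Record, InputFile, and Schema are existing
Iceberg types, the interface itself is not):

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.io.InputFile;

    // Hypothetical extension point; not a real Iceberg API today.
    public interface FileFormatExtension {
      // Format name as it would be recorded in manifest entries, e.g. "rcfile".
      String formatName();

      // Read-only by design: core plans the scan, the plugin only
      // decodes individual data files with the requested projection.
      CloseableIterable<Record> open(InputFile file, Schema projection);
    }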


On Wed, Sep 29, 2021 at 6:43 PM yuan youjun <yuanyou...@gmail.com> wrote:

> Hi Ryan and Russell
>
> Thanks very much for your response.
>
> Well, I want the ACID and row-level update capability that Iceberg
> provides. I believe a data lake is a better way to manage our dataset
> than Hive. I also want our transition from Hive to a data lake to be as
> smooth as possible, which means:
> 1, the transition should be transparent to consumers (dashboards, data
> scientists, downstream pipelines). If we start a new Iceberg table for
> new data, then those consumers will NOT be able to query old data
> (without splitting their queries into two and combining the results;
> one possible stopgap is sketched below).
> 2, it should not impose significant infra cost. Converting historical
> data from RCFile into ORC or Parquet would be time consuming and costly
> (though it's a one-time cost). I take your point that the new format
> would probably save us storage cost in the long term; that would be a
> separate interesting topic.
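>
> For point 1, one stopgap we are considering is a view over both tables
> (all names here are made up; this assumes a SparkSession `spark` that
> can reach the legacy Hive table and the new Iceberg table from one
> catalog):
>
>     // query old RCFile data and new Iceberg data as one logical table
>     spark.sql(
>         "CREATE OR REPLACE VIEW db.events AS "
>       + "SELECT * FROM db.events_rcfile WHERE dt < '2021-10-01' "
>       + "UNION ALL "
>       + "SELECT * FROM db.events_iceberg WHERE dt >= '2021-10-01'");
>
> but maintaining such a view for every table is exactly the kind of
> burden we hope to avoid.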
>
> Here is what is in my mind now:
> 1, if Iceberg supports (or will support) the legacy format, that would
> be ideal.
> 2, if not, is it possible for us to develop that feature (maybe in a
> fork)?
> 3, converting historical data into a new format would be our last
> resort; that path needs more evaluation.
>
>
> youjun
>
> On Sep 30, 2021, at 12:15 AM, Ryan Blue <b...@tabular.io> wrote:
>
> Youjun, what are you trying to do?
>
> If you have existing tables in an incompatible format, you may just want
> to leave them as they are for historical data. It depends on why you want
> to use Iceberg. If you want to be able to query larger ranges of that data
> because you've clustered across files by filter columns, then you'd want to
> build the Iceberg metadata. But if you have a lot of historical data that
> hasn't been clustered and is unlikely to be rewritten, then keeping old
> tables in RCFile and doing new work in Iceberg could be a better option.
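>
> If the files were already in Parquet, ORC, or Avro, the add_files stored
> procedure could register them with an Iceberg table in place, without
> rewriting anything; RCFile isn't a supported input, so it would still
> need a conversion first. Table names here are placeholders:
>
>     // assumes a SparkSession `spark` with an Iceberg catalog `cat`
>     spark.sql(
>         "CALL cat.system.add_files("
>       + "table => 'db.events_iceberg', "
>       + "source_table => 'db.events_hive')");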
>
> You may also want to check how much savings you get out of using Iceberg
> with Parquet files vs. RCFile. If you find that you can cluster your data
> for better queries, and that ends up making your dataset considerably
> smaller, then maybe it's worth the conversion that Russell suggested.
> RCFile is pretty old, so I think there's a good chance you'd save a lot
> of space -- just updating from an older compression codec to something
> more modern, like snappy to lz4 or gzip to zstd, could be a big win.
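>
> With Iceberg and Parquet the codec is just a table property, so it is
> cheap to experiment (catalog and table names are placeholders):
>
>     // switch new writes to zstd; already-written files are untouched
>     spark.sql(
>         "ALTER TABLE cat.db.events SET TBLPROPERTIES ("
>       + "'write.parquet.compression-codec' = 'zstd')");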
>
> Ryan
>
> On Wed, Sep 29, 2021 at 8:49 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> Within Iceberg it would take a bit of effort; we would need custom
>> readers at a minimum if we just wanted to make it read-only support. I
>> think the main complexity would be designing the specific readers for
>> the platform you want to use, like Spark or Flink; the actual metadata
>> handling and such would probably be pretty straightforward. I would
>> definitely size it as at least a several-week project, and I'm not sure
>> we would want to support it in OSS Iceberg.
>>
>> On Wed, Sep 29, 2021 at 10:40 AM 袁尤军 <wdyuanyou...@163.com> wrote:
>>
>>> Thanks for the suggestion. We need to evaluate the cost of converting
>>> the format, as those Hive tables have been there for many years, so
>>> petabytes of data would need to be reformatted.
>>>
>>> Also, do you think it is possible to develop support for a new format?
>>> How costly would it be?
>>>
>>> Sent from my iPhone
>>>
>>> > On Sep 29, 2021, at 9:34 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>> >
>>> > There is no plan I am aware of to use RCFile directly in Iceberg.
>>> > While we could work to support other file formats, I don't think
>>> > RCFile is very widely used compared to ORC and Parquet (Iceberg has
>>> > native support for those formats).
>>> >
>>> > My suggestion for conversion would be to do a CTAS statement in
>>> > Spark and have the table completely converted over to Parquet (or
>>> > ORC). This is probably the simplest way.
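>>> >
>>> > For example (all names made up; assumes a SparkSession `spark` with
>>> > an Iceberg catalog `cat` configured):
>>> >
>>> >     // rewrite the RCFile data into a new Iceberg table backed by Parquet
>>> >     spark.sql(
>>> >         "CREATE TABLE cat.db.events USING iceberg "
>>> >       + "AS SELECT * FROM db.events_rcfile");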
>>> >
>>> >> On Sep 29, 2021, at 7:01 AM, yuan youjun <yuanyou...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi community,
>>> >>
>>> >> I am exploring ways to evolve our existing Hive tables (RCFile)
>>> >> into a data lake. However, I found that Iceberg (and Hudi and
>>> >> Delta Lake) does not support RCFile. So my questions are:
>>> >> 1, is there any plan (or is it possible) to support RCFile in the
>>> >> future, so we can manage those existing data files without
>>> >> reformatting?
>>> >> 2, if there is no such plan, do you have any suggestions for
>>> >> migrating RCFiles into Iceberg?
>>> >>
>>> >> Thanks
>>> >> Youjun
>>>
>>>
>>>
>
> --
> Ryan Blue
> Tabular
>
>
>
