That’s exactly what we need.
> On Sep 30, 2021, at 9:58 AM, Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>
> I actually wonder if file formats should be an extension API, so someone can
> implement a file format without any changes in Iceberg core (I don't think
> this is possible today). Let's say one wanted to create a proprietary format
> but use Iceberg semantics (not me). Could we make it such that one could do
> so by building an extension and leveraging off-the-shelf Iceberg? That seems
> the best option for something like RCFile. People will certainly want to add
> new formats, given the pain of rewriting large datasets, but I'd hate to see
> lots of partially implemented file formats in Iceberg proper. Better for
> people to build against an extension API and have them serve the purposes
> they need. Maybe go so far as to have the extension API only allow read, not
> write, so that people don't do crazy things...
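A minimal sketch of the shape such a read-only extension API could take. This is hypothetical: Iceberg core has no file-format plugin point today, and the names `FileFormatExtension` and `InMemoryFormat` are invented for illustration. The point is structural: the plugin exposes a reader but no writer, so third-party formats are read-only by construction.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch only: Iceberg core has no such extension API, and these
// names are invented for illustration. A plugin exposes a reader but no
// writer, making third-party formats read-only by construction.
interface FileFormatExtension<T> {
    /** Format name as it would appear in table metadata, e.g. "rcfile". */
    String name();

    /** Open a row iterator over one data file path. No write method exists. */
    Iterator<T> openReader(String path);
}

// Toy stand-in that "reads" rows from memory instead of a real file.
class InMemoryFormat implements FileFormatExtension<String> {
    private final List<String> rows;

    InMemoryFormat(List<String> rows) {
        this.rows = rows;
    }

    @Override
    public String name() {
        return "in-memory";
    }

    @Override
    public Iterator<String> openReader(String path) {
        return rows.iterator();
    }
}

public class Demo {
    public static void main(String[] args) {
        FileFormatExtension<String> fmt = new InMemoryFormat(Arrays.asList("a", "b"));
        Iterator<String> reader = fmt.openReader("s3://bucket/part-0");
        System.out.println(fmt.name());    // prints "in-memory"
        System.out.println(reader.next()); // prints "a"
    }
}
```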
>
>
> On Wed, Sep 29, 2021 at 6:43 PM yuan youjun <yuanyou...@gmail.com
> <mailto:yuanyou...@gmail.com>> wrote:
> Hi Ryan and Russell
>
> Thanks very much for your response.
>
> Well, I want the ACID and row-level update capabilities that Iceberg
> provides. I believe a data lake is a better way to manage our dataset than
> Hive. I also want our transition from Hive to the data lake to be as smooth
> as possible, which means:
> 1, the transition should be transparent to consumers (dashboards, data
> scientists, downstream pipelines). If we start a new Iceberg table with new
> data, then those consumers will NOT be able to query old data (without
> splitting their queries into two and combining the results).
> 2, it should not impose significant infra cost. Converting historical data
> from RCFile into ORC or Parquet would be time consuming and costly (though
> it's a one-time cost). I take your point that a new format would probably
> save us storage cost in the long term; that would be a separate interesting
> topic.
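To illustrate the query-splitting concern in point 1: with old data in Hive/RCFile and new data in Iceberg, consumers would have to do something like the following (table names and the cutover date are hypothetical):

```sql
-- Hypothetical: events before the cutover live in the old RCFile-backed Hive
-- table, events after it in the new Iceberg table; every consumer query must
-- be split across both and the results combined.
SELECT dt, COUNT(*) AS cnt
FROM legacy_db.events_rcfile
WHERE dt < '2021-10-01'
GROUP BY dt
UNION ALL
SELECT dt, COUNT(*) AS cnt
FROM lake_db.events_iceberg
WHERE dt >= '2021-10-01'
GROUP BY dt;
```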
>
> Here is what is in my mind now:
> 1, if Iceberg supports (or will support) legacy formats, that would be ideal.
> 2, if not, is it possible for us to develop that feature (maybe in a fork)?
> 3, converting historical data into a new format should be our last resort;
> that path needs more evaluation.
>
>
> youjun
>
>> On Sep 30, 2021, at 12:15 AM, Ryan Blue <b...@tabular.io <mailto:b...@tabular.io>> wrote:
>>
>> Youjun, what are you trying to do?
>>
>> If you have existing tables in an incompatible format, you may just want to
>> leave them as they are for historical data. It depends on why you want to
>> use Iceberg. If you want to be able to query larger ranges of that data
>> because you've clustered across files by filter columns, then you'd want to
>> build the Iceberg metadata. But if you have a lot of historical data that
>> hasn't been clustered and is unlikely to be rewritten, then keeping old
>> tables in RCFile and doing new work in Iceberg could be a better option.
>>
>> You may also want to check how much savings you get out of using Iceberg
>> with Parquet files vs RCFile. If you find that you can cluster your data for
>> better queries and that ends up making your dataset considerably smaller,
>> then maybe it's worth the conversion that Russell suggested. RCFile is
>> pretty old, so I think there's a good chance you'd save a lot of space --
>> just updating from an old compression codec to something more modern (say,
>> snappy to lz4, or gzip to zstd) could be a big win.
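For instance, the codec Iceberg uses for newly written Parquet files can be set through the `write.parquet.compression-codec` table property (shown here in Spark SQL; the table name is hypothetical):

```sql
-- Hypothetical table name; write.parquet.compression-codec is the Iceberg
-- table property controlling the codec used for newly written Parquet files.
ALTER TABLE lake_db.events
SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd');
```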
>>
>> Ryan
>>
>> On Wed, Sep 29, 2021 at 8:49 AM Russell Spitzer <russell.spit...@gmail.com
>> <mailto:russell.spit...@gmail.com>> wrote:
>> Within Iceberg it would take a bit of effort; we would need custom readers
>> at a minimum, even if we just wanted read-only support. I think the main
>> complexity would be designing the specific readers for the platform you
>> want to use, like Spark or Flink; the actual metadata handling would
>> probably be pretty straightforward. I would definitely size it as at least
>> a several-week project, and I'm not sure we would want to support it in
>> OSS Iceberg.
>>
>> On Wed, Sep 29, 2021 at 10:40 AM 袁尤军 <wdyuanyou...@163.com
>> <mailto:wdyuanyou...@163.com>> wrote:
>> Thanks for the suggestion. We need to evaluate the cost of converting the
>> format, as those Hive tables have been there for many years, so PBs of
>> data would need to be reformatted.
>>
>> Also, do you think it is possible to develop support for a new format?
>> How costly would it be?
>>
>> Sent from my iPhone
>>
>> > On Sep 29, 2021, at 9:34 PM, Russell Spitzer <russell.spit...@gmail.com
>> > <mailto:russell.spit...@gmail.com>> wrote:
>> >
>> > There is no plan I am aware of to use RCFiles directly in Iceberg. While
>> > we could work to support other file formats, I don't think RCFile is very
>> > widely used compared to ORC and Parquet (Iceberg has native support for
>> > those formats).
>> >
>> > My suggestion for conversion would be to do a CTAS statement in Spark and
>> > have the table completely converted over to Parquet (or ORC). This is
>> > probably the simplest way.
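A Spark SQL CTAS for that conversion could look roughly like this (the catalog and table names are hypothetical; `write.format.default` is the Iceberg table property selecting the data file format):

```sql
-- Hypothetical names: legacy_db.events is the existing RCFile-backed Hive
-- table; iceberg_catalog.lake_db.events is the new Iceberg table. The CTAS
-- reads every row once and rewrites it as Parquet.
CREATE TABLE iceberg_catalog.lake_db.events
USING iceberg
TBLPROPERTIES ('write.format.default' = 'parquet')
AS SELECT * FROM legacy_db.events;
```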
>> >
>> >> On Sep 29, 2021, at 7:01 AM, yuan youjun <yuanyou...@gmail.com
>> >> <mailto:yuanyou...@gmail.com>> wrote:
>> >>
>> >> Hi community,
>> >>
>> >> I am exploring ways to evolve existing Hive tables (RCFile) into a data
>> >> lake. However, I found that Iceberg (and Hudi, Delta Lake) does not
>> >> support RCFile. So my questions are:
>> >> 1, is there any plan (or is it possible) to support RCFile in the
>> >> future, so we can manage those existing data files without re-formatting?
>> >> 2, if there's no such plan, do you have any suggestions for migrating
>> >> RCFiles into Iceberg?
>> >>
>> >> Thanks
>> >> Youjun
>>
>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>