That’s exactly what we need.

> On Sep 30, 2021, at 9:58 AM, Jacques Nadeau <jacquesnad...@gmail.com> wrote:
> 
> I actually wonder if file formats should be an extension API, so that someone 
> could implement a file format without any changes in Iceberg core (I don't 
> think this is possible today). Let's say one wanted to create a proprietary 
> format but use Iceberg semantics (not me). Could we make it such that one 
> could do so by building an extension and leveraging off-the-shelf Iceberg? 
> That seems like the best option for something like RCFile. People are surely 
> going to want to add new formats, given the pain of rewriting large datasets, 
> but I'd hate to see lots of partially implemented file formats in Iceberg 
> proper. Better for people to build against an extension API and have it serve 
> the purposes they need. Maybe go so far as to have the extension API only 
> allow read, not write, so that people don't do crazy things...
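> 
> To make that concrete, here is a very rough sketch of the shape such an 
> extension point could take. Nothing like this exists in Iceberg today; the 
> interface and method names below are hypothetical, and only the referenced 
> types (Schema, Record, CloseableIterable, InputFile) are real Iceberg classes:
> 
> import org.apache.iceberg.Schema;
> import org.apache.iceberg.data.Record;
> import org.apache.iceberg.io.CloseableIterable;
> import org.apache.iceberg.io.InputFile;
> 
> // Hypothetical read-only extension point: a plugin would register one of
> // these, and Iceberg core would route scans of matching data files to it.
> public interface FileFormatExtension {
> 
>   // Which data files this plugin handles, e.g. file names ending in ".rc".
>   boolean canRead(String fileName);
> 
>   // Open one data file and return its rows projected to the requested
>   // schema. There is deliberately no write-side counterpart, so the
>   // extension stays read-only.
>   CloseableIterable<Record> open(InputFile file, Schema projection);
> }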
> 
> 
> On Wed, Sep 29, 2021 at 6:43 PM yuan youjun <yuanyou...@gmail.com> wrote:
> Hi Ryan and Russell
> 
> Thanks very much for your response.
> 
> Well, I want the ACID and row-level update capability that Iceberg provides. I 
> believe a data lake is a better way to manage our dataset than Hive.
> I also want our transition from Hive to the data lake to be as smooth as 
> possible, which means:
> 1, the transition should be transparent to consumers (dashboards, data 
> scientists, downstream pipelines). If we start a new Iceberg table for new 
> data, those consumers will NOT be able to query old data (without splitting 
> their queries into two and combining the results).
> 2, it should not impose significant infra cost. Converting historical data from 
> RCFile into ORC or Parquet would be time consuming and costly (though it's a 
> one-time cost). I got your point that a new format would probably save us 
> storage cost in the long term; that would be a separate, interesting topic.
> 
> Here is what is on my mind now:
> 1, if Iceberg supports (or will support) the legacy format, that would be ideal.
> 2, if not, is it possible for us to develop that feature (maybe in a fork)?
> 3, converting historical data into a new format would be our last resort; that 
> path needs more evaluation.
> 
> 
> youjun
> 
>> On Sep 30, 2021, at 12:15 AM, Ryan Blue <b...@tabular.io> wrote:
>> 
>> Youjun, what are you trying to do?
>> 
>> If you have existing tables in an incompatible format, you may just want to 
>> leave them as they are for historical data. It depends on why you want to 
>> use Iceberg. If you want to be able to query larger ranges of that data 
>> because you've clustered across files by filter columns, then you'd want to 
>> build the Iceberg metadata. But if you have a lot of historical data that 
>> hasn't been clustered and is unlikely to be rewritten, then keeping old 
>> tables in RCFile and doing new work in Iceberg could be a better option.
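>> 
>> If you keep the old RCFile tables and do new work in Iceberg, one way to hide 
>> the split from consumers is a view that unions the two tables. A minimal 
>> sketch, assuming made-up table names (legacy_db.events_rcfile and 
>> lake_db.events_iceberg) that share a schema and a dt partition column:
>> 
>> import org.apache.spark.sql.SparkSession;
>> 
>> public class BridgeView {
>>   public static void main(String[] args) {
>>     SparkSession spark = SparkSession.builder()
>>         .appName("bridge-view")
>>         .enableHiveSupport()   // needed to read the legacy Hive table
>>         .getOrCreate();
>> 
>>     // Table names and the cutover date below are placeholders.
>>     // Dashboards and downstream jobs keep querying one name; the cutover
>>     // date decides which physical table serves each row.
>>     spark.sql(
>>         "CREATE OR REPLACE VIEW lake_db.events AS "
>>             + "SELECT * FROM legacy_db.events_rcfile WHERE dt < '2021-10-01' "
>>             + "UNION ALL "
>>             + "SELECT * FROM lake_db.events_iceberg WHERE dt >= '2021-10-01'");
>>   }
>> }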
>> 
>> You may also want to check how much savings you get out of using Iceberg 
>> with Parquet files vs RCFile. If you find that you can cluster your data for 
>> better queries and that ends up making your dataset considerably smaller, 
>> then maybe it's worth the conversion that Russell suggested. RCFile is 
>> pretty old, so I think there's a good chance you'd save a lot of space -- 
>> just updating from an old compression codec to something more modern, like 
>> snappy to lz4 or gzip to zstd, could be a big win.
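>> 
>> A cheap way to check is to rewrite one sample partition into an Iceberg 
>> Parquet table with zstd and compare the footprint before committing to a full 
>> conversion. A sketch with made-up table names; write.parquet.compression-codec 
>> is the Iceberg table property I'd expect to control the Parquet codec here:
>> 
>> import org.apache.spark.sql.SparkSession;
>> 
>> public class CompressionSample {
>>   public static void main(String[] args) {
>>     SparkSession spark = SparkSession.builder()
>>         .appName("compression-sample")
>>         .enableHiveSupport()
>>         .getOrCreate();
>> 
>>     // Table names and the sample date below are placeholders.
>>     // Copy a single day of the RCFile table into Iceberg with zstd, then
>>     // compare the on-disk size of the two partitions.
>>     spark.sql(
>>         "CREATE TABLE lake_db.events_sample "
>>             + "USING iceberg "
>>             + "TBLPROPERTIES ('write.parquet.compression-codec'='zstd') "
>>             + "AS SELECT * FROM legacy_db.events_rcfile WHERE dt = '2021-09-01'");
>>   }
>> }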
>> 
>> Ryan
>> 
>> On Wed, Sep 29, 2021 at 8:49 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> Within Iceberg it would take a bit of effort; we would need custom readers at 
>> a minimum, even if we just wanted to make it read-only support. I think the 
>> main complexity would be designing the specific readers for the platform you 
>> want to use, like Spark or Flink; the actual metadata handling and such 
>> would probably be pretty straightforward. I would definitely size it as at 
>> least a several-week project, and I'm not sure we would want to support it in 
>> OSS Iceberg.
>> 
>> On Wed, Sep 29, 2021 at 10:40 AM 袁尤军 <wdyuanyou...@163.com> wrote:
>> Thanks for the suggestion. We need to evaluate the cost of converting the 
>> format, as those Hive tables have been there for many years, so PBs of data 
>> would need to be reformatted.
>> 
>> Also, do you think it is possible to develop support for a new format? 
>> How costly would it be?
>> 
>> Sent from my iPhone
>> 
>> > On Sep 29, 2021, at 9:34 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>> > 
>> > There is no plan I am aware of to use RCFile directly in Iceberg. While 
>> > we could work to support other file formats, I don't think RCFile is very 
>> > widely used compared to ORC and Parquet (Iceberg has native support for 
>> > those formats).
>> > 
>> > My suggestion for conversion would be to do a CTAS statement in Spark and 
>> > have the table completely converted over to Parquet (or ORC). This is 
>> > probably the simplest way.
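>> > 
>> > For reference, a minimal sketch of that CTAS from Spark (the catalog, table 
>> > names, and the dt partition column are assumptions, not taken from your 
>> > environment; pick partitioning to match how the data is queried):
>> > 
>> > import org.apache.spark.sql.SparkSession;
>> > 
>> > public class ConvertToIceberg {
>> >   public static void main(String[] args) {
>> >     SparkSession spark = SparkSession.builder()
>> >         .appName("rcfile-to-iceberg")
>> >         .enableHiveSupport()   // to read the existing RCFile-backed table
>> >         .getOrCreate();
>> > 
>> >     // Catalog, table, and column names below are placeholders.
>> >     // Rewrites the RCFile data as Parquet under a new Iceberg table in one pass.
>> >     spark.sql(
>> >         "CREATE TABLE iceberg_catalog.db.events "
>> >             + "USING iceberg "
>> >             + "PARTITIONED BY (dt) "
>> >             + "AS SELECT * FROM hive_db.events_rcfile");
>> >   }
>> > }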
>> > 
>> >> On Sep 29, 2021, at 7:01 AM, yuan youjun <yuanyou...@gmail.com> wrote:
>> >> 
>> >> Hi community,
>> >> 
>> >> I am exploring ways to evolve our existing Hive tables (RCFile) into a 
>> >> data lake. However, I found that Iceberg (and Hudi, Delta Lake) does not 
>> >> support RCFile. So my questions are:
>> >> 1, is there any plan (or is it possible) to support RCFile in the future, 
>> >> so we can manage those existing data files without re-formatting?
>> >> 2, if there is no such plan, do you have any suggestions on migrating 
>> >> RCFile data into Iceberg?
>> >> 
>> >> Thanks
>> >> Youjun
>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Tabular
> 
