> On 30 Apr 2017, at 09:19, Zeming Yu wrote:
>
> Hi,
>
> We're building a parquet based data lake. I was under the impression that
> flat files are more efficient than deeply nested files (say 3 or 4 levels
> down). Is that correct?
>
> Thanks,
> Zeming
Where's the data going to live: HDFS
Can you give more details on the schema? Is the 6 TB just airport information,
as below?
> On 30. Apr 2017, at 23:08, Zeming Yu wrote:
>
> I thought relational databases with 6 TB of data can be quite expensive?
>
>> On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote:
>> I am not sure if parquet
You have to find out how users filter: by code? By airport name? Then you
can choose the right structure. That said, in the scenario below, ORC with
bloom filters may have some advantages.
It is crucial that you sort the data, when inserting it, on the columns your
users want to filter on. E.g. If f
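
To illustrate why sorting on the filter column matters, here is a minimal
sketch in plain Python. It does not use Parquet or ORC at all; it only
mimics the per-chunk min/max statistics that columnar formats keep, and
counts how many chunks a point lookup would have to read. The airport codes,
the chunk size, and the helper names (row_groups, groups_to_scan) are made up
for the example.

```python
import random

def row_groups(values, size):
    """Split values into consecutive chunks and keep (min, max) per chunk,
    loosely mimicking the statistics stored for each row group."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append((min(chunk), max(chunk)))
    return groups

def groups_to_scan(groups, key):
    """Count chunks whose [min, max] range could contain key.
    Chunks outside that range can be skipped without reading them."""
    return sum(1 for lo, hi in groups if lo <= key <= hi)

random.seed(0)
# Hypothetical, unique airport-like codes (not real data).
codes = [f"A{n:04d}" for n in range(10_000)]
shuffled = codes[:]
random.shuffle(shuffled)

unsorted_groups = row_groups(shuffled, 1_000)
sorted_groups = row_groups(sorted(shuffled), 1_000)

key = "A4242"
print("chunks to scan, unsorted:", groups_to_scan(unsorted_groups, key))
print("chunks to scan, sorted:  ", groups_to_scan(sorted_groups, key))
```

With sorted input the lookup touches a single chunk, because the min/max
ranges no longer overlap. With shuffled input nearly every chunk's range
covers the key, so little can be skipped.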
I thought relational databases with 6 TB of data can be quite expensive?
On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote:
> I am not sure if parquet is a good fit for this? This seems more like a
> filter lookup than an aggregate-like query. I am curious to see what others
> have to say.
> Would i
Another question: I need to store airport info in a parquet file and
present it when a user makes a query.
For example:
"airport": {
"code": "TPE",
"name": "Taipei (Taoyuan Intl.)",
It depends on your queries, the data structure, etc. Generally flat is better,
but if your query filter is on the highest level then you may get better
performance with a nested structure. It really depends.
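
As a small illustration of what "flat" versus "nested" means for a record,
the sketch below flattens a nested dict into dotted column names in plain
Python. It uses only the two airport fields that appear elsewhere in this
thread. The flatten helper and the dotted naming are assumptions for the
example, not something Parquet or Spark requires.

```python
def flatten(record, prefix=""):
    """Turn nested dicts into one level of dotted keys,
    e.g. {"a": {"b": 1}} becomes {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse, carrying the parent name as a prefix.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

nested = {"airport": {"code": "TPE", "name": "Taipei (Taoyuan Intl.)"}}
flat = flatten(nested)
print(flat)
# {'airport.code': 'TPE', 'airport.name': 'Taipei (Taoyuan Intl.)'}
```

Whether the flat or the nested layout performs better still depends on which
columns the queries filter on.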
> On 30. Apr 2017, at 10:19, Zeming Yu wrote:
>
> Hi,
>
> We're building a parquet ba
Hi,
We're building a parquet based data lake. I was under the impression that
flat files are more efficient than deeply nested files (say 3 or 4 levels
down). Is that correct?
Thanks,
Zeming