Re: parquet optimal file structure - flat vs nested

Jörn Franke Sun, 30 Apr 2017 14:47:10 -0700

Can you give more details on the schema? Is the 6 TB just airport information 
like the example below?


> On 30. Apr 2017, at 23:08, Zeming Yu <zemin...@gmail.com> wrote:
> 
> I thought a relational database holding 6 TB of data could be quite expensive?
> 
>> On 1 May 2017 12:56 am, "Muthu Jayakumar" <bablo...@gmail.com> wrote:
>> I am not sure Parquet is a good fit for this. It looks more like a filtered 
>> lookup than an aggregate-style query. I am curious to see what others have to 
>> say.
>> Would a relational database with the right index (on the code field in the 
>> above case) be more efficient, queried from Spark with predicate push-down?
>> Hope this helps.
>> 
>> Thanks,
>> Muthu
>> 
>>> On Sun, Apr 30, 2017 at 1:45 AM, Zeming Yu <zemin...@gmail.com> wrote:
>>> Another question: I need to store airport info in a parquet file and 
>>> present it when a user makes a query. 
>>> 
>>> For example:
>>> 
>>> "airport": {
>>>     "code": "TPE",
>>>     "name": "Taipei (Taoyuan Intl.)",
>>>     "longName": "Taipei, Taiwan (TPE-Taoyuan Intl.)",
>>>     "city": "Taipei",
>>>     "localName": "Taoyuan Intl.",
>>>     "airportCityState": "Taipei, Taiwan"
>>> }
>>> 
>>> 
>>> Is it best practice to store just the code "TPE" and then look up the name 
>>> "Taipei (Taoyuan Intl.)" from a relational database? Any alternatives?
>>> 
>>>> On Sun, Apr 30, 2017 at 6:34 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> It depends on your queries, the data structure, etc. Generally flat is 
>>>> better, but if your query filter is on the highest level, you may get 
>>>> better performance with a nested structure. It really depends.
>>>> 
>>>> > On 30. Apr 2017, at 10:19, Zeming Yu <zemin...@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > We're building a Parquet-based data lake. I was under the impression 
>>>> > that flat files are more efficient than deeply nested files (say 3 or 4 
>>>> > levels down). Is that correct?
>>>> >
>>>> > Thanks,
>>>> > Zeming
>>> 
>>
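
For readers following the "store just the code and look up the name" question quoted above, here is a minimal, library-free sketch of that pattern. It is only an illustration under stated assumptions: the names `airports`, `flights`, `flight_id`, `airport_code` and `enrich` are hypothetical and do not come from the thread, and an in-memory dict stands in for whatever lookup store is actually used. In Spark the same shape would typically be a join between the large table and a small airport table.

```python
# Hypothetical dimension table: airport attributes keyed by airport code.
# The values are taken from the example record quoted in the thread.
airports = {
    "TPE": {
        "name": "Taipei (Taoyuan Intl.)",
        "city": "Taipei",
        "localName": "Taoyuan Intl.",
    },
}

# Hypothetical fact rows: the large data set keeps only the airport code.
flights = [
    {"flight_id": 1, "airport_code": "TPE"},
    {"flight_id": 2, "airport_code": "XXX"},  # code absent from the dimension
]


def enrich(rows, dimension):
    """Left-join each row to the dimension on its airport code.

    Rows whose code is not in the dimension are kept, with a None name.
    """
    enriched = []
    for row in rows:
        attrs = dimension.get(row["airport_code"], {})
        enriched.append({**row, "airport_name": attrs.get("name")})
    return enriched


result = enrich(flights, airports)
```

The trade-off is the usual one: keeping only the code avoids repeating the descriptive strings in every row, at the cost of a lookup at query time, while a flat, denormalized file repeats them but needs no join.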
