Schema Evolution for nested Dataset[T]

2017-04-30 Thread Mike Wheeler
Hi Spark Users, Suppose I have some data (stored in parquet for example) generated as below: package com.company.entity.old case class Course(id: Int, students: List[Student]) case class Student(name: String) Then usually I can access the data by spark.read.parquet("data.parquet").as[Course] N

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
Can you give more details on the schema? Is it 6 TB just airport information as below? > On 30. Apr 2017, at 23:08, Zeming Yu wrote: > > I thought relational databases with 6 TB of data can be quite expensive? > >> On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote: >> I am not sure if parquet

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
You have to find out how the user filters: by code? By airport name? Then you can choose the right structure. In the scenario below, though, ORC with bloom filters may have some advantages. It is crucial that, when inserting the data, you sort it on the columns your user wants to filter on. E.g. If f

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
I thought relational databases with 6 TB of data can be quite expensive? On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote: > I am not sure if parquet is a good fit for this? This seems more like > filter lookup than an aggregate like query. I am curious to see what others > have to say. > Would i

examples of dealing with nested parquet/ dataframe file

2017-04-30 Thread Zeming Yu
Hi, I'm still trying to decide whether to store my data as deeply nested or flat parquet file. The main reason for storing the nested file is it stores data in its raw format, no information loss. I have two questions: 1. Is it always necessary to flatten a nested dataframe for the purpose of b

Re: Recommended cluster parameters

2017-04-30 Thread Zeming Yu
I've got a similar question. Would you be able to provide some rough guide (even a range is fine) on the number of nodes, cores, and total amount of RAM required? Do you want to store 1 TB, 1 PB or far more? - say 6 TB of data in parquet format on s3 Do you want to just read that data, retrieve

Re: Recommended cluster parameters

2017-04-30 Thread yohann jardin
It really depends on your needs and your data. Do you want to store 1 TB, 1 PB or far more? Do you want to just read that data, retrieve it then do little work on it and then read it, have a complex machine learning pipeline? Depending on the workload, the ratio between cores and storage will

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Another question: I need to store airport info in a parquet file and present it when a user makes a query. For example: "airport": { "code": "TPE", "name": "Taipei (Taoyuan Intl.)",
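For readers following the flat-vs-nested question, here is a minimal sketch of the two layouts for the airport record quoted above. Only the `code` and `name` fields come from the message; the `Record` wrapper, its `id` field, the flattened column names and the local SparkSession are assumptions made for illustration, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Only `code` and `name` appear in the original message.
case class Airport(code: String, name: String)
// Hypothetical wrapper record, added only so the struct has a parent row.
case class Record(id: Int, airport: Airport)

object NestedVsFlatSketch {
  // Promote the nested struct fields to top-level columns (the "flat" layout).
  def flatten(nested: DataFrame): DataFrame =
    nested.select(
      col("id"),
      col("airport.code").as("airport_code"),
      col("airport.name").as("airport_name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("nested-vs-flat-sketch")
      .getOrCreate()
    import spark.implicits._

    val nested = Seq(Record(1, Airport("TPE", "Taipei (Taoyuan Intl.)"))).toDF()

    // A nested field can be filtered directly with a dotted path.
    nested.filter(col("airport.code") === "TPE").show(false)

    // The same data with the struct flattened into plain columns.
    flatten(nested).printSchema()

    spark.stop()
  }
}
```

Either layout can be written with `write.parquet`; which one performs better depends on how the data is filtered, as the replies in this thread point out.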

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
It depends on your queries, the data structure, etc. Generally flat is better, but if your query filter is on the highest level then you may have better performance with a nested structure. It really depends. > On 30. Apr 2017, at 10:19, Zeming Yu wrote: > > Hi, > > We're building a parquet ba

Recommended cluster parameters

2017-04-30 Thread rakesh sharma
Hi, I would like to know the details of implementing a cluster: what kind of machines one would require, how many nodes, number of cores, etc. Thanks, Rakesh

parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Hi, We're building a parquet based data lake. I was under the impression that flat files are more efficient than deeply nested files (say 3 or 4 levels down). Is that correct? Thanks, Zeming

Spark repartition question...

2017-04-30 Thread Muthu Jayakumar
Hello there, I am trying to understand the difference between the following repartition()... a. def repartition(partitionExprs: Column*): Dataset[T] b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] c. def repartition(numPartitions: Int): Dataset[T] My understanding is th
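As a rough illustration of the three overloads listed above, the sketch below calls each one and reports the resulting partition count. It assumes Spark SQL on the classpath and a local SparkSession; the object name, the helper `partitionCounts`, the sample data from `spark.range` and the target of 8 partitions are all made up for the example. To the best of my knowledge, (a) and (b) hash-partition rows by the given expressions, with (a) taking its partition count from `spark.sql.shuffle.partitions`, while (c) only sets the number of partitions and spreads rows evenly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RepartitionSketch {
  // Returns the partition counts produced by overloads (a), (b) and (c).
  def partitionCounts(spark: SparkSession): (Int, Int, Int) = {
    val ds = spark.range(100) // hypothetical sample data: one column named "id"

    // (a) partition by expression; the count is taken from
    //     spark.sql.shuffle.partitions (adaptive execution may coalesce it).
    val a = ds.repartition(col("id")).rdd.getNumPartitions

    // (b) partition by expression into an explicit number of partitions.
    val b = ds.repartition(8, col("id")).rdd.getNumPartitions

    // (c) explicit number of partitions, no partitioning expression.
    val c = ds.repartition(8).rdd.getNumPartitions

    (a, b, c)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-sketch")
      .getOrCreate()

    val (a, b, c) = partitionCounts(spark)
    println(s"(a) by expression only: $a partitions")
    println(s"(b) by expression, explicit count: $b partitions")
    println(s"(c) explicit count only: $c partitions")

    spark.stop()
  }
}
```

With (a) and (b), rows that share the same value of the partitioning expression land in the same partition, which matters for later joins or aggregations on that column; (c) gives no such guarantee.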