We're reading a Parquet file (550 million records).

We want to split the Parquet data into two sets, live and dead, using a filter.

DataSet a = read parquet
DataSet live = a.filter(liveFilter)
DataSet dead = a.filter(deadFilter)

is slower than

DataSet a = read parquet
DataSet live = a.filter(liveFilter)
DataSet b = read parquet
DataSet dead = b.filter(deadFilter)

Does this make sense? Why would reading the file twice be quicker than reading it once and filtering it twice? We're using version 1.1.2.


Billy Newport
Data Architecture, Goldman, Sachs & Co.
30 Hudson | 37th Floor | Jersey City, NJ
Tel:  +1 (212) 8557773 |  Cell:  +1 (507) 254-0134
Email: billy.newp...@gs.com, KD2DKQ
