We're reading a Parquet file (550M records). We want to split it into two sets, live and dead, using a filter.

    DataSet a = read parquet
    DataSet live = a.filter(liveFilter)
    DataSet dead = a.filter(deadFilter)

is slower than

    DataSet a = read parquet
    DataSet live = a.filter(liveFilter)
    DataSet b = read parquet
    DataSet dead = b.filter(deadFilter)

Does this make sense? Why would reading it twice be quicker? We're using 1.1.2.

Billy Newport
Data Architecture, Goldman, Sachs & Co.
30 Hudson | 37th Floor | Jersey City, NJ
Tel: +1 (212) 8557773 | Cell: +1 (507) 254-0134
Email: billy.newp...@gs.com, KD2DKQ