Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Using TABLESAMPLE(0.1) is actually way worse. Spark first spends 12 minutes discovering all split files on all hosts (for some reason) before it even starts the job, and then it creates 3.5 million tasks (the partition has ~32k split files).

Re: Efficient sampling from a Hive table

2015-08-26 Thread Jörn Franke
Have you tried TABLESAMPLE? You can find the exact syntax in the documentation, but it does exactly what you want.
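For reference, a sketch of the two Hive TABLESAMPLE forms being discussed. The table name and alias here are placeholders, not from the thread; the percent form samples blocks of input data, while the bucket form samples rows:

```python
# Two Hive TABLESAMPLE variants (table name "my_table" is a placeholder).
# Block/percent sampling: samples a percentage of the input data size.
percent_query = "SELECT * FROM my_table TABLESAMPLE(0.1 PERCENT) s"

# Bucket sampling on rand(): hashes each row into 10 buckets and keeps
# one, i.e. roughly 1/10 of the rows.
bucket_query = "SELECT * FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s"
```

Whether either avoids touching every split depends on the table layout (bucket sampling only prunes input when the table is bucketed on the sampled column).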

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Sorry, I meant without reading from all splits. This is a single partition in the table.

Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from, and I don't particularly care which rows. Doing a LIMIT unfortunately results in two stages, where the first stage reads the whole table and the second then performs the limit with a single worker, which is not very efficient.
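For context, a fraction-based sample avoids the single-worker LIMIT stage because each task independently keeps a random subset of its rows. A minimal sketch under assumptions (the table name is hypothetical, and the Spark calls are commented out since they need a live SQLContext; `DataFrame.sample` is the Spark 1.x API):

```python
# Compute the sampling fraction for ~100m rows out of ~1b.
target_rows = 100_000_000
total_rows = 1_000_000_000
fraction = target_rows / total_rows  # 0.1

# With a live SQLContext `sqlContext` (assumed), sample() is applied
# map-side in a single stage; the resulting row count is approximate
# (binomially distributed around fraction * total_rows):
# df = sqlContext.table("my_table")  # hypothetical table name
# sampled = df.sample(withReplacement=False, fraction=fraction, seed=42)
```

The trade-off is that you get approximately 100m rows rather than exactly that many, which matches the "don't particularly care which rows" requirement.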