Using TABLESAMPLE(0.1) is actually way worse. Spark first spends 12 minutes
discovering all split files on all hosts (for some reason) before it even
starts the job, and then it creates 3.5 million tasks (the partition has
~32k split files).
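For context on why the sample still touches every split: TABLESAMPLE-style Bernoulli sampling decides per row whether to keep it, so every row of every split has to be scanned regardless of the fraction. A minimal pure-Python sketch of that per-row behavior (the function name and seed are illustrative, not Spark API):

```python
import random

def bernoulli_sample(rows, fraction, seed=42):
    """Keep each row independently with probability `fraction`.

    This mirrors what row-level sampling does: every input row is
    still read and tested, which is why it cannot skip splits.
    """
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

# Sampling 10% of 1M rows yields roughly 100k rows, but the
# comprehension above still iterated over all 1M inputs.
sample = bernoulli_sample(range(1_000_000), 0.10)
```

This is why a row-level sample over a 2.5T table is no cheaper to scan than reading the whole partition; only the output shrinks.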
On Wed, Aug 26, 2015 at 9:36 AM, Jörn Franke wrote:
Have you tried TABLESAMPLE? You'll find the exact syntax in the documentation,
but it does exactly what you want.
On Wed, Aug 26, 2015 at 6:12 PM, Thomas Dudziak wrote:
Sorry, I meant without reading from all splits. This is a single partition
in the table.
On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak wrote:
> I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from
> and I don't particularly care which rows. Doing a LIMIT unfortunately
> r