Using TABLESAMPLE(0.1) is actually way worse. Spark first spends 12 minutes
discovering all split files on all hosts (for some reason) before it even
starts the job, and then it creates 3.5 million tasks (the partition has
~32k split files).
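For context on why the sample still touches every split: TABLESAMPLE-style Bernoulli sampling decides per row whether to keep it, so every row of every split has to be scanned regardless of the fraction. A minimal pure-Python sketch of that per-row behavior (the function name and seed are illustrative, not Spark API):

```python
import random

def bernoulli_sample(rows, fraction, seed=42):
    """Keep each row independently with probability `fraction`.

    This mirrors what row-level sampling does: every input row is
    still read and tested, which is why it cannot skip splits.
    """
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

# Sampling 10% of 1M rows yields roughly 100k rows, but the
# comprehension above still iterated over all 1M inputs.
sample = bernoulli_sample(range(1_000_000), 0.10)
```

This is why a row-level sample over a 2.5T table is no cheaper to scan than reading the whole partition; only the output shrinks.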
On Wed, Aug 26, 2015 at 9:36 AM, Jörn Franke wrote:
Have you tried TABLESAMPLE? You'll find the exact syntax in the documentation,
but it does exactly what you want.
On Wed, Aug 26, 2015 at 6:12 PM, Thomas Dudziak wrote:
Sorry, I meant without reading from all splits. This is a single partition
in the table.
On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak wrote:
> I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from
> and I don't particularly care which rows. Doing a LIMIT unfortunately
> r