Have you test this on a smaller set to verify that the query is correct? On Mon, Jun 22, 2015 at 2:59 PM, ayan guha <[email protected]> wrote:
> You may also want to change count(*) to specific column. > On 23 Jun 2015 01:29, "James Aley" <[email protected]> wrote: > >> Hello, >> >> A colleague of mine ran the following Spark SQL query: >> >> select >> count(*) as uses, >> count (distinct cast(id as string)) as users >> from usage_events >> where >> from_unixtime(cast(timestamp_millis/1000 as bigint)) >> between '2015-06-09' and '2015-06-16' >> >> The table contains billions of rows, but totals only 64GB of data across >> ~30 separate files, which are stored as Parquet with LZO compression in S3. >> >> From the referenced columns: >> >> * id is Binary, which we cast to a String so that we can DISTINCT by it. >> (I was already told this will improve in a later release, in a separate >> thread.) >> * timestamp_millis is a long, containing a unix timestamp with >> millisecond resolution >> >> This took nearly 2 hours to run on a 5 node cluster of r3.xlarge EC2 >> instances, using 20 executors, each with 4GB memory. I can see from >> monitoring tools that the CPU usage is at 100% on all nodes, but incoming >> network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound. >> >> Does that seem slow? Can anyone offer any ideas by glancing at the query >> as to why this might be slow? We'll profile it meanwhile and post back if >> we find anything ourselves. >> >> A side issue - I've found that this query, and others, sometimes >> completes but doesn't return any results. There appears to be no error that >> I can see in the logs, and Spark reports the job as successful, but the >> connected JDBC client (SQLWorkbenchJ in this case), just sits there forever >> waiting. I did a quick Google and couldn't find anyone else having similar >> issues. >> >> >> Many thanks, >> >> James. >> >
