Hi,
I have a Spark SQL performance issue. My code contains a simple JavaBean:
public class Person implements Externalizable {
    private int id;
    private String name;
    private double salary;
    ....................
}

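In case it matters, the elided part of the bean looks roughly like this (the no-arg constructor, getters, and field order in the read/write methods are illustrative, not the exact original code; Externalizable requires the public no-arg constructor and symmetric writeExternal/readExternal implementations):

```java
import java.io.*;

public class Person implements Externalizable {
    private int id;
    private String name;
    private double salary;

    // Externalizable requires a public no-arg constructor for deserialization.
    public Person() {}

    public Person(int id, String name, double salary) {
        this.id = id;
        this.name = name;
        this.salary = salary;
    }

    public int getId() { return id; }
    public String getName() { return name; }
    public double getSalary() { return salary; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Fields must be written and read back in the same order.
        out.writeInt(id);
        out.writeUTF(name);
        out.writeDouble(salary);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
        salary = in.readDouble();
    }
}
```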
I apply a schema to the RDD and register it as a table:
JavaRDD<Person> rdds = ...
rdds.cache();

DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
dataFrame.registerTempTable("person");

sqlContext.cacheTable("person");

Then I run a SQL query:
sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY AND salary <= XXX").collectAsList()

I launch a standalone cluster with 4 workers; each node runs on a machine with 8 CPUs and 15 GB of memory. When I run the query against an RDD containing 1,000,000 rows, it takes about 1 minute. Can somebody tell me how to tune the performance?