To follow on: I asked the developer how we incrementally load data, and the response was:
no; we union only the updated records (every night). For the export that runs every few minutes, the algorithm is:
1. upload the file to Hadoop;
2. load data inpath ... overwrite into table ..._incremental;
3. insert into table ..._cached from ..._incremental.
Perhaps this helps in understanding our issue.
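For concreteness, here is a rough sketch of what steps 2 and 3 might look like as Shark/HiveQL statements. The cached table name customers_cached is taken from the query in the quoted message below; the staging table customers_incremental and the HDFS path are illustrative assumptions, not our actual names:

-- step 1 happens outside Shark: the nightly/minute export file is put on HDFS
-- (e.g. hadoop fs -put export.csv /path/on/hdfs -- path is illustrative)

-- step 2: replace the contents of the staging table with the newly uploaded file
LOAD DATA INPATH '/path/on/hdfs/export.csv' OVERWRITE INTO TABLE customers_incremental;

-- step 3: append the staged rows onto the cached table
INSERT INTO TABLE customers_cached SELECT * FROM customers_incremental;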
On Thursday, November 20, 2014, Gordon Benjamin <gordon.benjami...@gmail.com> wrote:
> Hi,
>
> We are seeing bad performance as we incrementally load data. Here is the
> config:
>
> Spark standalone cluster
>
> spark01 (spark master, shark, hadoop namenode): 15GB RAM, 4 vCPUs
> spark02 (spark worker, hadoop datanode): 15GB RAM, 8 vCPUs
> spark03 (spark worker): 15GB RAM, 8 vCPUs
> spark04 (spark worker): 15GB RAM, 8 vCPUs
>
> Spark worker configuration:
> spark.local.dir=/path/to/ssd/disk
> spark.default.parallelism=64
> spark.executor.memory=10g
> spark.serializer=org.apache.spark.serializer.KryoSerializer
>
> Shark configuration:
> spark.kryoserializer.buffer.mb=64
> mapred.reduce.tasks=30
> spark.scheduler.mode=FAIR
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.default.parallelism=64
>
> Performance decreases as more data is loaded into Spark. A simple query like this:
>
> select count(*) from customers_cached
>
> took 0.5 seconds on 12th Nov and takes 4.24 seconds now.
>
> We also have these warnings all over the log:
>
> 2014-11-20 16:56:42,125 WARN parse.TypeCheckProcFactory (TypeCheckProcFactory.java:convert(180)) - Invalid type entry TOK_INT=null
> 2014-11-20 16:56:51,988 WARN parse.TypeCheckProcFactory (TypeCheckProcFactory.java:convert(180)) - Invalid type entry TOK_TABLE_OR_COL=null
>
> Does anyone have any ideas to help us resolve this? We can post up anything you need.