What is also strange is that this seems to work on external JSON data, but not 
Parquet.  I’ll try to do more verification of that next week.
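
For reference, roughly the comparison I plan to run, as a minimal sketch: the Parquet table is the one from the original mail below, while `otto.webrequest_json` is just a placeholder name for our JSON-backed external table.

```scala
// Same predicate against the Parquet-backed table (the one that fails for me)
// and a JSON-backed external table, both through HiveContext.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

// Parquet-backed external table from the mail below -- this is the query that OOMs.
hc.sql("""SELECT uri_host, uri_path FROM otto.webrequest_few_partitions_big_data
          WHERE webrequest_source='misc' AND year=2015 AND month=5 AND day=20 AND hour=0
          LIMIT 4""").collect().foreach(println)

// JSON-backed external table (placeholder name) with a comparable partition layout.
hc.sql("""SELECT uri_host, uri_path FROM otto.webrequest_json
          WHERE webrequest_source='misc' AND year=2015 AND month=5 AND day=20 AND hour=0
          LIMIT 4""").collect().foreach(println)
```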


> On May 22, 2015, at 16:24, yana <yana.kadiy...@gmail.com> wrote:
> 
> There is an open Jira on Spark not pushing predicates to the metastore. I have a 
> large dataset with many partitions, but doing anything with it is very 
> slow... But I am surprised Spark 1.2 worked for you: it has this problem...
> 
> -------- Original message --------
> From: Andrew Otto
> Date: 05/22/2015 3:51 PM (GMT-05:00)
> To: user@spark.apache.org
> Cc: Joseph Allemandou, Madhumitha Viswanathan
> Subject: HiveContext fails when querying large external Parquet tables
> 
> Hi all,
> 
> (This email was easier to write in markdown, so I’ve created a gist with its 
> contents here: https://gist.github.com/ottomata/f91ea76cece97444e269.  I’ll 
> paste the markdown content in the email body here too.)
> 
> ---
> We’ve recently upgraded to CDH 5.4.0, which comes with Spark 1.3.0 and Hive 
> 1.1.0.  Previously we were on CDH 5.3.x, running Spark 1.2.0 and Hive 0.13.0. 
> Since upgrading, we can no longer query our large webrequest dataset using 
> HiveContext.  HiveContext with Parquet and other file types works fine with 
> our other external tables (we have a similarly large JSON external table that 
> works just fine with HiveContext).
> 
> Our webrequest dataset is stored in hourly partitioned Parquet files.  We 
> mainly interact with this dataset via a Hive external table, but also have 
> been using Spark's HiveContext.
> 
> ```
> # This single hourly directory is only 5.3M
> $ hdfs dfs -du -s -h /wmf/data/wmf/webrequest/webrequest_source=misc/year=2015/month=5/day=20/hour=0
> 5.3 M  15.8 M  /wmf/data/wmf/webrequest/webrequest_source=misc/year=2015/month=5/day=20/hour=0
> 
> # This monthly directory is 1.8T.  (There are subdirectories down to hourly level here too.)
> $ hdfs dfs -du -s -h /wmf/data/wmf/webrequest/webrequest_source=bits/year=2015/month=5
> 1.8 T  5.3 T  /wmf/data/wmf/webrequest/webrequest_source=bits/year=2015/month=5
> ```
> 
> If I create a Hive table on top of this data and add the single hourly 
> partition, querying works via both Hive and Spark HiveContext:
> 
> ```sql
> hive (otto)> CREATE EXTERNAL TABLE IF NOT EXISTS 
> `otto.webrequest_few_partitions_big_data`(
>     `hostname`          string  COMMENT 'Source node hostname',
>     `sequence`          bigint  COMMENT 'Per host sequence number',
>     `dt`                string  COMMENT 'Timestamp at cache in ISO 8601',
>     `time_firstbyte`    double  COMMENT 'Time to first byte',
>     `ip`                string  COMMENT 'IP of packet at cache',
>     `cache_status`      string  COMMENT 'Cache status',
>     `http_status`       string  COMMENT 'HTTP status of response',
>     `response_size`     bigint  COMMENT 'Response size',
>     `http_method`       string  COMMENT 'HTTP method of request',
>     `uri_host`          string  COMMENT 'Host of request',
>     `uri_path`          string  COMMENT 'Path of request',
>     `uri_query`         string  COMMENT 'Query of request',
>     `content_type`      string  COMMENT 'Content-Type header of response',
>     `referer`           string  COMMENT 'Referer header of request',
>     `x_forwarded_for`   string  COMMENT 'X-Forwarded-For header of request',
>     `user_agent`        string  COMMENT 'User-Agent header of request',
>     `accept_language`   string  COMMENT 'Accept-Language header of request',
>     `x_analytics`       string  COMMENT 'X-Analytics header of response',
>     `range`             string  COMMENT 'Range header of response',
>     `is_pageview`       boolean COMMENT 'Indicates if this record was marked 
> as a pageview during refinement',
>     `record_version`    string  COMMENT 'Keeps track of changes in the table 
> content definition - 
> https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest',
>     `client_ip`         string  COMMENT 'Client IP computed during refinement 
> using ip and x_forwarded_for',
>     `geocoded_data`     map<string, string>  COMMENT 'Geocoded map with 
> continent, country_code, country, city, subdivision, postal_code, latitude, 
> longitude, timezone keys  and associated values.',
>     `x_cache`           string  COMMENT 'X-Cache header of response',
>     `user_agent_map`    map<string, string>  COMMENT 'User-agent map with 
> browser_name, browser_major, device, os_name, os_minor, os_major keys and 
> associated values',
>     `x_analytics_map`   map<string, string>  COMMENT 'X_analytics map view of 
> the x_analytics field',
>     `ts`                timestamp            COMMENT 'Unix timestamp in 
> milliseconds extracted from dt',
>     `access_method`     string  COMMENT 'Method used to access the site 
> (mobile app|mobile web|desktop)',
>     `agent_type`        string  COMMENT 'Categorise the agent making the 
> webrequest as either user or spider (automata to be added).',
>     `is_zero`           boolean COMMENT 'Indicates if the webrequest is 
> accessed through a zero provider',
>     `referer_class`     string  COMMENT 'Indicates if a referer is internal, 
> external or unknown.'
> )
> PARTITIONED BY (
>     `webrequest_source` string  COMMENT 'Source cluster',
>     `year`              int     COMMENT 'Unpadded year of request',
>     `month`             int     COMMENT 'Unpadded month of request',
>     `day`               int     COMMENT 'Unpadded day of request',
>     `hour`              int     COMMENT 'Unpadded hour of request'
> )
> CLUSTERED BY(hostname, sequence) INTO 64 BUCKETS
> STORED AS PARQUET
> LOCATION '/wmf/data/wmf/webrequest'
> ;
> 
> hive (otto)> alter table otto.webrequest_few_partitions_big_data add 
> partition (webrequest_source='misc', year=2015, month=5, day=20, hour=0) 
> location 
> '/wmf/data/wmf/webrequest/webrequest_source=misc/year=2015/month=5/day=20/hour=0';
> 
> hive (otto)> SELECT uri_host, uri_path from 
> otto.webrequest_few_partitions_big_data where webrequest_source='misc' and 
> year=2015 and month=5 and day=20 and hour=0 limit 4;
> OK
> uri_host      uri_path
> graphite.wikimedia.org       /render
> graphite.wikimedia.org       /render
> stats.wikimedia.org     /EN/PlotTotalArticlesBS.png
> etherpad.wikimedia.org       /socket.io/1/xhr-polling/nuoaggODBKYWJ4sCvfuZ
> ```
> 
> ```scala
> $ spark-shell
> 
> scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
> 
> scala> val query = "SELECT uri_host, uri_path from 
> otto.webrequest_few_partitions_big_data where webrequest_source='misc' and 
> year=2015 and month=5 and day=20 and hour=0 limit 4"
> 
> scala> hc.sql(query).collect().foreach(println)
> [graphite.wikimedia.org,/render]
> [graphite.wikimedia.org,/render]
> [stats.wikimedia.org,/EN/PlotTotalArticlesBS.png]
> [etherpad.wikimedia.org,/socket.io/1/xhr-polling/nuoaggODBKYWJ4sCvfuZ]
> 
> ```
> 
> But when I add more data, Spark either OOMs, or results in a Premature EOF (if 
> I bump up memory).  I've tried this with many hourly partitions, and with just 
> two partitions where one contains lots of data; in either case the same total 
> amount of data was present.  For this example I'll just add one large 
> partition to the table.
> 
> ```sql
> -- Of course day=0, hour=25 doesn't make sense, but I'm including all
> -- of May in this one partition, so I just chose dummy partition keys
> hive (otto)> alter table otto.webrequest_few_partitions_big_data add 
> partition (webrequest_source='bits', year=2015, month=5, day=0, hour=25) 
> location '/wmf/data/wmf/webrequest/webrequest_source=bits/year=2015/month=5';
> 
> hive (otto)> show partitions otto.webrequest_few_partitions_big_data;
> OK
> partition
> webrequest_source=bits/year=2015/month=5/day=0/hour=25
> webrequest_source=misc/year=2015/month=5/day=20/hour=0
> 
> hive (otto)> SELECT uri_host, uri_path from 
> otto.webrequest_few_partitions_big_data where webrequest_source='misc' and 
> year=2015 and month=5 and day=20 and hour=0 limit 4;
> OK
> uri_host      uri_path
> graphite.wikimedia.org       /render
> graphite.wikimedia.org       /render
> stats.wikimedia.org     /EN/PlotTotalArticlesBS.png
> etherpad.wikimedia.org       /socket.io/1/xhr-polling/nuoaggODBKYWJ4sCvfuZ
> ```
> 
> Now, let's try the same query we did before in spark-shell.
> 
> ```scala
> $ spark-shell
> 
> Spark context available as sc.
> 15/05/22 19:04:25 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 
> scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
> hc: org.apache.spark.sql.hive.HiveContext = 
> org.apache.spark.sql.hive.HiveContext@38fe4d15
> 
> scala> val query = "SELECT uri_host, uri_path from 
> otto.webrequest_few_partitions_big_data where webrequest_source='misc' and 
> year=2015 and month=5 and day=20 and hour=0 limit 4"
> query: String = SELECT uri_host, uri_path from 
> otto.webrequest_few_partitions_big_data where webrequest_source='misc' and 
> year=2015 and month=5 and day=20 and hour=0 limit 4
> 
> scala> hc.sql(query).collect().foreach(println)
> 15/05/22 19:04:42 INFO metastore: Trying to connect to metastore with URI 
> thrift://analytics1027.eqiad.wmnet:9083
> 15/05/22 19:04:42 INFO metastore: Connected to metastore.
> 15/05/22 19:04:43 INFO SessionState: Created local directory: 
> /tmp/fcb5ac90-78c7-4368-adac-085d7935fd6b_resources
> 15/05/22 19:04:43 INFO SessionState: Created HDFS directory: 
> /tmp/hive/otto/fcb5ac90-78c7-4368-adac-085d7935fd6b
> 15/05/22 19:04:43 INFO SessionState: Created local directory: 
> /tmp/otto/fcb5ac90-78c7-4368-adac-085d7935fd6b
> 15/05/22 19:04:43 INFO SessionState: Created HDFS directory: 
> /tmp/hive/otto/fcb5ac90-78c7-4368-adac-085d7935fd6b/_tmp_space.db
> 15/05/22 19:04:43 INFO SessionState: No Tez session required at this point. 
> hive.execution.engine=mr.
> 15/05/22 19:04:44 INFO ParseDriver: Parsing command: SELECT uri_host, 
> uri_path from otto.webrequest_few_partitions_big_data where 
> webrequest_source='misc' and year=2015 and month=5 and day=20 and hour=0 
> limit 4
> 15/05/22 19:04:44 INFO ParseDriver: Parse Completed
> 
> ### Many minutes pass... ###
> 
> Exception in thread "qtp454555988-129" Exception in thread "IPC Client 
> (1610773823) connection to analytics1001.eqiad.wmnet/10.64.36.118:8020 from 
> otto" Exception in thread "qtp454555988-52"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 15/05/22 18:49:15 INFO RetryInvocationHandler: Exception while invoking 
> getBlockLocations of class ClientNamenodeProtocolTranslatorPB over 
> analytics1001.eqiad.wmnet/10.64.36.118:8020. Trying to fail over immediately.
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>       at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
>       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>       at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1211)
>       at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
>       at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1191)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:299)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:265)
>       at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:257)
>       at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1490)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:302)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298)
>       at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
>       at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>       at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:411)
>       at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
>       at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
>       at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
>       at 
> scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>       at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>       at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>       at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>       at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>       at 
> scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>       at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.popAndExecAll(ForkJoinPool.java:1243)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1344)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: 
> GC overhead limit exceeded
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:274)
>       at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:254)
>       ... 36 more
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> ```
> 
> Note, I have also seen slightly different OOM errors on the CLI.   Sometimes 
> I see:
> ```
> 15/05/22 19:09:22 ERROR ActorSystemImpl: exception on LARS’ timer thread
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 15/05/22 19:09:29 INFO ActorSystemImpl: starting new LARS thread
> 15/05/22 19:09:29 ERROR ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-scheduler-1] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> Exception in thread "Driver Heartbeater" java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> 
> ...
> ```
> 
> 
> Ok, so let's increase the driver memory:
> 
> ```scala
> $ spark-shell --driver-memory 1500M
> Spark context available as sc.
> 15/05/22 19:09:07 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 
> scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
> 
> scala> val query = "SELECT uri_host, uri_path from 
> otto.webrequest_few_partitions_big_data where webrequest_source='misc' and 
> year=2015 and month=5 and day=20 and hour=0 limit 4"
> 
> scala> hc.sql(query).collect().foreach(println)
> 15/05/22 19:11:41 INFO metastore: Trying to connect to metastore with URI 
> thrift://analytics1027.eqiad.wmnet:9083
> 15/05/22 19:11:41 INFO metastore: Connected to metastore.
> 15/05/22 19:11:42 INFO SessionState: Created local directory: 
> /tmp/5f8996fc-ce1a-4954-8c6c-94c291aa8c70_resources
> 15/05/22 19:11:42 INFO SessionState: Created HDFS directory: 
> /tmp/hive/otto/5f8996fc-ce1a-4954-8c6c-94c291aa8c70
> 15/05/22 19:11:42 INFO SessionState: Created local directory: 
> /tmp/otto/5f8996fc-ce1a-4954-8c6c-94c291aa8c70
> 15/05/22 19:11:42 INFO SessionState: Created HDFS directory: 
> /tmp/hive/otto/5f8996fc-ce1a-4954-8c6c-94c291aa8c70/_tmp_space.db
> 15/05/22 19:11:42 INFO SessionState: No Tez session required at this point. 
> hive.execution.engine=mr.
> 15/05/22 19:11:43 INFO ParseDriver: Parsing command: SELECT uri_host, 
> uri_path from otto.webrequest_few_partitions_big_data where 
> webrequest_source='misc' and year=2015 and month=5 and day=20 and hour=0 
> limit 4
> 15/05/22 19:11:44 INFO ParseDriver: Parse Completed
> 
> ### Almost 30 minutes pass... ###
> 
> java.io.EOFException: Premature EOF: no length prefix available
>       at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2212)
>       at 
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:408)
>       at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:796)
>       at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:674)
>       at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:621)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:847)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:897)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:700)
>       at java.io.FilterInputStream.read(FilterInputStream.java:83)
>       at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:66)
>       at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:423)
>       at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
>       at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
>       at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
>       at 
> scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>       at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>       at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>       at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>       at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>       at 
> scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>       at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runSubtask(ForkJoinPool.java:1357)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.tryHelpStealer(ForkJoinPool.java:2253)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.awaitJoin(ForkJoinPool.java:2377)
>       at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
>       at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
>       at 
> scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:444)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:514)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:187)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>       at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>       at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> ```
> 
> I've tested this both in local mode and in YARN client mode, and both show 
> similar behavior.  What's worrisome is that the behavior changes after adding 
> more data to the table, even though I am querying the same very small 
> partition.  The whole point of Hive partitions is to allow jobs to work with 
> only the data they need.  I'm not sure what Spark HiveContext is doing here, 
> but it seems to couple the full size of a Hive table to the performance of a 
> query that only needs a very small amount of data.
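> 
> As one more data point I can gather, here is a minimal sketch of reading the single 
> hourly partition's Parquet directory directly with SQLContext.parquetFile, bypassing 
> the Hive table entirely.  The path is the real partition path from above; the temp 
> table name is just illustrative.
> 
> ```scala
> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
> 
> // Read one hourly Parquet partition directly, without going through the Hive table.
> val hour0 = hc.parquetFile(
>   "/wmf/data/wmf/webrequest/webrequest_source=misc/year=2015/month=5/day=20/hour=0")
> 
> // Expose it as a temporary table and run the same projection + limit.
> hour0.registerTempTable("webrequest_misc_hour0")
> hc.sql("SELECT uri_host, uri_path FROM webrequest_misc_hour0 LIMIT 4").collect().foreach(println)
> ```
> 
> If that stays fast no matter how much data the Hive table points at, it would suggest 
> the cost is in how HiveContext handles the table's metadata rather than in reading the 
> partition itself.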
> 
> I poked around the Spark source, and for a minute thought this might be 
> related: https://github.com/apache/spark/commit/42389b17, but that change was 
> included in Spark 1.2.0, and 1.2.0 was working fine for us.
> 
> Is HiveContext somehow trying to scan the whole table in the driver?  Has 
> anyone else had this problem?
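> 
> For what it's worth, both stack traces end up in 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh reading Parquet 
> footers, which, if I understand it correctly, is the code path used when Spark 
> converts metastore Parquet tables to its native Parquet support.  A small experiment 
> I may try (sketch only, unverified) is turning that conversion off and re-running the 
> same query:
> 
> ```scala
> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
> 
> // Fall back to Hive's Parquet SerDe instead of Spark's native Parquet relation,
> // just to narrow down where the time and memory are going.
> hc.setConf("spark.sql.hive.convertMetastoreParquet", "false")
> 
> hc.sql("""SELECT uri_host, uri_path
>           FROM otto.webrequest_few_partitions_big_data
>           WHERE webrequest_source='misc' AND year=2015 AND month=5 AND day=20 AND hour=0
>           LIMIT 4""").collect().foreach(println)
> ```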
> 
> Thanks!
> 
> -Andrew Otto
> 
