thanks for the information. Up-to-date hive. Cluster on the smallish side. And, well, it sure looks like a memory issue :) rather than an inherent hive limitation, that is.
So. I can only speak as a user (ie. not a hive developer), but what i'd be interested in knowing next is: is this via running hive in local mode (eg. not through hiveserver1/2)? And it looks like it boinks on array processing, which i assume to be internal code arrays and not hive data arrays - your 15K columns are all scalar/simple types, correct? It's clearly fetching results and looks to be trying to store them in a java array - and not just one row but a *set* of rows (ArrayList).

A few things to try:

1. boost the heap-size. try 8192. And I don't know if HADOOP_HEAPSIZE is the controller of that - i woulda hoped it was called something like "HIVE_HEAPSIZE". :) Anyway, can't hurt to try.

2. trim down the number of columns and see where the breaking point is. is it 10K? is it 5K? The idea is to confirm it's _the number of columns_ that is causing the memory to blow and not some other artifact unbeknownst to us.

3. Google around the Hive namespace for something that might limit or otherwise control the number of rows stored at once in Hive's internal buffer. I'll snoop around too.

That's all i got for now, and maybe we'll get lucky and someone on this list will know something or other about this. :)

cheers,
Stephen.
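p.s. for #1, what i had in mind is roughly the below in $HIVE_HOME/conf/hive-env.sh. I'm going off hive-env.sh.template here and i'm not 100% sure HADOOP_HEAPSIZE is the right knob for hiveserver2 in your setup, so treat it as a sketch and double-check the variable name against your install:

    # hive-env.sh (variable name per hive-env.sh.template -- verify against your install)
    # heap size, in MB, for the JVMs launched by the hive scripts (CLI and hiveserver2)
    export HADOOP_HEAPSIZE=8192

remember to restart hiveserver2 afterwards since the heap is fixed at JVM startup; you can sanity-check it took effect by looking for the -Xmx value on the hiveserver2 process (ps -ef | grep HiveServer2).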
On Thu, Jan 30, 2014 at 2:32 AM, David Gayou <david.ga...@kxen.com> wrote:

> We are using Hive 0.12.0, but it doesn't work any better on hive 0.11.0 or hive 0.10.0.
> Our hadoop version is 1.1.2.
> Our cluster is 1 master + 4 slaves, each with 1 dual core xeon CPU (with hyperthreading, so 4 cores per machine) and 16Gb RAM.
>
> The error message i get is:
>
> 2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction (ProcessFunction.java:process(41)) - Internal error processing FetchResults
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2734)
>     at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>     at java.util.ArrayList.add(ArrayList.java:351)
>     at org.apache.hive.service.cli.Row.<init>(Row.java:47)
>     at org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
>     at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
>     at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
>     at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
>     at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
>     at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
>     at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
>     at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
>     at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>     at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
>     at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
>     at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
>     at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:662)
>
> My HADOOP_HEAPSIZE is set to 4096 in hive-env.sh.
>
> We are doing some machine learning on a row by row basis on those datasets, so basically the more columns we have the better.
>
> We are coming from the SQL world, and Hive is the closest to SQL syntax. We'd like to keep some SQL manipulation on the data.
>
> Thanks for the help,
>
> Regards,
>
> David Gayou
>
> On Tue, Jan 28, 2014 at 8:35 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> there's always a use case out there that stretches the imagination, isn't there? gotta love it.
>>
>> first things first. can you share the error message? the hive version? and the number of nodes in your cluster?
>>
>> then a couple of things come to my mind. Might you consider pivoting the data such that you represent one row of 15K columns as 15K rows of, say, 3 columns (id, column_name, column_value) before you even load it into hive?
>>
>> the other thing is when i hear 15K columns the first thing i think of is HBase (their motto is millions of columns and billions of rows).
>>
>> Anyway, let's see what you got for the first question! :)
>>
>> cheers,
>> Stephen.
>>
>> On Tue, Jan 28, 2014 at 3:20 AM, David Gayou <david.ga...@kxen.com> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to test Hive with tables that include quite a lot of columns.
>>>
>>> We are using the data from the KDD Cup 2009, which is based on an anonymised real-case dataset:
>>> http://www.sigkdd.org/kdd-cup-2009-customer-relationship-prediction
>>>
>>> The aim is to be able to create and manipulate a table with 15,000 columns.
>>>
>>> We were actually able to create the table and to load data inside it.
>>> You can find the create statement inside the attached file.
>>> The data file is pretty big, but i can share it if anyone wants it.
>>>
>>> The statement
>>> SELECT * FROM orange_large_train_3 LIMIT 1000
>>> is working fine, but
>>> SELECT * FROM orange_large_train_3
>>> doesn't work.
>>>
>>> We have tried several options for creating the table, including the ColumnarSerde row format, but couldn't make it work.
>>>
>>> Does any of you have a server configuration or storage format to use when creating the table in order to make it work with such a number of columns?
>>>
>>> Regards,
>>>
>>> David Gayou
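p.p.s. re: the pivot idea from my earlier mail quoted above - roughly what i had in mind is the narrow layout below. Table and column names are just made up for illustration (i don't know your actual schema), so treat it as a sketch:

    -- one row per (id, column) pair instead of one 15K-column row
    CREATE TABLE orange_large_train_long (
      row_id       BIGINT,
      column_name  STRING,
      column_value STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- pulling back one logical row of the original wide table (~15K narrow rows)
    SELECT column_name, column_value
    FROM orange_large_train_long
    WHERE row_id = 12345;

the trade-off is that each original row becomes ~15K rows and any per-row math needs a group by row_id, but it sidesteps the very wide rows entirely.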