Re: Difference between RC file format & Parquet file format

2016-02-18 Thread Koert Kuipers
ORC was created inside hive instead of (as it should have been, i think) as a file format library that hive can depend on, and other frameworks as well. it seems to be part of hive's annoying tendency not to think of itself as a java library. On Thu, Feb 18, 2016 at 2:38 AM, Abhishek Dubey wrote:

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
r the hive book I highlighted many smart capable organizations use hive. > > Your argument is totally valid. You like X better because X works for you. > You don't need to 'preach' here, we all know hive has its limits. > > On Thu, Feb 4, 2016 at 10:55 AM, Koert

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
; The reality is that once you start factoring in the numerous tuning > parameters of the systems and jobs there probably isn't a clear answer. > For some queries, the Catalyst optimizer may do a better job...is it going > to do a better job with ORC based data? less likely IMO. > >

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Koert Kuipers
t seems like hive > and tez made spark say uncle... > > https://www.slideshare.net/mobile/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final > > > On Wednesday, February 3, 2016, Koert Kuipers wrote: > >> ok i am sure there is some way to do it. i am going to guess s

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
available at all. so yeah yeah i am sure it can be made to *work*. just like you can get a nail into a wall with a screwdriver if you really want. On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers wrote: > yeah but have you ever seen someone write a real analytical program in > hive? how?

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
you are not boxed in by a long shot. > > > On Tuesday, February 2, 2016, Koert Kuipers wrote: > >> uuuhm with spark using Hive metastore you actually have a real >> programming environment and you can write real functions, versus just being >> boxed into some version of

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
uuuhm with spark using Hive metastore you actually have a real programming environment and you can write real functions, versus just being boxed into some version of sql and limited udfs? On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang wrote: > When comparing the performance, you need to do it apple

Re: Which [open-source] SQL engine atop Hadoop?

2015-02-01 Thread Koert Kuipers
touch them very much. Usually if they do change it >> is something small and if you tie the commit to a jira you can figure out >> what and why. >> >> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers wrote: >> >>> seems the metastore thrift service supports SASL. t

Re: Which [open-source] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
version deployed. in that case i admit its not as bad as i thought. lets see! On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers wrote: > oh sorry edward, i misread your post. seems we agree that "SQL constructs > inside hive" are not for other systems. > > On Sat, Jan 31, 2015 at

Re: Which [open-source] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
oh sorry edward, i misread your post. seems we agree that "SQL constructs inside hive" are not for other systems. On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers wrote: > edward, > i would not call "SQL constructs inside hive" accessible for other > systems. its in

Re: Which [open-source] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
521269638&ref=pd_sl_4yiryvbf8k_e > there are even examples where I show how to iterate all the tables inside > the database from a java client. > > On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers wrote: > >> yes you can run whatever you like with the data in hdfs. keep in

Re: Which [open-source] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
by HBase) will be needed for that. On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers wrote: > yes you can run whatever you like with the data in hdfs. keep in mind that > hive makes this general access pattern just a little harder, since hive has > a tendency to store data and metadata s

Re: Which [open-source] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
ver paradigm they like? > > E.g.: GraphX, Mahout &etc. > > Also, what about Tajo or Drill? > > Best, > > Samuel Marks > http://linkedin.com/in/samuelmarks > > PS: Spark-SQL is read-only IIRC, right? > On 31 Jan 2015 03:39, "Koert Kuipers" wrote: >

Re: Which [open-source] SQL engine atop Hadoop?

2015-01-30 Thread Koert Kuipers
since you require high-powered analytics, and i assume you want to stay sane while doing so, you require the ability to "drop out of sql" when needed. so spark-sql and lingual would be my choices. low latency indicates phoenix or spark-sql to me. so i would say spark-sql On Fri, Jan 30, 2015 at

SerDe loading external scheme

2012-04-05 Thread Koert Kuipers
I am working on a hive SerDe where both SerDe and RecordReader need to have access to an external resource with information. This external resource could be on hdfs, in hbase, or on an http server. This situation is very similar to what haivvreo does. The way i go about it right now is that i store

SerDe and InputFormat

2012-02-21 Thread Koert Kuipers
I make changes to the Configuration in my SerDe expecting those to be passed to the InputFormat (and OutputFormat). Yet the InputFormat seems to get an unchanged JobConf? Is this a known limitation? I find it very confusing since the Configuration is the main way to communicate with the MapReduce

2 questions about SerDe

2012-02-21 Thread Koert Kuipers
1) Is there a way in initialize() of a SerDe to know if it is being used as a Serializer or a Deserializer. If not, can i define the Serializer and Deserializer separately instead of defining a SerDe (so i have two initialize methods)? 2) Is there a way to find out which columns are being used? sa

external partitioned table

2012-02-08 Thread Koert Kuipers
hello all, we have an external partitioned table in hive. we add to this table by having map-reduce jobs (so not from hive) create new subdirectories with the right format (partitionid=partitionvalue). however hive doesn't pick them up automatically. we have to go into hive shell and run "alter
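the command is cut off above; a minimal sketch of the two era-appropriate fixes, assuming a hypothetical external table named logs partitioned by partitionid:

-- register one new directory explicitly
alter table logs add partition (partitionid='20120208')
location '/warehouse/logs/partitionid=20120208';

-- or have hive scan the table location and pick up whatever it finds
msck repair table logs;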

Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
Hello, >>> Have you tried running only the select, without creating the table? What are >>> the results? >>> How did you try to set the number of reducers? Have you used this: >>> set mapred.reduce.tasks = xyz; >>> How many mappers does this query use? >>>

Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
/** > joinoptimization.html<https://cwiki.apache.org/Hive/joinoptimization.html> > > Especially try this setting: > set hive.auto.convert.join = true; (or false) > > Which version of Hive are you using? > > > On 13.01.2012 00:24, Koert Kuipers wrote: > >> hive>

Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
angiewicz wrote: > What do you mean by "Select runs fine" - is it using number of reducers > that you set? > It might help if you could show actual query. > > > On 13.01.2012 00:03, Koert Kuipers wrote: > >> I tried set mapred.reduce.tasks = xyz; hive ignored it.

Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
e you used this: >> set mapred.reduce.tasks = xyz; >> How many mappers does this query use? >> >> >> On 12.01.2012 23:53, Koert Kuipers wrote: >> >>> I am running a basic join of 2 tables and it will only run with 1 >>> reducer. >>> why is that?

Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
et number of reducers? Have you used this: > set mapred.reduce.tasks = xyz; > How many mappers does this query use? > > > On 12.01.2012 23:53, Koert Kuipers wrote: > >> I am running a basic join of 2 tables and it will only run with 1 reducer. >> why is that? i tried to set

why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
I am running a basic join of 2 tables and it will only run with 1 reducer. why is that? i tried to set the number of reducers and it didn't work. hive just ignored it. create table z as select x.* from table1 x join table2 y where ( x.col1 = y.col1 and x.col2 = y.col2 and x.col3 = y.col3 and x.col
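the query is cut off mid-condition; one plausible culprit, for what it's worth: with the join condition in a where clause instead of an on clause, hive of this era treats the join as a cartesian product, and a cartesian product has no join key to distribute on, so it runs on a single reducer regardless of mapred.reduce.tasks. a sketch of the rewrite, reusing the column names from the query above:

create table z as
select x.*
from table1 x
join table2 y
on (x.col1 = y.col1 and x.col2 = y.col2 and x.col3 = y.col3);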

Re: Do some basic operations on my hive warehouse from java

2011-10-19 Thread Koert Kuipers
l protocol = new TBinaryProtocol(transport); HiveClient client = new HiveClient(protocol); client.drop_table(database, table, true) Do you know of any similar interface for a local hive? Thanks On Wed, Oct 19, 2011 at 5:28 PM, Edward Capriolo wrote: > > > On Wed, Oct 19, 2011 at 5:18 PM, K

Re: authorization and remote connection (on cdh3u1)

2011-10-19 Thread Koert Kuipers
logged in? I thought the actual username was passed along in thrift if authorization was enabled, and that the actual username would be used for authorization. Am i wrong about this? On Wed, Oct 19, 2011 at 6:01 PM, Koert Kuipers wrote: > Using a normal hive connection and authorization it se

authorization and remote connection (on cdh3u1)

2011-10-19 Thread Koert Kuipers
Using a normal hive connection and authorization it seems to work for me: hive> revoke all on database default from user koert; OK Time taken: 0.043 seconds hive> create table tmp(x string); Authorization failed:No privilege 'Create' found for outputs { database:default}. Use show grant to get more
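for completeness, a sketch of undoing the revoke and inspecting privileges with the legacy authorization syntax (user name taken from the transcript above):

hive> grant create on database default to user koert;
hive> show grant user koert on database default;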

Do some basic operations on my hive warehouse from java

2011-10-19 Thread Koert Kuipers
I have the need to do some cleanup on my hive warehouse from java, such as deleting tables (both in the metastore and the files on hdfs). I found out how to do this using a remote connection: org.apache.hadoop.hive.service.HiveClient connects to a hive server with only a few lines of code, and it provide

Re: Exception when joining HIVE tables

2011-09-21 Thread Koert Kuipers
"select * from table" does not use map-reduce so it seems your error has to do with hadoop/map-reduce, not hive i would run some test for map-reduce On Wed, Sep 21, 2011 at 4:11 PM, Krish Khambadkone wrote: > Hi, I get this exception when I try to join two hive tables or even when I > use a spec

remove duplicates based on one (or a few) columns

2011-09-14 Thread Koert Kuipers
what is the easiest way to remove rows which are considered duplicates based upon a few columns in the rows? so "create table deduped as select distinct * from table" won't do...
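a minimal sketch of one era-appropriate approach, assuming hypothetical columns key1, key2 to dedup on and col3, col4 as the rest: group on the key columns and pick an arbitrary surviving value per remaining column:

create table deduped as
select key1, key2, min(col3) as col3, min(col4) as col4
from some_table
group by key1, key2;

note the caveat: min() is evaluated per column, so the surviving row can mix values from different duplicates; keeping a whole row intact would need a custom UDAF (or the windowing functions that arrived in later hive versions).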

Re: Hive thrift interface and user permissions / user auditing

2011-09-07 Thread Koert Kuipers
he way, I am still confused about user "thrift". Is there any process > run by user "thrift" > > Hope it helps, > Ashutosh > > On Tue, Sep 6, 2011 at 09:09, Koert Kuipers wrote: > >> The metastore is running as user "hive", and we are indeed runnin

Re: Hive thrift interface and user permissions / user auditing

2011-09-06 Thread Koert Kuipers
user through doAs() which > preserves the identity which is not the case in unsecure mode. > Through hive client you see the usernames correctly even In unsecure mode > because its a hive client process (which is run as koert) which does the > filesystem operations. > > Hope it helps, &g

Hive thrift interface and user permissions / user auditing

2011-09-06 Thread Koert Kuipers
When i run a query from the hive command line client i can see that it is being run as me (for example, in HDFS log i see INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=koert). But when i do anything with the thrift interface my username is lost (i see ugi=thrift in HDFS logs)

Re: UDAF and group by

2011-09-05 Thread Koert Kuipers
ssible that a group of data > is processed by multiple reducers and those two methods are needed. > > If you need to process records in each group in a single method, you can > first use collect_set to collect your group data and process them in a UDF. > > > 2011/9/4 Koert Ku
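a sketch of the workaround proposed in this reply, with a hypothetical table t (key, val) and a hypothetical udf my_process that takes the collected array:

select key, my_process(collect_set(val)) as result
from t
group by key;

collect_set is a built-in hive UDAF; note it also removes duplicate values within the group.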

Re: UDAF and group by

2011-09-04 Thread Koert Kuipers
s? > > Can you give some examples of what kind of query you'd like to run? > > > 2011/8/30 Koert Kuipers > >> If i run my own UDAF with group by, can i be sure that a single UDAF >> instance initialized once will process all members in a group? Or should i >>

UDAF and group by

2011-08-30 Thread Koert Kuipers
If i run my own UDAF with group by, can i be sure that a single UDAF instance initialized once will process all members in a group? Or should i code so as to take into account the situation where even within a group multiple UDAFs could run, and i would have to deal with terminatePartial() and merg

Re: Re: multiple tables join with only one hug table.

2011-08-13 Thread Koert Kuipers
tc) > else discard the row. > end loop > > > At 2011-08-13 01:17:16,"Koert Kuipers" wrote: > > A mapjoin does what you described: it builds hash tables for the smaller > tables. In recent versions of hive (like the one i am using with cloudera > cdh3

Re: multiple tables join with only one hug table.

2011-08-12 Thread Koert Kuipers
A mapjoin does what you described: it builds hash tables for the smaller tables. In recent versions of hive (like the one i am using with cloudera cdh3u1) a mapjoin will be done for you automatically if you have your parameters set correctly. The relevant parameters in hive-site.xml are: hive.auto.
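the snippet is cut off mid-parameter; a sketch of the settings it most likely refers to (the threshold value below is illustrative, not authoritative):

set hive.auto.convert.join=true;
-- size threshold in bytes below which a table is considered small enough
-- for the map side; named hive.smalltable.filesize in hive 0.7 and renamed
-- hive.mapjoin.smalltable.filesize in later releases
set hive.mapjoin.smalltable.filesize=25000000;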

Re: Lzo Compression

2011-07-27 Thread Koert Kuipers
P_CLASSPATH:/home/ankit/hadoop-0.20.1/lib/hadoop-lzo-0.4.12.jar > export > JAVA_LIBRARY_PATH=/home/ankit/hadoop-0.20.1/lib/native/Linux-i386-32/ > > 9. Restart the cluster > > 10. uploaded lzo file into hdfs > > 11. Ran the following command for indexing: > bin/hadoop ja

Re: Lzo Compression

2011-07-26 Thread Koert Kuipers
my installation notes for lzo-hadoop (might be wrong or incomplete): we run centos 5.6 and cdh3
yum -y install lzo
git clone https://github.com/toddlipcon/hadoop-lzo.git
cd hadoop-lzo
ant
cd build
cp hadoop-lzo-0.4.10/hadoop-lzo-0.4.10.jar /usr/lib/hadoop/lib
cp -r hadoop-lzo-0.4.10/lib/native

What sort of sequencefiles are created in hive

2011-07-25 Thread Koert Kuipers
Knowing that sequencefiles can store data (especially numeric data) much more compactly than text, i started converting our hive database from lzo compressed text format to lzo compressed sequencefiles. My first observation was that the files were not smaller, which surprised me since we have mostly

Re: conversion of left outer join to mapjoin

2011-07-25 Thread Koert Kuipers
anyone any idea? this seems like very strange behavior to me. and it blows up the job. On Fri, Jul 22, 2011 at 5:51 PM, Koert Kuipers wrote: > hello, > we have 2 tables x and y. table x is 11GB on disk and has 23M rows. table y > is 3GB on disk and has 28M rows. Both tables are stor

Re: Lzo Compression

2011-07-25 Thread Koert Kuipers
I have LZO compression enabled by default in hadoop 0.20.2 and hive 0.7.0 and it works well so far. On Mon, Jul 25, 2011 at 7:04 AM, Vikas Srivastava < vikas.srivast...@one97.net> wrote: > Hey , > > i just want to use any compression in hadoop so i heard about lzo which is > best among all the co
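a sketch of what "enabled by default" plausibly meant in this reply, assuming the hadoop-lzo codec from the install notes above is on the classpath (exact property names varied across hadoop versions):

set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
set mapred.output.compression.type=BLOCK;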

conversion of left outer join to mapjoin

2011-07-22 Thread Koert Kuipers
hello, we have 2 tables x and y. table x is 11GB on disk and has 23M rows. table y is 3GB on disk and has 28M rows. Both tables are stored as LZO compressed sequencefiles without bucketing. a normal join of x and y gets executed as a map-reduce-join in hive and works very well. an outer join also g
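a sketch of forcing the map-side join with a hint instead of relying on auto conversion; in a left outer join only the non-preserved (right) side can be hashed, so the hint must name y. the join key here is hypothetical:

select /*+ mapjoin(y) */ x.*
from x left outer join y
on (x.id = y.id);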

Re: hive mapjoin decision process

2011-07-19 Thread Koert Kuipers
s in memory locally, the local hashmap has to fail. So check > your machine's memory or the memory allocated for hive. > > Thanks > Yongqiang > On Tue, Jul 19, 2011 at 1:55 PM, Koert Kuipers wrote: > > thanks! > > i only see hive create the hashmap dump and perfo

Re: hive mapjoin decision process

2011-07-19 Thread Koert Kuipers
ashmap dump, it will go back > normal join.The reason here is mostly due to very good compression on > the input data. > 3) the mapjoin actually got started, and fails. it will fall back > normal join. This will most unlikely happen > > Thanks > Yongqiang > On Tue, Jul 19, 2

remote hive metastore

2011-07-19 Thread Koert Kuipers
i am testing running a remote hive metastore. i understand that the client communicates with the metastore via thrift. now is it the case that the client still communicates with HDFS directly? in the metastore i see logs for all the actions that i perform on the client. but they show up like this:
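the log lines are cut off here; for reference, a sketch of the client-side wiring being described, with a hypothetical host. the thrift uri covers only metadata calls, while table reads and writes still go straight to hdfs:

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>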

hive mapjoin decision process

2011-07-19 Thread Koert Kuipers
note: this is somewhat of a repost of something i posted on the CDH3 user group. apologies if that is not appropriate. i am exploring map-joins in hive. with hive.auto.convert.join=true hive tries to do a map-join and then falls back on a mapreduce-join if certain conditions are not met. this sounds