ORC was created inside hive instead of (as it should have been, i think) as
a file format library that hive and other frameworks can depend on. it
seems to be part of hive's annoying tendency not to think of itself as a
java library.
On Thu, Feb 18, 2016 at 2:38 AM, Abhishek Dubey
wrote:
r the hive book I highlighted many smart, capable organizations use hive.
>
> Your argument is totally valid. You like X better because X works for you.
> You don't need to 'preach' here; we all know hive has its limits.
>
> On Thu, Feb 4, 2016 at 10:55 AM, Koert
> The reality is that once you start factoring in the numerous tuning
> parameters of the systems and jobs there probably isn't a clear answer.
> For some queries, the Catalyst optimizer may do a better job...is it going
> to do a better job with ORC based data? less likely IMO.
>
>
> It seems like hive
> and tez made spark say uncle...
>
> https://www.slideshare.net/mobile/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
>
>
> On Wednesday, February 3, 2016, Koert Kuipers wrote:
>
>> ok i am sure there is some way to do it. i am going to guess s
available at all. so yeah
yeah i am sure it can be made to *work*. just like you can get a nail into
a wall with a screwdriver if you really want.
On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers wrote:
> yeah but have you ever seen somebody write a real analytical program in
> hive? how?
you are not boxed in by a long shot.
>
>
> On Tuesday, February 2, 2016, Koert Kuipers wrote:
>
>> uuuhm with spark using Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just being
>> boxed into some version of
uuuhm with spark using Hive metastore you actually have a real programming
environment and you can write real functions, versus just being boxed into
some version of sql and limited udfs?
On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang wrote:
> When comparing the performance, you need to do it apple
touch them very much. Usually if they do change it
>> is something small and if you tie the commit to a jira you can figure out
>> what and why.
>>
>> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers wrote:
>>
>>> seems the metastore thrift service supports SASL. t
version deployed.
in that case i admit it's not as bad as i thought. let's see!
On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers wrote:
> oh sorry edward, i misread your post. seems we agree that "SQL constructs
> inside hive" are not for other systems.
>
> On Sat, Jan 31, 2015 at
oh sorry edward, i misread your post. seems we agree that "SQL constructs
inside hive" are not for other systems.
On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers wrote:
> edward,
> i would not call "SQL constructs inside hive" accessible for other
> systems. its in
521269638&ref=pd_sl_4yiryvbf8k_e
> there are even examples where I show how to iterate all the tables inside
> the database from a java client.
>
> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers wrote:
>
>> yes you can run whatever you like with the data in hdfs. keep in
by HBase) will be needed for that.
On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers wrote:
> yes you can run whatever you like with the data in hdfs. keep in mind that
> hive makes this general access pattern just a little harder, since hive has
> a tendency to store data and metadata s
ver paradigm they like?
>
> E.g.: GraphX, Mahout, etc.
>
> Also, what about Tajo or Drill?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> PS: Spark-SQL is read-only IIRC, right?
> On 31 Jan 2015 03:39, "Koert Kuipers" wrote:
>
since you require high-powered analytics, and i assume you want to stay
sane while doing so, you require the ability to "drop out of sql" when
needed. so spark-sql and lingual would be my choices.
low latency indicates phoenix or spark-sql to me.
so i would say spark-sql
On Fri, Jan 30, 2015 at
I am working on a hive SerDe where both SerDe and RecordReader need to have
access to an external resource with information.
This external resource could be on hdfs, in hbase, or on a http server.
This situation is very similar to what haivvreo does.
The way i go about it right now is that i store
I make changes to the Configuration in my SerDe expecting those to be
passed to the InputFormat (and OutputFormat). Yet the InputFormat seems to
get an unchanged JobConf? Is this a known limitation?
I find it very confusing since the Configuration is the main way to
communicate with the MapReduce
1) Is there a way in initialize() of a SerDe to know if it is being used as
a Serializer or a Deserializer? If not, can i define the Serializer and
Deserializer separately instead of defining a SerDe (so i have two
initialize methods)?
2) Is there a way to find out which columns are being used? sa
hello all,
we have an external partitioned table in hive.
we add to this table by having map-reduce jobs (so not from hive) create
new subdirectories with the right format (partitionid=partitionvalue).
however hive doesn't pick them up automatically. we have to go into hive
shell and run "alter
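The usual fix, sketched here with placeholder table and partition names (not taken from
the thread above), is either to register each new directory explicitly or to let hive
rescan the table location:
alter table mytable add if not exists partition (partitionid='partitionvalue');
msck repair table mytable;
msck repair table scans the table directory and adds any partitions that exist on hdfs
but are not yet in the metastore.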
Hello,
>>> Have you tried running only select, without creating table? What are
>>> results?
>>> How did you try to set the number of reducers? Have you used this:
>>> set mapred.reduce.tasks = xyz;
>>> How many mappers does this query use?
>>>
> joinoptimization.html<https://cwiki.apache.org/Hive/joinoptimization.html>
>
> Especially try this setting:
> set hive.auto.convert.join = true; (or false)
>
> Which version of Hive are you using?
>
>
> On 13.01.2012 00:24, Koert Kuipers wrote:
>
>> hive>
angiewicz
wrote:
> What do you mean by "Select runs fine" - is it using the number of reducers
> that you set?
> It might help if you could show actual query.
>
>
> On 13.01.2012 00:03, Koert Kuipers wrote:
>
>> I tried set mapred.reduce.tasks = xyz; hive ignored it.
>> Have you used this:
>> set mapred.reduce.tasks = xyz;
>> How many mappers does this query use?
>>
>>
>> On 12.01.2012 23:53, Koert Kuipers wrote:
>>
>>> I am running a basic join of 2 tables and it will only run with 1
>>> reducer.
>>> why is that?
> How did you try to set the number of reducers? Have you used this:
> set mapred.reduce.tasks = xyz;
> How many mappers does this query use?
>
>
> On 12.01.2012 23:53, Koert Kuipers wrote:
>
>> I am running a basic join of 2 tables and it will only run with 1 reducer.
>> why is that? i tried to set
I am running a basic join of 2 tables and it will only run with 1 reducer.
why is that? i tried to set the number of reducers and it didn't work. hive
just ignored it.
create table z as select x.* from table1 x join table2 y where (
x.col1 = y.col1 and
x.col2 = y.col2 and
x.col3 = y.col3 and
x.col
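A sketch of the presumed intent (not the original statement; the table and column names
follow the fragment above, and 10 is only an example value). Note that with the join
condition in a WHERE clause rather than an ON clause, hive treats the join as a cross
product, which it typically executes with a single reducer regardless of
mapred.reduce.tasks:
set mapred.reduce.tasks = 10;
create table z as select x.* from table1 x join table2 y
on (x.col1 = y.col1 and x.col2 = y.col2 and x.col3 = y.col3);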
// transport/protocol setup for the hive server thrift interface
TSocket transport = new TSocket(host, port);
transport.open();
TProtocol protocol = new TBinaryProtocol(transport);
HiveClient client = new HiveClient(protocol);
client.drop_table(database, table, true);
Do you know of any similar interface for a local hive? Thanks
On Wed, Oct 19, 2011 at 5:28 PM, Edward Capriolo wrote:
>
>
> On Wed, Oct 19, 2011 at 5:18 PM, K
logged in?
I thought the actual username was passed along in thrift if authorization
was enabled, and that the actual username would be used for authorization.
Am i wrong about this?
On Wed, Oct 19, 2011 at 6:01 PM, Koert Kuipers wrote:
> Using a normal hive connection and authorization it se
Using a normal hive connection and authorization it seems to work for me:
hive> revoke all on database default from user koert;
OK
Time taken: 0.043 seconds
hive> create table tmp(x string);
Authorization failed:No privilege 'Create' found for outputs {
database:default}. Use show grant to get more
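For reference, the legacy authorization statements being exercised here look roughly like
this (a sketch; the user and database names simply follow the example above):
show grant user koert on database default;
grant create on database default to user koert;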
I have the need to do some cleanup on my hive warehouse from java, such as
deleting tables (both in metastore and the files on hdfs)
I found out how to do this using remote connection:
org.apache.hadoop.hive.service.HiveClient connects to a hive server with
only a few lines of code, and it provide
"select * from table" does not use map-reduce
so it seems your error has to do with hadoop/map-reduce, not hive
i would run some tests for map-reduce
On Wed, Sep 21, 2011 at 4:11 PM, Krish Khambadkone
wrote:
> Hi, I get this exception when I try to join two hive tables or even when I
> use a spec
what is the easiest way to remove rows which are considered duplicates based
upon a few columns in the rows?
so "create table deduped as select distinct * from table" won't do...
By the way, I am still confused about user "thrift". Is there any process
> run by user "thrift"
>
> Hope it helps,
> Ashutosh
>
> On Tue, Sep 6, 2011 at 09:09, Koert Kuipers wrote:
>
>> The metastore is running as user "hive", and we are indeed runnin
user through doAs() which
> preserves the identity which is not the case in unsecure mode.
> Through hive client you see the usernames correctly even In unsecure mode
> because its a hive client process (which is run as koert) which does the
> filesystem operations.
>
> Hope it helps,
>
When i run a query from the hive command line client i can see that it is
being run as me (for example, in HDFS log i see INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=koert).
But when i do anything with the thrift interface my username is lost (i see
ugi=thrift in HDFS logs)
ssible that a group of data
> is processed by multiple reducers and those two methods are needed.
>
> If you need to process records in each group in a single method, you can
> first use collect_set to collect your group data and process them in a UDF.
>
>
> 2011/9/4 Koert Ku
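A sketch of the collect_set approach mentioned above (my_udf and the table and column
names are hypothetical):
select groupkey, my_udf(collect_set(value_col)) as result
from mytable
group by groupkey;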
s?
>
> Can you give some examples of what kind of query you'd like to run?
>
>
> 2011/8/30 Koert Kuipers
>
>> If i run my own UDAF with group by, can i be sure that a single UDAF
>> instance initialized once will process all members in a group? Or should i
>>
If i run my own UDAF with group by, can i be sure that a single UDAF
instance initialized once will process all members in a group? Or should i
code so as to take into account the situation where even within a group
multiple UDAFs could run, and i would have to deal with terminatePartial()
and merg
tc)
> else discard the row.
> end loop
>
>
> At 2011-08-13 01:17:16,"Koert Kuipers" wrote:
>
> A mapjoin does what you described: it builds hash tables for the smaller
> tables. In recent versions of hive (like the one i am using with cloudera
> cdh3
A mapjoin does what you described: it builds hash tables for the smaller
tables. In recent versions of hive (like the one i am using with cloudera
cdh3u1) a mapjoin will be done for you automatically if you have your
parameters set correctly. The relevant parameters in hive-site.xml are:
hive.auto.
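The parameters being referred to are presumably hive.auto.convert.join plus the
small-table size threshold; a sketch of the hive-site.xml entries (the threshold shown is
just the usual default, not a recommendation):
<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
</property>
<property>
  <name>hive.mapjoin.smalltable.filesize</name>
  <value>25000000</value>
</property>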
P_CLASSPATH:/home/ankit/hadoop-0.20.1/lib/hadoop-lzo-0.4.12.jar
> export
> JAVA_LIBRARY_PATH=/home/ankit/hadoop-0.20.1/lib/native/Linux-i386-32/
>
> 9. Restart the cluster
>
> 10. uploaded lzo file into hdfs
>
> 11. Ran the following command for indexing:
> bin/hadoop ja
my installation notes for lzo-hadoop (might be wrong or incomplete):
we run centos 5.6 and cdh3
yum -y install lzo
git clone https://github.com/toddlipcon/hadoop-lzo.git
cd hadoop-lzo
ant
cd build
cp hadoop-lzo-0.4.10/hadoop-lzo-0.4.10.jar /usr/lib/hadoop/lib
cp -r hadoop-lzo-0.4.10/lib/native
Knowing that sequencefiles can store data (especially numeric data) much
more compactly than text, i started converting our hive database from lzo
compressed text format to lzo compressed sequencefiles.
My first observation was that the files were not smaller, which surprised me
since we have mostly
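For context, the conversion described above is typically done with something along these
lines (a sketch; the table names are placeholders, and the codec settings assume the lzo
setup discussed in the neighboring threads):
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
create table mytable_seq stored as sequencefile
as select * from mytable_text;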
anyone have any idea? this seems like very strange behavior to me. and it blows
up the job.
On Fri, Jul 22, 2011 at 5:51 PM, Koert Kuipers wrote:
> hello,
> we have 2 tables x and y. table x is 11GB on disk and has 23M rows. table y
> is 3GB on disk and has 28M rows. Both tables are stor
I have LZO compression enabled by default in hadoop 0.20.2 and hive 0.7.0
and it works well so far.
On Mon, Jul 25, 2011 at 7:04 AM, Vikas Srivastava <
vikas.srivast...@one97.net> wrote:
> Hey ,
>
> i just want to use any compression in hadoop so i heard about lzo which is
> best among all the co
hello,
we have 2 tables x and y. table x is 11GB on disk and has 23M rows. table y
is 3GB on disk and has 28M rows. Both tables are stored as LZO compressed
sequencefiles without bucketing.
a normal join of x and y gets executed as a map-reduce-join in hive and works
very well. an outer join also g
s in memory locally, the local hashmap has to fail. So check
> your machine's memory or the memory allocated for hive.
>
> Thanks
> Yongqiang
> On Tue, Jul 19, 2011 at 1:55 PM, Koert Kuipers wrote:
> > thanks!
> > i only see hive create the hashmap dump and perfo
ashmap dump, it will go back to a
> normal join. The reason here is mostly due to very good compression on
> the input data.
> 3) the mapjoin actually got started, and fails. it will fall back to a
> normal join. This is most unlikely to happen.
>
> Thanks
> Yongqiang
> On Tue, Jul 19, 2
i am testing running a remote hive metastore. i understand that the client
communicates with the metastore via thrift.
now is it the case that the client still communicates with HDFS directly?
in the metastore i see logs for all the actions that i perform on the
client. but they show up like this:
note: this is somewhat a repost of something i posted on the CDH3 user
group. apologies if that is not appropriate.
i am exploring map-joins in hive. with hive.auto.convert.join=true hive
tries to do a map-join and then falls back on a mapreduce-join if certain
conditions are not met. this sounds
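For reference, the session-level settings involved look roughly like this (a sketch; the
values are examples, and hive.mapjoin.localtask.max.memory.usage is assumed here to be
the knob that governs the fallback when the local hash table does not fit in memory):
set hive.auto.convert.join=true;
set hive.mapjoin.localtask.max.memory.usage=0.90;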