Wow thanks everyone for the nice feedback!
I can force a map-side join by doing /*+ STREAMTABLE(members_map) */, right?
Cheers,
Ruben de Vries
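For reference, a minimal sketch of the hint syntax. In Hive, STREAMTABLE names the table that is streamed through the reducers of a regular join, while the hint that requests a map-side join is MAPJOIN. The join column member_id on members_map is an assumption, since the full schemas are not shown in this thread.

```sql
-- Assumption: both tables share a member_id column; adjust to the real schema.
-- MAPJOIN asks Hive to load members_map (alias mm) into memory on each mapper
-- and join it against visit_stats there, without a reduce phase for the join.
SELECT /*+ MAPJOIN(mm) */ vs.member_id
FROM visit_stats vs
JOIN members_map mm ON (vs.member_id = mm.member_id);
```

The hint takes the table alias used in the query, so it has to match mm here rather than the underlying table name.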
-Original Message-
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Tuesday, April 24, 2012 3:17 AM
To: user@hive.apache.org; bejoy ks
Cc: Ru
Thanks, all, for your answers.
But I want to ask one more thing:
1) I have written a program (my task) which contains Hive JDBC code and
code (Sqoop commands) for importing and exporting the tables.
If I create a JAR of my program and put it on EMR, do I need to
do some extra t
Just wrote up an article on how to install Sqoop on Amazon EMR:
http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/
--
Kyle Mulka
mu...@umich.edu
206 883 5352
http://www.kylemulka.com
On Mon, Apr 23, 2012 at 10:55 AM, Kyle Mulka wrote:
> It is possible to i
Added a tiny blurb here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-UDFinternals
Comments/suggestions welcome!
Thanks for bringing it up, Justin.
Mark
Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com
e: mg
Thanks Nitin. I am aware of what Hive is doing. The question is, is it
okay not to return an error/warning when no data is found, since the
metadata for the table also contains the data location when you create
the table (which creates the hdfs directory as well). So, if somebody
erroneously removes t
Hive table metadata is stored in a metastore, which retains the table
structure and other meta info even if you delete the table's HDFS directory,
because it is kept in the metastore DB.
When you do a select * from table;
1) Hive checks that the table exists in the metastore
2) if the table exists t
Hi Bhavesh,
To answer your questions:
1) S3 terminology uses the word "object", and I am sure they have good reasons
for it, but for us Hive'ers an S3 object is the same as a file stored on S3.
The complete path to the file would be what Amazon calls the S3 "key" and the
corresponding value
Hi Ruben,
Like Bejoy pointed out, members_map is small enough to fit in memory, so your
joins with visit_stats would be much faster with map-side join.
However, there is still some virtue in bucketing visit_stats. Bucketing can
optimize joins, group by's and potentially other queries in certain
If the data is in HDFS, you can bucket it only by first loading it into a
temp/staging table and then inserting it into the final bucketed table.
Bucketing needs a MapReduce job.
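The staging approach described above can be sketched roughly as follows. The table names, column list, file path and bucket count of 32 are illustrative assumptions, and hive.enforce.bucketing applies to Hive releases of this era.

```sql
-- Hypothetical names: visit_stats_staging, visit_stats_bucketed, the HDFS path,
-- the columns and the bucket count are assumptions for illustration only.
CREATE TABLE visit_stats_staging (member_id INT, visits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- LOAD DATA only moves files into place; it does not redistribute rows.
LOAD DATA INPATH '/tmp/visit_stats_output' INTO TABLE visit_stats_staging;

CREATE TABLE visit_stats_bucketed (member_id INT, visits INT)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

-- Have Hive use one reducer per bucket so the output files match the table definition.
SET hive.enforce.bucketing = true;

-- This INSERT launches the MapReduce job that hashes rows into buckets.
INSERT OVERWRITE TABLE visit_stats_bucketed
SELECT member_id, visits FROM visit_stats_staging;
```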
Regards
Bejoy KS
Sent from handheld, please excuse typos.
-Original Message-
From: Ruben de Vries
Date: Mon, 23 Apr 2012 18:
Thanks for the help so far guys,
I bucketed the members_map, it's 330 MB in size (11 million records).
Can you manually bucket stuff?
Since my initial mapreduce job is still outside of Hive I'm doing a LOAD DATA
to import stuff into the visit_stats tables, replacing that with INSERT
OVERWRITE SELECT
There are many good reasons to use bucketing. I have these rules for
when to dump partitioning in favor of bucketing:
1) too many partitions a day (500 + partitions a day, # of file issues)
2) an unpredictable number of partitions per day.
You can weigh these factors with other benefits buckets
For a bucketed map join, both tables should be bucketed on the join column, and
the number of buckets of one should be a multiple of the other.
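A minimal sketch of how such a join might be requested, assuming both tables are already CLUSTERED BY (member_id) with compatible bucket counts (for example 64 and 32, which are made-up numbers):

```sql
-- Assumption: visit_stats and members_map are both bucketed on member_id,
-- and one bucket count is a multiple of the other (e.g. 64 and 32).
SET hive.optimize.bucketmapjoin = true;

-- With the setting above, each mapper only loads the matching bucket(s) of
-- members_map instead of the whole table.
SELECT /*+ MAPJOIN(mm) */ vs.member_id
FROM visit_stats vs
JOIN members_map mm ON (vs.member_id = mm.member_id);
```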
Regards
Bejoy KS
Sent from handheld, please excuse typos.
-Original Message-
From: "Bejoy KS"
Date: Mon, 23 Apr 2012 16:03:34
To:
Reply-To: bejoy...@yahoo.com
Subject:
Bucketed map join would be good I guess. What is the total size of the smaller
table and what is its expected size in the next few years?
If it is small enough to be put in the Distributed Cache, then map-side
joins would offer you a significant performance improvement.
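As a rough sketch, Hive can also decide on a map-side join automatically based on table size. The threshold value below is an illustrative assumption, not a recommendation; whether a table of that size actually fits in mapper memory needs to be tested.

```sql
-- Let Hive convert a common join into a map-side join when one side is small.
SET hive.auto.convert.join = true;

-- Size threshold in bytes below which a table counts as "small"
-- (the default is 25000000). The value here is an assumption chosen only
-- to show the syntax.
SET hive.mapjoin.smalltable.filesize = 400000000;
```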
Regards
Bejoy KS
Sent from
Ok, very clear on the partitions, try to make them match the WHERE clauses, not
so much about group clauses then ;)
The member_map contains 11,636,619 records atm, I think bucketing those would
be good?
What's a good number to bucket them by then?
And is there any point in bucketing the visit_s
If you're only interested in a certain window of dates for analysis, a
date-based partition scheme will be helpful, as it will trim partitions that
aren't needed by the query before execution.
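A small sketch of what that looks like in practice. The table name visit_stats_by_day, its columns and the partition column dt are hypothetical, as the real schema is not shown in full here.

```sql
-- Hypothetical schema: visit data partitioned by a string column dt (yyyy-MM-dd).
CREATE TABLE visit_stats_by_day (member_id INT, visits INT)
PARTITIONED BY (dt STRING);

-- Because the filter is on the partition column, only the partitions for the
-- requested week are read; the others are pruned before the job starts.
SELECT member_id, SUM(visits) AS total_visits
FROM visit_stats_by_day
WHERE dt >= '2012-04-16' AND dt <= '2012-04-22'
GROUP BY member_id;
```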
If the member_map table is small, you might consider testing the feasibility of
map-side joins, as it
Partitions are good when you want to run your queries on a subset of the whole
data set, so the choice of partition column depends on your queries. One point
to take care of is that every partition has enough data.
Partitioning takes effect when you filter on the partition column in the WHERE clause.
Buckets are good for sa
It seems there's enough information to be found on how to set up and use
partitions and buckets.
But I'm more interested in how to figure out when and what columns you should
be partitioning and bucketing to increase performance?!
In my case I got 2 tables, 1 visit_stats (member_id, date and some
It is possible to install Sqoop on AWS EMR. I've got some scripts I can publish
later. You are not required to use S3 to store files and can use the local
(temporary) HDFS instead. After you have Sqoop installed, you can import your
data with it into HDFS, run your calculations in HDFS, then exp
Once you have a Hive Job flow running on Amazon EMR, you'll have access to the
file system on the underlying EC2 machines (you'll get the machine name, etc.
once the cluster is running). You can then move your data files onto the EC2
machine's file system and load them into HDFS/Hive. I am not sure abo
Hi,
I know that it is possible to use regex column specification with the
SELECT clause like so:
SELECT `employee.*` FROM employees;
I was wondering if it is possible to use it with the WHERE clause also.
For example I want to create the query:
SELECT `employee.*` FROM employees WHERE $1
It's a bit of a weird case but I thought I might share it and hopefully find
someone who can confirm this to be a bug or tell me I should do things
differently!
Here you can find a pastie with the full create and select queries:
http://pastie.org/3838924
I've got two tables:
`visit_stats` with
Hello all,
I want to deploy my task on Amazon EMR. But as I am new to Amazon Web
Services, I am having trouble understanding the concepts.
My Use Case:
I want to import a large amount of data from EC2 into Hive through Sqoop.
The imported data will be processed in Hive by applying some algorithm
and
Hello All,
Thank you very much for the responses. I can confirm that the lag function
implementation works in my case:
create temporary function lag as 'com.example.hive.udf.Lag';
select session_id,hit_datetime_gmt,lag(hit_datetime_gmt,session_id)
from (select session_id,hit_datetime_gmt from omni2