It can be done in Hive. Whether or not it is the "best choice" depends on
whether or not you have any other reason for your data to be in Hive.
If you are wondering whether Hive is the best tool for accomplishing this one
task, it would probably be easier to do in Pig.
From: JB Rawlings [mail
We are considering whether Hive is the best choice for "sessionizing" a set of
data given the following parameters:
* Input data set: A series of records with userID, startTimstamp,
EndTimestamp, recordType, etc.
* Output data set: Same records (no aggregation) with an added
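The added output field is cut off in the excerpt above. As a rough illustration only, the sketch below assumes it is a session identifier and that a new session starts after 30 minutes of inactivity per user; the `Record` fields, the `session_id` name, and the gap threshold are all assumptions, not details from the thread.

```python
from dataclasses import dataclass, replace
from itertools import groupby
from operator import attrgetter

GAP_SECONDS = 30 * 60  # assumed inactivity threshold; not stated in the thread


@dataclass(frozen=True)
class Record:
    user_id: str
    start_ts: int        # epoch seconds
    end_ts: int          # epoch seconds
    record_type: str
    session_id: str = ""  # hypothetical added field


def sessionize(records, gap=GAP_SECONDS):
    """Return the same records (no aggregation) with a session_id attached."""
    ordered = sorted(records, key=attrgetter("user_id", "start_ts"))
    out = []
    for user_id, group in groupby(ordered, key=attrgetter("user_id")):
        session_no = 0
        last_end = None
        for rec in group:
            # Start a new session when the gap since the latest end is too long.
            if last_end is not None and rec.start_ts - last_end > gap:
                session_no += 1
            out.append(replace(rec, session_id=f"{user_id}-{session_no}"))
            last_end = rec.end_ts if last_end is None else max(last_end, rec.end_ts)
    return out


records = [
    Record("u1", 0, 60, "a"),
    Record("u1", 100, 200, "b"),
    Record("u1", 5000, 5100, "a"),
    Record("u2", 50, 70, "a"),
]
result = sessionize(records)
```

The sort-then-scan shape is what makes this awkward in plain SQL and more natural in a procedural or dataflow tool: each record's session depends on the previous record for the same user.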
Thanks Alan for this explanation. Interesting to see Primary Key in Hive.
Sometimes a comparison is made between the Hive storage index concept in ORC
and the Oracle Exadata storage index, which also uses the same terminology!
It is a bit of a misnomer to call Oracle Exadata indexes a “storage ind
hive.aux.jars.path
/var/lib/hive
Try this setting
On Feb 1, 2016 7:46 PM, "Chagarlamudi, Prasanth" <
prasanth.chagarlam...@epsilon.com> wrote:
> Hello,
>
> We currently have custom jars (util, serde, etc.) that need to be deployed
> into the /hive/lib folder and the mapreduce folders whenever they are
Hello,
We currently have custom jars (util, serde, etc.) that need to be deployed into
the /hive/lib folder and the mapreduce folders whenever they are updated or
modified, and a restart is needed after every change.
I am currently trying to see if I can place these jars in an HDFS location and
read them dynamically.
ORC does not currently expose a primary key to the user, though we have
talked of having it do that. As Mich says the indexing on ORC is
oriented towards statistics that help the optimizer plan the query.
This can be very important in split generation (determining which parts
of the input wil
Please also bear in mind that data in Hive is supposed to be immutable
(although later versions allow one to update data in Hive).
Data coming into Hive is supposed to be already cleansed, having been sourced
from other transactional databases etc. So the concept of primary key (meaning
enforcing uniqueness fo
In relational databases, say Oracle or Sybase, there is only one primary key
for a given table. So by definition you can have one primary key on any table,
consisting of either one column or multiple columns (a composite primary key).
Please check threads on “ORC files and statistics” in this forum for de
What do you mean by the silver bullet? So you mean it is not stored as a
primary key on each column; it is just stored as a storage index, right?
"The statistics helps the optimiser. So whether one table or many, the
optimiser will take advantage of stats to push down the predicate for
faster
Also, when creating ORC from CSV, is an index key made for every column, or is
a single primary key made for the table?
If keys are made on each column in a table, accessing any column in operations
such as filtering should be faster.
On Mon, Feb 1, 2016 at 4:21 PM, Philip Lee wrote:
> Hello,
>
> I e
Use orcfiledump to see the stats for each column etc.
Example:
hive --orcfiledump --rowindex 1,2 /user/hive/warehouse/oraclehadoop.db/orctest/00_0
Dr Mich Talebzadeh
Hi,
ORC tables use what is known as a storage index, with stats (min, max, sum,
etc.) stored at the table, stripe and row index (batches of 10K rows) level.
The statistics help the optimiser. So whether one table or many, the optimiser
will take advantage of stats to push down the predicate for fa
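To make the idea of min/max statistics concrete, here is a minimal sketch of how per-stripe stats can let a reader skip data that cannot match a range predicate. It is an illustration of the general technique, not ORC's actual reader code; `ColumnStats`, `may_contain` and `stripes_to_read` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Min/max of one column within one stripe (hypothetical structure)."""
    minimum: int
    maximum: int


def may_contain(stats, lo, hi):
    """True if the predicate range [lo, hi] overlaps the stripe's [min, max]."""
    return stats.maximum >= lo and stats.minimum <= hi


def stripes_to_read(stripe_stats, lo, hi):
    """Indexes of stripes that cannot be ruled out by their statistics."""
    return [i for i, s in enumerate(stripe_stats) if may_contain(s, lo, hi)]


stats = [ColumnStats(1, 100), ColumnStats(101, 200), ColumnStats(201, 300)]
selected = stripes_to_read(stats, 150, 180)  # WHERE col BETWEEN 150 AND 180
```

Note that the stats only rule stripes out; a selected stripe may still contain no matching rows, so the predicate is re-applied to the rows actually read. This is also why such an index is not a primary key: it enforces nothing and points to no individual row.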
Hello,
I am experimenting with the performance of some systems, comparing ORC and CSV
files. I read the ORC documentation on the Hive website, but I am still curious
about some things.
I know the ORC format is faster at filtering or reading because it has indexing.
Does it also have an advantage when joining two tables of ORC datasets as wel