RE: Sessionize using Hive

2016-02-01 Thread Ryan Harris
It can be done in Hive. Whether or not it is the "best choice" depends on whether you have any other reason for your data to be in Hive. If you are wondering whether Hive is the best tool for accomplishing this one task, it would probably be easier to do in Pig. From: JB Rawlings [mail

Sessionize using Hive

2016-02-01 Thread JB Rawlings
We are considering whether Hive is the best choice for "sessionizing" a set of data given the following parameters: * Input data set: A series of records with userID, startTimestamp, endTimestamp, recordType, etc. * Output data set: Same records (no aggregation) with an added
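For readers unfamiliar with the term, here is a minimal Python sketch of one common way to sessionize records outside Hive. The message above is cut off before it states the session rule or the name of the added column, so everything specific here is an assumption: the inactivity-gap rule, the `max_gap_seconds` parameter, the `sessionID` field and the numeric timestamps are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def sessionize(records, max_gap_seconds):
    """Return the same records, each with an added 'sessionID' field.

    Assumed rule (not from the original message): a user's record starts a
    new session when its startTimestamp is more than max_gap_seconds after
    the latest endTimestamp seen so far for that user.
    """
    by_user = defaultdict(list)
    for record in records:
        by_user[record["userID"]].append(record)

    output = []
    for user, user_records in by_user.items():
        # Process each user's records in start-time order.
        user_records.sort(key=lambda r: r["startTimestamp"])
        session_number = 0
        last_end = None
        for record in user_records:
            if last_end is not None and record["startTimestamp"] - last_end > max_gap_seconds:
                session_number += 1
            # Track the latest end time, since records may overlap.
            if last_end is None or record["endTimestamp"] > last_end:
                last_end = record["endTimestamp"]
            # No aggregation: copy the record and add the session label.
            output.append({**record, "sessionID": f"{user}-{session_number}"})
    return output


sample = [
    {"userID": "u1", "startTimestamp": 0, "endTimestamp": 10},
    {"userID": "u1", "startTimestamp": 15, "endTimestamp": 20},
    {"userID": "u1", "startTimestamp": 100, "endTimestamp": 110},
    {"userID": "u2", "startTimestamp": 5, "endTimestamp": 6},
]
labelled = sessionize(sample, max_gap_seconds=30)
```

The same per-user, ordered comparison against the previous record is what a windowed query in Hive or a grouped script in Pig would need to express.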

RE: ORC format

2016-02-01 Thread Mich Talebzadeh
Thanks Alan for this explanation. Interesting to see Primary Key in Hive. A comparison is sometimes made between the Hive storage index concept in ORC and the Oracle Exadata storage index, which uses the same terminology! It is a bit of a misnomer to call Oracle Exadata indexes a “storage ind

Re: Adding custom jars for hive and map-reduce jobs dynamically from hdfs location

2016-02-01 Thread Yehuda Finkelstein
hive.aux.jars.path /var/lib/hive Try this setting On Feb 1, 2016 7:46 PM, "Chagarlamudi, Prasanth" < prasanth.chagarlam...@epsilon.com> wrote: > Hello, > > We currently have custom jars (util, serde, etc.) that need to be deployed > into the /hive/lib folder and mapreduce folders whenever they are

Adding custom jars for hive and map-reduce jobs dynamically from hdfs location

2016-02-01 Thread Chagarlamudi, Prasanth
Hello, We currently have custom jars (util, serde, etc.) that need to be deployed into the /hive/lib folder and the mapreduce folders whenever they are updated or modified, and a restart is needed after each change. I am currently trying to see if I can place these jars in an HDFS location and read them dynamically.

Re: ORC format

2016-02-01 Thread Alan Gates
ORC does not currently expose a primary key to the user, though we have talked of having it do that. As Mich says the indexing on ORC is oriented towards statistics that help the optimizer plan the query. This can be very important in split generation (determining which parts of the input wil

RE: ORC format

2016-02-01 Thread Mich Talebzadeh
Please also bear in mind that data in Hive is supposed to be immutable (although later versions allow one to update data in Hive). Data coming to Hive is supposed to be cleansed already, sourced from other transactional databases etc. So the concept of primary key (meaning enforcing uniqueness fo

RE: ORC format

2016-02-01 Thread Mich Talebzadeh
In relational databases, say Oracle or Sybase, there is only one primary key for a given table. So by definition you can have one primary key on any table, consisting of either one column or multiple columns (a composite primary key). Please check threads on “ORC files and statistics” in this forum for de

Re: ORC format

2016-02-01 Thread Philip Lee
What do you mean by the silver bullet? So you mean it is not stored as a primary key on each column; it is just stored as a storage index, right? "The statistics help the optimiser. So whether one table or many, the optimiser will take advantage of stats to push down the predicate for faster

Re: ORC format

2016-02-01 Thread Philip Lee
Also, when making ORC from CSV, is an index key made for every column, or is a single primary key made for the table? If keys are made on each column in a table, accessing any column in operations such as filtering should be faster. On Mon, Feb 1, 2016 at 4:21 PM, Philip Lee wrote: > Hello, > > I e

RE: ORC format

2016-02-01 Thread Mich Talebzadeh
Use orcfiledump to see the stats for each column etc. Example: hive --orcfiledump --rowindex 1,2 /user/hive/warehouse/oraclehadoop.db/orctest/00_0

RE: ORC format

2016-02-01 Thread Mich Talebzadeh
Hi, ORC tables use what is known as a storage index, with stats (min, max, sum etc.) stored at the table, stripe and row index (batches of 10K rows) level. The statistics help the optimiser. So whether one table or many, the optimiser will take advantage of stats to push down the predicate for fa
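To make the idea of min/max statistics and predicate pushdown concrete, here is a small Python sketch. It is not ORC's actual reader code: the `RowGroup` class and both function names are hypothetical, and the batch size is 10 rows rather than the 10K mentioned above, purely to keep the example short. It shows only the general technique of skipping a batch of rows whose stored min/max range cannot contain the value being filtered on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """A batch of column values plus the min/max statistics kept for it."""
    values: List[int]
    min_value: int
    max_value: int


def build_row_groups(values, batch_size):
    """Split a column into fixed-size batches and record min/max for each."""
    groups = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        groups.append(RowGroup(batch, min(batch), max(batch)))
    return groups


def scan_equals(groups, target):
    """Evaluate `column == target`, reading only batches that may match.

    Returns the matching values and the number of batches actually read.
    """
    matches = []
    groups_read = 0
    for group in groups:
        # The statistics prove this batch cannot contain the target: skip it.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


groups = build_row_groups(list(range(100)), batch_size=10)
matches, groups_read = scan_equals(groups, 42)
```

In this sorted example only one of the ten batches is read. On unsorted data the min/max ranges overlap more, so fewer batches can be skipped.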

ORC format

2016-02-01 Thread Philip Lee
Hello, I am experimenting with the performance of some systems on ORC versus CSV files. I read the ORC documentation on the Hive website, but am still curious about some things. I know the ORC format is faster for filtering or reading because it has indexing. Does it have an advantage in joining two tables of ORC dataset as wel