The only real difference between a managed table and an external table is the location of the data file on HDFS.
- Managed table: Say you have some data you want to analyze in Hive, with fields (foo, bar). So you CREATE a table "table_name", with fields (foo,bar) and LOAD the data into that table using the Hive cli client. That's a 'managed' table, and hive will decide where the data gets stored on HDFS ( probably /hive/warehouse/table_name/...) - External table: Say that instead, you have some data already on HDFS, maybe in /user/login/data/..., also of the form (foo,bar). Maybe you use this data in some MapReduce processes, so you wouldn't want to move it out of the directory it's in. But maybe you DO want to analyze it using Hive. So you CREATE EXTERNAL a table "table_name", and include the path to the data. In this case, you don't LOAD the data using the Hive client, because the data is already on HDFS. You just point Hive to the location of the data. You can find out where Hive thinks the data for a table is by issuing the command: describe extended table_name; The path is in the (somewhat messy) result. To convince yourself that there's really NO DIFFERENCE between a managed table and an external table, create a managed table and then hadoop fs -mv the data for it from an 'external' location into /hive/warehouse/table_name/... as long as the schema matches up with the file, hive will magically find the data. One important realization is that Hive doesn't "do" anything to your data in either case. The data is just stored as a set of files on HDFS (say, tab delimited), the contents of which look roughly like: foo1\tbar1 foo2\tbar2 foo3\tbar3 foo4\tbar4 etc... http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL -s P.S. As an aside, it should be clear by now that Ranjith's comment is correct: if you want multiple schemas for the same data, you'll need to make N-1 of them EXTERNAL tables, since Hive will create a copy of your data under /hive/warehouse/tableN/... for each table you LOAD to. You could create the first table as a managed table, and then point all the others to the directory which Hive creates. On Tue, Apr 26, 2011 at 9:42 AM, Raghunath, Ranjith < [email protected]> wrote: > From my understanding of external tables they are great if we need > multiple schemas associated with the same base data. A place where this > could come handy is if we using custom SerDe implementation. The drawback > that I have noticed has to do with binary file formats. One will have to use > a custom map reduce job to generate the base data if you plan to use that as > an external table. Please feel free to correct me here if I am wrong. > > > Thanks, > Ranjith > > ------------------------------ > *From*: Shantian Purkad <[email protected]> > *To*: [email protected] <[email protected]> > *Sent*: Tue Apr 26 11:18:46 2011 > *Subject*: External tables pros & cons > > Hi, > > What is the difference between external table and managed tables (apart > from data being stored outside hive warehouse and not deleted when table is > dropped) > > Are there any drawbacks of external tables? Or lack of support for > certain features (such as indexes, joins, subqueries etc etc )? > > When and why should one not use external tables? > > Thanks and Regards, > Shantian >
