We use Hive to manage hundreds of millions of machine log data files. These files are semi-structured, in the sense that we don't care about the full structure of a file up front, nor do they have a format that's easy to understand.
Even with less-structured data (e.g. medical notes), there is always metadata about the data and its context. This metadata and the 'blob' of data can fit in a row of a Hive table. We use UDFs and UDTFs to parse the blob portion of the data on an as-needed basis.

Another pattern is to use a sequence file: the value contains the blob, and the key contains the concatenated metadata object (think Avro encoding).

Storage can be on HDFS or in HBase. The choice depends more on read and write access pattern requirements than on how much structure the data has. The choice of processing tool (Pig / Hive / MapReduce) is better driven by the type of data flows (data pipelines) you need to build than by how much structure the data has. The one exception is nested data; I find Pig handles this more easily than Hive does.

The trick to managing semi-structured data via Hive/Pig is to use UDFs to parse what you need when you need it. All of the tools above support UDFs. MapReduce does it too, because it's already operating at the 'assembly language' level anyway.

- Douglas

From: Bill Busch <bigdat...@outlook.com>
Reply-To: <user@hive.apache.org>
Date: Wed, 3 Dec 2014 20:59:46 -0500
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: RE: Question

MapReduce can be used for both structured and unstructured data. Hive is a storage and retrieval mechanism (i.e. a database). The trouble with an RDBMS is that you either have to parse the unstructured data into a structured row/column format OR store it as an object. Both approaches have issues, in terms of performance and of semantics. Hence, there is a whole world of NoSQL databases out there that are not row-column structured. These databases can handle more schema-less/unstructured objects and will allow you to manipulate your information more elegantly.
I would check out the Wikipedia page on NoSQL databases and focus on key-value, columnar, or document databases.

________________________________
Date: Thu, 4 Dec 2014 07:06:16 +0530
Subject: Re: Question
From: mohan.25fe...@gmail.com
To: user@hive.apache.org

Thanks Gabriel for the prompt response.

I see online blogs saying that MapReduce is for unstructured data, Pig is for semi-structured data, and Hive is only for structured data. Can you please justify this?

Thanks in advance

On Thu, Dec 4, 2014 at 6:56 AM, Gabriel Eisbruch <gabrieleisbr...@gmail.com> wrote:

Hi Mohan,

We are using Hive for unstructured (or semi-structured) data by means of map columns. For example, we use standard columns for fixed data and map columns for dynamic data.

Gabriel.

2014-12-03 22:19 GMT-03:00 Mohan Krishna <mohan.25fe...@gmail.com>:

Is Hive only for structured data, or does it handle unstructured data as well?
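
As a rough illustration of the two patterns described in the replies above (fixed columns plus a map column for dynamic data, and a raw blob that is parsed only when needed by UDF-style functions), here is a minimal plain-Python sketch. It does not use Hive's actual UDF API. The row layout, the choice of JSON as the blob format, and the names LogRow, get_field and explode_events are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, Iterator, NamedTuple


class LogRow(NamedTuple):
    """One hypothetical table row: fixed metadata, a map column, and the raw blob."""
    host: str               # fixed column
    log_date: str           # fixed column
    attrs: Dict[str, str]   # map column holding dynamic key/value data
    blob: str               # unparsed payload, stored as-is (JSON assumed here)


def get_field(blob: str, key: str) -> Any:
    """UDF-like helper: pull one field out of the blob at query time.

    Returns None when the blob is not valid JSON or the key is absent.
    """
    try:
        doc = json.loads(blob)
    except ValueError:
        return None
    return doc.get(key) if isinstance(doc, dict) else None


def explode_events(blob: str) -> Iterator[Dict[str, Any]]:
    """UDTF-like helper: one input blob yields zero or more output rows."""
    try:
        doc = json.loads(blob)
    except ValueError:
        return
    if not isinstance(doc, dict):
        return
    events = doc.get("events")
    if isinstance(events, list):
        for event in events:
            if isinstance(event, dict):
                yield event


rows = [
    LogRow("web01", "2014-12-03", {"dc": "east"},
           '{"level": "ERROR", "events": [{"code": 500}, {"code": 503}]}'),
    LogRow("web02", "2014-12-03", {}, "not json at all"),
]

# Filter on fixed and map columns first, then parse only the blobs that remain.
for row in rows:
    if row.attrs.get("dc") == "east":
        print(row.host, get_field(row.blob, "level"),
              [e.get("code") for e in explode_events(row.blob)])
```

The point of the sketch is that nothing about the blob's structure is declared up front: the metadata columns are used for filtering, and the parsing cost is paid only for the rows and fields a given query needs.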