Re: Start hiveserver2 as a daemon

2014-12-05 Thread Jörn Franke
Have you tried nohup? On 5 Dec 2014 at 15:25, "peterm_second" wrote: > Hi guys, > How can I launch HiveServer2 as a daemon? > I am launching HiveServer2 using sshpass and I can't detach it > from my terminal. Is there a way to daemonise HiveServer2? > > I've also tried usi
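
For reference, a minimal sketch, assuming a standard Hive installation: nohup $HIVE_HOME/bin/hiveserver2 > hiveserver2.log 2>&1 & detaches HiveServer2 from the terminal so it keeps running after the ssh session ends.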

Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-01 Thread Jörn Franke
Hello, I think you first have to think about your functional and non-functional requirements. You can scale "normal" SQL databases as well (cf. CERN or Facebook). There are different types of databases for different purposes - there is no one-size-fits-all. At the moment, we are a few years away from

Re: Create ORC Table on Tez Failed

2015-06-04 Thread Jörn Franke
It might be an access-rights problem of the Hive server user. On Thu, 4 Jun 2015 at 11:53, Chinna Rao Lalam wrote: > Hi, > > If your table name is orc_table, in the exception I can see the table > name as "test" > > Moving data to: hdfs://:8020/apps/hive/warehouse/test > > Failed with exception

Verifying that a query uses orc bloom filters, orc storage indexes

2015-07-30 Thread Jörn Franke
Hi, Is there any official way to verify that a query leveraged ORC bloom filters or ORC indexes? For example, the number of bytes (rows) not processed thanks to bloom filters or storage indexes? Some indicator in the explain output? Thank you. Best regards

Re: Hive on Tez much slower than MR

2015-08-06 Thread Jörn Franke
Always use the newest version of Hive. You should use ORC or Parquet wherever possible. If you use ORC then you should explicitly enable storage indexes and insert your table sorted (e.g. for the query below you would sort on x). Additionally you should enable statistics. Compression may bring addit
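
A minimal sketch of this advice (table and column names are hypothetical):

  -- ORC table with storage indexes and a bloom filter on the filter column
  CREATE TABLE events_orc (x INT, y STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.create.index'='true', 'orc.bloom.filter.columns'='x');

  -- insert sorted on x so the ORC min/max indexes can skip row groups
  INSERT OVERWRITE TABLE events_orc
  SELECT x, y FROM events_staging SORT BY x;

  -- gather table and column statistics for the optimizer
  ANALYZE TABLE events_orc COMPUTE STATISTICS;
  ANALYZE TABLE events_orc COMPUTE STATISTICS FOR COLUMNS;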

Re: Persistent (and possibly asynchronous) Hive access from within Scala

2015-08-07 Thread Jörn Franke
I have no problem using JDBC with HiveServer2. I think you need the hive*jdbc*standalone.jar and, I think, hadoop-common*.jar On Fri, 7 Aug 2015 at 5:23, Stephen Bly wrote: > What library should I use if I want to make persistent connections from > within Scala/Java? I’m working on a web ser

Re: Error starting the Hive Shell

2015-08-13 Thread Jörn Franke
Maybe there is another, older log4j library in the classpath? On Fri, 14 Aug 2015 at 5:34, Praveen Sripati wrote: > Hi, > > I installed Java 1.8.0_51, Hadoop 1.2.1 and Hive 1.2.1 on Ubuntu 14.04 64 > bit, and I get the below exception when I start the hive shell or > beeline. How do I get a

Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Jörn Franke
Additionally, although it is a PoC, you should have a realistic data model, and good data-modeling practices should be taken into account. Joining on a double is not one of them; it should be an int. Furthermore, double is a type that is rarely used in most scenarios. In the business

Re: HiveMetaStoreClient

2015-08-26 Thread Jörn Franke
What about using the HCatalog APIs? On Wed, 26 Aug 2015 at 8:27, Jerrick Hoang wrote: > Hi all, > > I want to interact with the HiveMetaStore from code and was looking at > http://hive.apache.org/javadocs/r0.13.1/api/metastore/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html > , wa

Re: HiveMetaStoreClient

2015-08-26 Thread Jörn Franke
Why not use the HCatalog web service API? On Wed, 26 Aug 2015 at 18:44, Jerrick Hoang wrote: > Ok, I'm super confused now. The hive metastore is an RDBMS database. I > totally agree that I shouldn't access it directly via jdbc. So what about > using this class > http://hive.apache.org/javadocs/r0.

Re: HiveServer with LDAP

2015-09-19 Thread Jörn Franke
What do you mean by "it is not working"? You may also check the logs of your LDAP server... Maybe there is also a limit on the number of logins on your LDAP server... Maybe the account is temporarily blocked because you entered the password wrongly too many times... On Fri, 18 Sept 2015 at 10:34, Lo

Re: Getting dot files for DAGs

2015-09-30 Thread Jörn Franke
Why not use the Tez UI? On Thu, 1 Oct 2015 at 2:29, James Pirz wrote: > I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries. > I am interested in checking the DAGs for my queries visually, and I realized > that I can do that with graphviz once I get the "dot" files of my DAGs. My > issue is I can no

Re: Hive 1.2.1 installation troubleshooting - No known driver to handle "jdbc://hive2://:10000"

2015-10-08 Thread Jörn Franke
You could edit the beeline script and add the driver there to the classpath. On Thu, 8 Oct 2015 at 16:02, Timothy Garza wrote: > I’ve installed Hive 1.2.1 on Amazon Linux AMI release 2015.03, master-node > of a Hadoop cluster. > > > > I can successfully access the Beeline client but when I try t

Re: clarification please

2015-10-29 Thread Jörn Franke
> On 29 Oct 2015, at 06:43, Ashok Kumar wrote: > > hi gurus, > > kindly clarify the following please > > Hive currently does not support indexes or indexes are not used in the query Not correct. See https://snippetessay.wordpress.com > The lowest granularity for concurrency is partition. If ta

Re: Best way to load CSV file into Hive

2015-10-31 Thread Jörn Franke
You clearly need to escape those characters, as with any other tool. You may want to use Avro instead of CSV, XML or JSON etc. > On 30 Oct 2015, at 19:16, Vijaya Narayana Reddy Bhoomi Reddy > wrote: > > Hi, > > I have a CSV file which contains a hundred thousand rows and about 200+ > columns. So
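
For illustration, a sketch with the OpenCSVSerde that ships with Hive, which handles quoting and escaping (table, columns and path are hypothetical):

  CREATE EXTERNAL TABLE raw_csv (col1 STRING, col2 STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar"     = "\"",
    "escapeChar"    = "\\")
  LOCATION '/data/raw_csv';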

Re: Hive Insert taking a lot of time

2015-11-02 Thread Jörn Franke
What is the create table statement? You may want to insert everything into the ORC table (sorted on x and/or y) and then apply the where clause in your queries on the ORC table. > On 02 Nov 2015, at 13:36, Kashif Hussain wrote: > > Hi, > I am trying to insert data into an orc table from a tex

Re: Min-Max Index vs Bloom filter

2015-11-02 Thread Jörn Franke
A bloom filter only works for = predicates, and min/max for <, > and =; however, the latter only works for numeric values, while the bloom filter works on nearly all types. Additionally, the bloom filter is a probabilistic data structure. For both it makes sense that the data is sorted on the column which is most selective
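
A sketch of how the two differ in practice, assuming an ORC table with hypothetical names (orc.bloom.filter.fpp tunes the false-positive probability):

  CREATE TABLE lookups_orc (id STRING, amount INT)
  STORED AS ORC
  TBLPROPERTIES ('orc.bloom.filter.columns'='id',
                 'orc.bloom.filter.fpp'='0.05');

  -- min/max indexes can skip row groups for range predicates ...
  SELECT count(*) FROM lookups_orc WHERE amount BETWEEN 100 AND 200;
  -- ... while the bloom filter helps point lookups on the string column
  SELECT * FROM lookups_orc WHERE id = 'abc123';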

Re: hive metastore update from 0.12 to 1.0

2015-11-03 Thread Jörn Franke
Probably you started the new Hive version before upgrading the schema. This means manual fixing. > On 03 Nov 2015, at 11:56, Sanjeev Verma wrote: > > Hi > > I am trying to update the metastore using schematool but am getting an error > > schematool -dbType derby -upgradeSchemaFrom 0.12 > > Upg

Re: Hive alternatives?

2015-11-05 Thread Jörn Franke
First, it depends on what you want to do exactly. Second, Hive > 1.2 with Tez as the execution engine (I recommend >= 0.8) and ORC as the storage format can be pretty quick, depending on your use case. Additionally you may want to employ compression, which is a performance boost once you understand how stor
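
For reference, a minimal sketch of the session settings involved, assuming Tez is installed (property names as in Hive 1.x):

  -- switch the execution engine for the session
  SET hive.execution.engine=tez;
  -- compress the final job output
  SET hive.exec.compress.output=true;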

Re: Hive and HBase

2015-11-10 Thread Jörn Franke
Probably it is outdated. Hive can access HBase tables via external tables. The execution engine in Hive can be MR, Tez or Spark. HiveQL is nowadays very similar to SQL. In fact, Hortonworks plans to make it SQL:2011 analytics compatible. HBase can be accessed independently of Hive via SQL using Phoenix
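
A sketch of the external-table route via Hive's HBase storage handler (table and column-family names are hypothetical):

  CREATE EXTERNAL TABLE hbase_events (rowkey STRING, val STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val')
  TBLPROPERTIES ('hbase.table.name' = 'events');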

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-20 Thread Jörn Franke
I recommend using a Hadoop distribution containing these technologies. I think you also get other useful tools for your scenario, such as auditing using Sentry or Ranger. > On 20 Nov 2015, at 10:48, Mich Talebzadeh wrote: > > Well > > “I'm planning to deploy Hive on Spark but I can't find t

Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

2015-11-20 Thread Jörn Franke
I think the most recent versions of Cloudera or Hortonworks should include all these components - try their sandboxes. > On 20 Nov 2015, at 12:54, Dasun Hegoda wrote: > > Where can I get a Hadoop distribution containing these technologies? Link? > >> On Fri, Nov 20, 201

Re: Building Rule Engine/ Rule Transformation

2015-11-29 Thread Jörn Franke
Why not implement a Hive UDF in Java? > On 28 Nov 2015, at 21:26, Mahender Sarangam > wrote: > > Hi team, > > We need expert input on how to implement a rule engine in Hive. Do you > have any references available for implementing rules in Hive/Pig. > > > We are migrating our Stored Proced
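
Once such a UDF is packaged, registering it in Hive is a one-liner; a sketch with hypothetical jar, class and table names:

  ADD JAR /tmp/rule-engine-udf.jar;
  CREATE TEMPORARY FUNCTION apply_rules AS 'com.example.RuleEngineUDF';
  SELECT apply_rules(claim_status, claim_amount) FROM claims;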

Re: Using spark in tandem with Hive

2015-12-01 Thread Jörn Franke
How did you create the tables? Do you have automated statistics activated in Hive? Btw, MR is outdated as a Hive execution engine. Use Tez (maybe wait for 0.8 for sub-second queries) or use Spark as an execution engine in Hive. > On 01 Dec 2015, at 17:40, Mich Talebzadeh wrote: > > What if we

Re: how to get counts as a byproduct of a query

2015-12-02 Thread Jörn Franke
I am not sure I understand, but why should this not be possible using SQL in Hive? > On 02 Dec 2015, at 21:26, Frank Luo wrote: > > Didn’t get any response, so trying one more time. I cannot believe I am the > only one facing the problem. > > From: Frank Luo > Sent: Tuesday, December 0

Re: Handling LZO files

2015-12-03 Thread Jörn Franke
How many nodes, cores and how much memory do you have? What Hive version? Do you have the opportunity to use Tez as an execution engine? Usually I use external tables only for reading them and inserting them into a table in ORC or Parquet format for doing analytics. This is much more performant than JSON

Re: Handling LZO files

2015-12-03 Thread Jörn Franke
ORC or PARQUET, requires us to load 5 years of LZO data in ORC or > PARQUET format. Though it might be performance efficient, it increases data > redundancy. > But we will explore that option. > > Currently I want to understand when I am unable to scale up mappers. > > Tha

Re: Handling LZO files

2015-12-04 Thread Jörn Franke
for analytics the ORC or Parquet format. > On 03 Dec 2015, at 15:28, Jörn Franke wrote: > > Your Hive version is too old. You may also want to use another execution > engine. I think your problem might then be related to external tables, for > which the parameter you set probably

Re: Hive Support for Unicode languages

2015-12-04 Thread Jörn Franke
What operating system are you using? > On 04 Dec 2015, at 01:25, mahender bigdata > wrote: > > Hi Team, > > Does Hive support Unicode encodings like UTF-8, UTF-16 and UTF-32? I would like to > see different languages supported in Hive tables. Is there any serde which can > show exactly Japanese, Ch

Re: Loading data from HDFS to hive and leading to many NULL value in hive table

2015-12-15 Thread Jörn Franke
You forgot to tell Hive that the file is comma-separated. You may want to use the CSV serde. > On 16 Dec 2015, at 07:15, zml张明磊 wrote: > > I am confused about the following result. Why does the hive table have so many > NULL values? > > hive> select * from managers; > OK > fergubo01m,BS1,31,20,10
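
The simplest fix, as a sketch (column names are hypothetical, guessed from the sample row):

  -- declare the delimiter so fields are parsed instead of loading as NULL
  CREATE TABLE managers (player_id STRING, team STRING, g INT, w INT, l INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;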

Re: Error

2015-12-16 Thread Jörn Franke
Do you have the create table statement? The sqoop command? > On 17 Dec 2015, at 07:13, Trainee Bingo wrote: > > Hi All, > > I have a sqoop script which brings data from Oracle and dumps it to HDFS. > Then that data is exposed to a hive external table. But when I do: > hive> select * from ; >

Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Jörn Franke
Hive has the export/import commands; alternatively, Falcon + Oozie. > On 17 Dec 2015, at 17:21, Elliot West wrote: > > Hello, > > I'm thinking about the steps required to repeatedly push Hive datasets out > from a traditional Hadoop cluster into a parallel cloud-based cluster. This > is not a one
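
A minimal sketch of the export/import route (table name and paths are hypothetical):

  -- on the source cluster: writes data plus metadata to HDFS
  EXPORT TABLE sales TO '/staging/sales_export';
  -- copy the directory across clusters (e.g. with distcp), then on the target:
  IMPORT TABLE sales FROM '/staging/sales_export';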

Re: The advantages of Hive/Hadoop comnpared to Data Warehouse

2015-12-18 Thread Jörn Franke
I think you should draw attention to the fact that Hive is just one component in the ecosystem. You can have many more components, such as ELT, integration of unstructured data, machine learning, streaming data etc. However, usually analysts are not aware of the technologies, and IT staff is not much

Re: Executor getting killed when running Hive on Spark

2015-12-24 Thread Jörn Franke
Have you checked what the issue is with the log file causing trouble? Is enough space available? Access rights (what is the user of the Spark worker)? Does the directory exist? Can you provide more details on how the table is created? Does the query work with MR or Tez as an execution engine? Does a n

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Jörn Franke
Have you tried it with Hive on Tez? It contains (currently) more optimizations than Hive on Spark. I assume you use the latest Hive version. Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use t

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-31 Thread Jörn Franke
You are using an old version of Spark that cannot leverage all optimizations of Hive, so I think your conclusion cannot be drawn as easily as you might think. > On 31 Dec 2015, at 19:34, Mich Talebzadeh wrote: > > Ok guys. > > I have not succeeded in installing TEZ yet, so I can try the query

Re: Is Hive Index officially not recommended?

2016-01-05 Thread Jörn Franke
You can still use the MR execution engine for maintaining the index. Indeed, with the ORC or Parquet format there are min/max indexes and bloom filters, but you need to sort your data appropriately to benefit from the performance. Alternatively you can create redundant tables sorted in different orders. T
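
For reference, a sketch of the classic index DDL this refers to, with hypothetical names:

  CREATE INDEX t_x_idx ON TABLE t (x)
  AS 'COMPACT' WITH DEFERRED REBUILD;

  SET hive.execution.engine=mr;   -- index maintenance runs on MR
  ALTER INDEX t_x_idx ON t REBUILD;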

Re: Is Hive Index officially not recommended?

2016-01-05 Thread Jörn Franke
Btw, this is not Hive-specific; the same applies to other relational database systems, such as Oracle Exadata. > On 05 Jan 2016, at 20:57, Jörn Franke wrote: > > You can still use execution Engine mr for maintaining the index. Indeed with > the ORC or parquet format there are min/max

Re: Indexes in Hive

2016-01-05 Thread Jörn Franke
If I understand you correctly this could be just another Hive storage format. > On 06 Jan 2016, at 07:24, Mich Talebzadeh wrote: > > Hi, > > Thinking loudly. > > Ideally we should consider a totally columnar storage offering in which each > column of table is stored as compressed value (I disr

Re: Indexes in Hive

2016-01-06 Thread Jörn Franke
I am not sure how much performance one could gain in comparison to ORC or Parquet. They work pretty well once you know how to use them. However, there are still ways to optimize them. For instance, sorting of data is a key factor for these formats to be efficient. Nevertheless, if you have a lot of

Re: Impact of partitioning on certain queries

2016-01-07 Thread Jörn Franke
This observation is correct, and it is the same behavior as you see in other databases supporting partitions. Usually you should avoid many small partitions. > On 07 Jan 2016, at 23:53, Mich Talebzadeh wrote: > > Ok we hope that partitioning improves performance where the predicate is on >

Re: Impact of partitioning on certain queries

2016-01-08 Thread Jörn Franke
Try explain dependency. > On 08 Jan 2016, at 10:47, Mich Talebzadeh wrote: > > Thanks Gopal. > > Basically the following is true: > > 1. The storage layer is HDFS > 2. The execution engine is MR, Tez, Spark etc > 3. The access layer is Hive > > When we say the access layer is Hive,
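
A sketch of what that looks like (the query is hypothetical):

  EXPLAIN DEPENDENCY
  SELECT count(*) FROM sales WHERE year = 2015;
  -- prints, as JSON, the input tables and the concrete partitions the
  -- query would read, so you can verify that partition pruning works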

Re: optimize joins in hive 1.2.1

2016-01-18 Thread Jörn Franke
Do you have a data model? Basically, modern technologies such as Hive, but also relational databases, suggest prejoining tables and working on big flat tables. The reason is that they are distributed systems and you should avoid transferring a lot of data between nodes for each query. Hence,

Re: ORC files and statistics

2016-01-19 Thread Jörn Franke
Just be aware that you should insert the data sorted at least on the most discriminating column of your where clause. > On 19 Jan 2016, at 17:27, Owen O'Malley wrote: > > It has both. Each index has statistics of min, max, count, and sum for each > column in the row group of 10,000 rows. It also

Re: ORC files and statistics

2016-01-19 Thread Jörn Franke
ote: > > Thanks Owen, > > I got a bit confused comparing ORC with what I know about indexes in > relational databases. Still need to understand it a bit better. > > Regards > > From: Owen O'Malley [mailto:omal...@apache.org] > Sent: 19 January 2016 17:57 > To: user

Re: Importing Oracle data into Hive

2016-01-31 Thread Jörn Franke
Well, you can create an empty Hive table in ORC format and use --hive-overwrite in sqoop. Alternatively you can use --hive-import and set hive.default.fileformat. I recommend defining the schema properly on the command line, because sqoop detection of formats is based on jdbc (Java) types which is no

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Jörn Franke
Check HiveMall > On 03 Feb 2016, at 05:49, Koert Kuipers wrote: > > yeah but have you ever seen somewhat write a real analytical program in hive? > how? where are the basic abstractions to wrap up a large amount of operations > (joins, groupby's) into a single function call? where are the tool

Re: Optimizing external table structure

2016-02-13 Thread Jörn Franke
How many disk drives do you have per node? Generally one node should have 12 drives, configured neither as RAID nor as LVM. Files could be a little bit larger (4 or better 40 GB - your namenode will thank you), or use Hadoop Archive (HAR). I am not sure about the latest status of Phoeni

Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-15 Thread Jörn Franke
Why should it not be OK if you do not miss any functionality? You can use Oozie + Hive queries to get more sophisticated logging and scheduling. Do not forget to do proper capacity/queue management. > On 16 Feb 2016, at 07:19, Ramasubramanian > wrote: > > Hi, > > Is it ok to build an entire

Re: Hive 2 performance

2016-02-24 Thread Jörn Franke
I am not sure what you are looking for. Performance has many influence factors... > On 24 Feb 2016, at 18:23, Mich Talebzadeh > wrote: > > Hi, > > > > Has anyone got some performance matrix for Hive 2 from user perspective? > > It looks very impressive on ORC tables. > > thanks > > --

Re: Hive 2 performance

2016-02-24 Thread Jörn Franke
how fast it returns the results in this case compare to 1.2.1 etc > > thanks > >> On 24/02/2016 17:25, Jörn Franke wrote: >> >> I am not sure what you are looking for. Performance has many influence >> factors... >> >>> On 24 Feb 2016, at 18:23, Mich

Re: Hive and Impala

2016-03-02 Thread Jörn Franke
I think you can always make a benchmark that produces this or that result. You always have to see what is evaluated, and generally I recommend to always try it yourself on your data and your queries. There is also a lot of change within the projects. Impala may have Kudu, but Hive has ORC, Tez and Spark

Re: Hive and Impala

2016-03-02 Thread Jörn Franke
It always depends on what you want to do and thus from experience I cannot agree with your comment. Do you have any reasoning for this statement? > On 02 Mar 2016, at 19:14, Dayong wrote: > > Tez is kind of outdated and Orc is so dedicated on hive. In addition, hive > metadata store can be de

Re: read-only mode for hive

2016-03-08 Thread Jörn Franke
What is the use case? You can try security solutions such as Ranger or Sentry. As already mentioned another alternative could be a view. > On 08 Mar 2016, at 21:09, PG User wrote: > > Hi All, > I have one question about putting hive in read-only mode. > > What are the ways of putting hive in r

Re: Hive Context: Hive Metastore Client

2016-03-09 Thread Jörn Franke
Apache Knox for authentication makes sense. For Hive authorization there are tools such as Apache Ranger or Sentry, which themselves can connect via LDAP. > On 09 Mar 2016, at 16:58, Alan Gates wrote: > > One way people have gotten around the lack of LDAP connectivity in HS2 has > been to use

Re: Hive_CSV

2016-03-09 Thread Jörn Franke
Why don't you load all data and use just two columns for querying? Alternatively use regular expressions. > On 09 Mar 2016, at 18:43, Ajay Chander wrote: > > Hi Everyone, > > I am looking for a way to ignore the first occurrence of the delimiter while > loading the data from a csv file to h
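
The regular-expression route could look like this sketch with Hive's RegexSerDe - the pattern splits only on the first comma (names, regex and path are illustrative):

  CREATE EXTERNAL TABLE raw_two_cols (head STRING, rest STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "([^,]*),(.*)")
  LOCATION '/data/raw';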

Re: Hive_CSV

2016-03-09 Thread Jörn Franke
The data is already in CSV, so that does not matter for querying. It is recommended to convert it to ORC or Parquet for querying. > On 09 Mar 2016, at 19:09, Ajay Chander wrote: > > Daniel, thanks for your time. Is it like creating two tables, one to get > all the data and another one to t

Re: ODBC drivers for Hive 2

2016-03-10 Thread Jörn Franke
Just out of curiosity: what is the code base for the ODBC drivers by Hortonworks, Cloudera & co? Did they develop them on their own? If yes, maybe one should think about an open-source one which is reliable and supports a richer set of ODBC functionality. Especially in the light of ORC, Parque

Re: Hive 0.12 MAPJOIN hangs sometimes

2016-03-11 Thread Jörn Franke
Honestly, 0.12 is a no-go - you miss a lot of performance improvements. Probably your query would execute in less than a minute. If your Hadoop vendor does not support smooth upgrades then change it. Hive 1.2.1 is the absolute minimum, including using ORC or Parquet as a table format and Tez (pref

Re: De-identification_in Hive

2016-03-19 Thread Jörn Franke
What are your requirements? Do you need to omit a column? Transform it? Make the anonymized version joinable, etc.? There is not simply one function. > On 17 Mar 2016, at 14:58, Ajay Chander wrote: > > Hi Everyone, > > I have a csv file which has some sensitive data in a particular column in it.
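
As one example of the trade-offs: hashing keeps a column joinable without exposing the raw value. A sketch assuming a Hive version with the sha2() UDF (names are hypothetical):

  -- the same input always yields the same digest, so joins still work;
  -- a fixed salt makes dictionary attacks harder (manage it as a secret)
  SELECT sha2(concat('fixed_salt', sensitive_col), 256) AS pseudonym,
         other_col
  FROM customers_raw;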

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size? > On 16 Mar 2016, at 11:23, Joseph wrote: > > Hi all, > > I have learned that ORC provides three levels of indexes within each file: file > level, stripe level, and row level. > The fi

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
minal_type = 25080; > select * from gprs where terminal_type = 25080; > > In the gprs table, the "terminal_type" column's value is in [0, 25066] > > Joseph > > From: Jörn Franke > Date: 2016-03-16 19:26 > To: Joseph > CC: user; user > Subject: Re

Re: Issue joining 21 HUGE Hive tables

2016-03-23 Thread Jörn Franke
Joining so many external tables is always an issue with any component. Your problem is not Hive-specific, but your data model seems to be messed up. First of all you should have the tables in an appropriate format, such as ORC or Parquet, and they should not be external. Then you should use the r

Re: Hive on Spark engine

2016-03-26 Thread Jörn Franke
If you check the newest Hortonworks distribution then you see that it generally works. Maybe you can borrow some of their packages. Alternatively it should be also available in other distributions. > On 26 Mar 2016, at 22:47, Mich Talebzadeh wrote: > > Hi, > > I am running Hive 2 and now Spar

Re: Hive Metastore Bottleneck

2016-03-30 Thread Jörn Franke
Is the MySQL database virtualized? Are there bottlenecks in the storage of the MySQL database? Could the network be a bottleneck? Are firewalls blocking new connections in case of a sudden connection increase? > On 30 Mar 2016, at 23:28, Udit Mehta wrote: > > Hi all, > > We are currently running Hive in productio

Re: analyse command not working on decimal(38,0) datatype

2016-04-06 Thread Jörn Franke
Please provide the exact log messages, create table statements and insert statements. > On 06 Apr 2016, at 12:05, Ashim Sinha wrote: > > Hi Team > Need help for the issue > Steps followed > table created > Loaded the data of length 38 in decimal type > Analyse table - for columns gives error like zero

Re: Mappers spawning Hive queries

2016-04-16 Thread Jörn Franke
Just out of curiosity, what is the use case behind this? How do you call the shell script? > On 16 Apr 2016, at 00:24, Shirish Tatikonda > wrote: > > Hello, > > I am trying to run multiple hive queries in parallel by submitting them > through a map-reduce job. > More specifically, I have a

Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Jörn Franke
You could also explore the in-memory option of Oracle 12c. However, I am not sure how beneficial it is for OLTP scenarios. I am excited to see how the performance will be with HBase as a Hive metastore. Nevertheless, your results on Oracle/SSD will be beneficial for the community. > On 17 Apr 2016,

Re: Hive footprint

2016-04-20 Thread Jörn Franke
It really depends on what you want to do. Hive is more for queries involving a lot of data, whereas HBase + Phoenix is more for OLTP scenarios or sensor ingestion. I think the reason is that Hive has been the entry point for many engines and formats. Additionally there are a lot of tuning capabilities fr

Re: Hive footprint

2016-04-20 Thread Jörn Franke
Hive has working indexes. However many people overlook that a block is usually much larger than in a relational database and thus do not use them right. > On 19 Apr 2016, at 09:31, Mich Talebzadeh wrote: > > The issue is that Hive has indexes (not index store) but they don't work so > there we

Re: Hive external indexes incorporation into Hive CBO

2016-04-21 Thread Jörn Franke
I am still not sure why you think they are not used. The main issue is that the block size is usually very large (e.g. 256 MB, compared to kilobytes or sometimes a few megabytes in traditional databases) and the indexes refer to blocks. This makes it less likely that you can leverage them for small data

Re: Sqoop_Sql_blob_types

2016-04-27 Thread Jörn Franke
You could try binary. Is it just for storing the blobs or for doing analysis on them? In the first case you may think about storing them as files in HDFS and including in Hive just a string containing the file name (to make analysis on the other data faster). In the latter case you should thin

Analyzing Bitcoin blockchain data with Hive

2016-04-29 Thread Jörn Franke
Dear all, I prepared a small SerDe to analyze Bitcoin blockchain data with Hive: https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/ There are some example queries, and I will add more in the future. Additionally, more unit tests will be added. Let m

Re: Container out of memory: ORC format with many dynamic partitions

2016-04-29 Thread Jörn Franke
I would still need some time to dig deeper into this. Are you using a specific distribution? Would it be possible to upgrade to a more recent Hive version? However, having so many small partitions is a bad practice which seriously affects performance. Each partition should contain at least several

Re: Making sqoop import use Spark engine as opposed to MapReduce for Hive

2016-04-30 Thread Jörn Franke
I do not think you make it faster by setting the execution engine to Spark, especially with such an old Spark version. For simple things such as "dump" bulk imports and exports, it matters much less, if at all, which execution engine you use. There was recently a discussion on that on the

Re: Performance for hive external to hbase with serval terabyte or more data

2016-05-11 Thread Jörn Franke
Why don't you export the data from HBase to Hive, e.g. in ORC format? You should not use MR with Hive, but Tez. Also use a recent Hive version (at least 1.2). You can then run queries there. For large log-file processing in real time, one alternative, depending on your needs, could be Solr on Hadoop
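
A sketch of the export step, assuming an external table is already mapped over HBase as in the storage-handler example earlier (names hypothetical):

  -- materialize the HBase data into ORC for analytics
  CREATE TABLE logs_orc STORED AS ORC AS
  SELECT * FROM hbase_events;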

Re: Query Failing while querying on ORC Format

2016-05-17 Thread Jörn Franke
I do not remember exactly, but I think it worked simply by adding a new partition to the old table with the additional columns. > On 17 May 2016, at 15:00, Mich Talebzadeh wrote: > > Hi Mahendar, > > That version 1.2 is reasonable. > > One alternative is to create a new table (new_table) in H

Re: HIVE on Windows

2016-05-18 Thread Jörn Franke
Use a distribution such as Hortonworks. > On 18 May 2016, at 19:09, Me To wrote: > > Hello, > > I want to install hive on my windows machine but I am unable to find any > resource out there. I have been trying to set it up for one month but unable to > accomplish that. I have successfully set up

Re: Hive and XML

2016-05-22 Thread Jörn Franke
XML is generally slow in any software and is not recommended for large data volumes. > On 22 May 2016, at 10:15, Maciek wrote: > > Have you had to load XML data into Hive? Did you run into any problems or > experience any pain points, e.g. complex schemas or performance? > > I have done a lo

Re: Copying all Hive tables from Prod to UAT

2016-05-25 Thread Jörn Franke
Or use Falcon ... The Spark JDBC route I would try to avoid. JDBC is not designed for these big-data bulk operations: e.g. data has to be transferred uncompressed, and there is the serialization/deserialization overhead (query result -> protocol -> Java objects -> writing to a specific storage format etc.). This

Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Jörn Franke
Both have outdated versions; usually one can support you better if you upgrade to the newest. A firewall could be an issue here. > On 26 May 2016, at 10:11, Nikolay Voronchikhin > wrote: > > Hi PySpark users, > > We need to be able to run large Hive queries in PySpark 1.2.1. Users are > runni

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
h TEZ) or use Impala instead of Hive > etc as I am sure you already know. > > Cheers, > > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com &

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
an email to Hive user group to see anyone has managed to >>> built a vendor independent version. >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Jörn Franke
Thanks, very interesting explanation. Looking forward to testing it. > On 31 May 2016, at 07:51, Gopal Vijayaraghavan wrote: > > >> That being said, all systems are evolving. Hive supports Tez+LLAP which >> is basically the in-memory support. > > There is a big difference between where LLAP & Spark

Re: Internode Encryption with HiveServer2

2016-06-03 Thread Jörn Franke
This can be configured on the Hadoop level. > On 03 Jun 2016, at 10:59, Nick Corbett wrote: > > Hi > > > I am deploying Hive in a regulated environment - all data needs to be > encrypted when transferred and at rest. > > > If I run a 'select' statement, using HiveServer2, then a map reduce

Re: Convert date in string format to timestamp in table definition

2016-06-05 Thread Jörn Franke
Never use string when you can use int - the performance will be much better, especially for tables in ORC/Parquet format. > On 04 Jun 2016, at 22:31, Igor Kravzov wrote: > > Thanks Dudu. > So if I need the actual date I will use a view. > Regarding the partition column: I can create 2 external table

Re: insert query in hive

2016-06-08 Thread Jörn Franke
This is not the recommended way to load large data volumes into Hive. Check the external table feature, Sqoop, and the ORC/Parquet formats. > On 08 Jun 2016, at 14:03, raj hive wrote: > > Hi Friends, > > I have to insert data into a hive table from a Java program. Insert query > will work in

Re: Hive indexes without improvement of performance

2016-06-16 Thread Jörn Franke
The indexes are based on the HDFS block size, which is usually around 128 MB. This means that to hit a single row you must always load the full block. In traditional databases this block size is much smaller, so it is much faster. If the optimizer does not pick up the index then you can query the index directly (it is just a table)
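
A sketch of querying a compact index table directly; the generated name follows Hive's default__<table>_<index>__ pattern (names here are hypothetical):

  -- the compact index is itself a table holding the indexed column(s),
  -- the file name and the block offsets where each value occurs
  SELECT `_bucketname`, `_offsets`
  FROM default__t_t_x_idx__
  WHERE x = 42;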

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Hello, for no database (including traditional ones) is it advisable to fetch this amount through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes). It is a highly questionable approach from a reliability point of view. Export it as a file to HDFS and

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded / not efficiently implemented etc). However, that being said, do not use JDBC for this. > On 20 Jun 2016, at 17:28, Jörn Franke wrote: > > Hello, > > F

Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-21 Thread Jörn Franke
you saying that the reference command line interface > is not efficiently implemented? :) > > -David Nies > >> On 20.06.2016 at 17:46, Jörn Franke wrote: >> >> Aside from this the low network performance could also stem from the Java >> application receiv

Re: if else condition in hive

2016-06-21 Thread Jörn Franke
I recommend you rethink this as part of a bulk transfer, potentially even using separate partitions. It will be much faster. > On 21 Jun 2016, at 13:22, raj hive wrote: > > Hi friends, > > INSERT, UPDATE, DELETE commands are working fine in my Hive environment after > changing the configuration an

Re: loading in ORC from big compressed file

2016-06-22 Thread Jörn Franke
Marcin is correct: either split up the gzip files into smaller files of at least one HDFS block, or use bzip2 with block compression. What is the original format of the table? > On 22 Jun 2016, at 01:50, Marcin Tustin wrote: > > This is because a GZ file is not splittable at all. Basically, try
