RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
Thanks Ryan, very useful to know indeed.

RE: Loading data containing newlines

2016-01-15 Thread Ryan Harris
Mich, if you have a toolpath that you can use to pipeline the required edits to the source file, you can use a chain similar to this:

    hadoop fs -text ${hdfs_path}/${orig_filename} | iconv -f EBCDIC-US -t ASCII | sed 's/\(.\{133\}\)/\1\n/g' | gzip -c | /usr/bin/hadoop fs -put - /etl/${table_name

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
Interesting points. I think we are now moving to another abstraction layer. Recall that all these extra features (Spark, Scala etc.) require a learning curve and potentially additional skill sets, which in practice may not be a viable option for many companies who have a lot of investment in

Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov
Hive requires you to provide a table schema even if you create a table based on a folder of ORC files (an ORC file already carries its schema internally). It's a shame, because ORC was originally a Hive-internal project. Spark can create a table based on ORC or Parquet files automatically, without asking you to provid

Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov
Probably Bryan can try both Hive and Spark and decide which one works better for him. The fact is, lots of companies migrate from Hadoop/Hive to Spark. If you like writing ETL using the Spark API, you can use map, reduceByKey, groupByKey, join, distinct, etc.; if you like using SQL then you can

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
OK, but I believe there are other similar approaches. I can take a raw CSV file and customise it using existing shell commands like sed, awk, cut, grep etc., among them getting rid of blank lines or replacing silly characters. Bottom line: I want to “eventually” store that CSV file in a Hive

Re: Loading data containing newlines

2016-01-15 Thread Gopal Vijayaraghavan
> You can open a file as an RDD of lines, and map whatever custom
> tokenisation function you want over it;

That's what a SerDe does in Hive (like OpenCSVSerDe). Once your record gets split into multiple lines, the problem becomes more complex, since Spark's functional nature demands side-eff
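For reference, a minimal sketch of the SerDe route Gopal mentions; the OpenCSVSerde class name is real, while the table name, columns and location are illustrative:

    CREATE EXTERNAL TABLE invoices_csv (
      invoice_number STRING, payment_date STRING, net STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
    STORED AS TEXTFILE
    LOCATION '/data/invoices';
    -- Note: OpenCSVSerde exposes every column as STRING; cast downstream.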

Re: Loading data containing newlines

2016-01-15 Thread Marcin Tustin
You can open a file as an RDD of lines, and map whatever custom tokenisation function you want over it; alternatively you can partition down to a reasonable size and use mapPartitions to map the standard Python csv parser over the partitions. In general, the advantage of Spark is that you can do

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
Hi Marcin, Can you be specific in what way Spark is better suited for this operation compared to Hive?

Re: Loading data containing newlines

2016-01-15 Thread Marcin Tustin
I second this. I've generally found anything else to be disappointing when working with data which is at all funky. On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov wrote: > Time to use Spark and Spark-Sql in addition to Hive? > It's probably going to happen sooner or later anyway. > > I sen

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
Hi Bryan, Thanks for this detailed explanation. We have also experimented with importing bzip2 files, and Hive is pretty good at handling them. We also need to negotiate empty lines and columns defined as currencies. I still think that mapping an external table to the raw (unfiltered) files i
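A sketch of that staging pattern under assumed names and paths: an external table over the raw bzip2 files, then a filtered copy into a clean ORC table:

    CREATE EXTERNAL TABLE invoices_raw (
      invoice_number STRING, payment_date STRING, net STRING, vat STRING, total STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/invoices';     -- Hive reads the .bz2 text files transparently

    -- Filtered copy into a clean table, dropping the empty lines:
    CREATE TABLE invoices_clean STORED AS ORC AS
    SELECT * FROM invoices_raw
    WHERE invoice_number IS NOT NULL AND invoice_number <> '';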

Parquet Writer: Configuring the number of row groups per file

2016-01-15 Thread rahul challapalli
Hi, I am trying to control the number of row groups written to each Parquet file when we do an "INSERT INTO" query. Can this be configured? - Rahul
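As far as I know there is no direct row-groups-per-file knob in Hive's Parquet writer; the count falls out of the row-group size relative to the data each writer task produces. A sketch of the usual indirect control via the standard Parquet property (table names are illustrative):

    SET parquet.block.size=67108864;   -- target row-group size in bytes (64 MB here)
    INSERT INTO TABLE my_parquet_table
    SELECT * FROM staging_table;       -- larger block size => fewer row groups per file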

RE: Importing currency figures with currency indicator to Hive from CSV

2016-01-15 Thread Mich Talebzadeh
OK, this is a convoluted way of doing it, using the SUBSTR and REGEXP_REPLACE UDFs to get rid of the ‘?’ and commas in the currency figures imported from the CSV file: 0: jdbc:hive2://rhes564:10010/default> select net, cast(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(10,2)) AS net_in_currency from t2;
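The same query spelled out, with the two steps annotated (t2 and net come from the thread):

    SELECT net,
           CAST(REGEXP_REPLACE(SUBSTR(net, 2, 20), ',', '') AS DECIMAL(10,2)) AS net_in_currency
    FROM t2;
    -- SUBSTR(net, 2, 20) drops the leading currency symbol;
    -- REGEXP_REPLACE then strips the thousands separators so the CAST succeeds.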

RE: Importing currency figures with currency indicator to Hive from CSV

2016-01-15 Thread Mich Talebzadeh
Hi, What is the equivalent of Oracle's TO_NUMBER function in Hive, please? Thanks
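Hive has no TO_NUMBER; the usual stand-in is CAST, with REGEXP_REPLACE when the string carries formatting characters. A sketch:

    -- Oracle: TO_NUMBER('12,345.67', '99,999.99')
    -- Hive equivalent:
    SELECT CAST(REGEXP_REPLACE('12,345.67', ',', '') AS DECIMAL(10,2));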

Importing currency figures with currency indicator to Hive from CSV

2016-01-15 Thread Mich Talebzadeh
Hi, How does one convert a column stored as String in Hive into Decimal, if possible? The Excel sheet looks like this:

    Invoice Number  Payment date  Net         VAT       Total
    360             10/02/2014    £10,000.00  £2000.00  £12,000.00

And the file (before bzip2) looks like this on the imported directory in Linu

Insert map and other complex types in hive using jdbc

2016-01-15 Thread Srikrishan Malik
Hello, I have a Java Map (Map<String,String>) and a JDBC connection to hive server. The schema of the table at the server contains a column of type Map<String,String>. Is it possible to insert the Java Map into the Hive table column with a similar datatype using JDBC? I tried: "create table test(key string, value map<string,string>)" "insert i
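Hive's INSERT ... VALUES does not accept complex types, so the usual route is to build the map in HiveQL via INSERT ... SELECT; a sketch using the table from the question (Hive 0.13+ allows SELECT without FROM):

    CREATE TABLE test (key STRING, value MAP<STRING,STRING>);

    -- Build the map with the map() constructor (str_to_map() works for delimited strings):
    INSERT INTO TABLE test
    SELECT 'k1', map('a', '1', 'b', '2');

Over JDBC these are just ordinary Statement.execute() calls with the map entries rendered into the statement text.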

RE: Converting date format and excel money format in Hive table

2016-01-15 Thread Mich Talebzadeh
Thanks, that solved the data conversion. How does one replace ?10,000.00 with £10,000.00?
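If the '£' has already been mangled to '?' on import, it can be patched at the query layer, though fixing the file encoding before load (e.g. with iconv, as in Ryan's pipeline) is cleaner. A sketch against t2 from the thread:

    SELECT REGEXP_REPLACE(net, '^\\?', '£') AS net_fixed FROM t2;
    -- '?' is a regex metacharacter, hence the escaping.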

Re: Converting date format and excel money format in Hive table

2016-01-15 Thread matshyeq
try: select cast(unix_timestamp('02/10/2014', 'dd/MM/yyyy')*1000 as timestamp); Kind Regards ~Maciek On 15 January 2016 at 10:15, Mich Talebzadeh wrote: > Hi, > I am importing an excel sheet saved as csv file comma separated and > compressed with bzip2 into Hive as external table with b
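Applied to the thread's data, the same conversion against a table column (payment_date is an assumed column name holding the dd/MM/yyyy strings):

    SELECT CAST(UNIX_TIMESTAMP(payment_date, 'dd/MM/yyyy') * 1000 AS TIMESTAMP) AS payment_ts
    FROM t2;
    -- unix_timestamp() returns seconds; the *1000 matches Hive's behaviour here
    -- of treating a BIGINT cast to TIMESTAMP as milliseconds.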

Re: Using python for hive table quering gives error

2016-01-15 Thread Gopal Vijayaraghavan
> i could find all examples using pyhs2 only. The pyhs2 site points to dropbox/pyhive. But even that doesn't work for me unless I replace the TCLIService dir with the one generated by thrift-0.9.2 (after HIVE-8829). You might have better luck, depending on the exact version of HiveServer2

Re: Using python for hive table quering gives error

2016-01-15 Thread Karimkhan Pathan
So what is the correct way to do this? I could find all examples using pyhs2 only. On Fri, Jan 15, 2016 at 3:40 PM, Gopal Vijayaraghavan wrote: > > import pyhs2 > ... > thrift.Thrift.TApplicationException: Required field 'sessionHandle' is > unset! Struct:TExecuteStatementReq(sessionHan

Converting date format and excel money format in Hive table

2016-01-15 Thread Mich Talebzadeh
Hi, I am importing an Excel sheet, saved as a comma-separated CSV file and compressed with bzip2, into Hive as an external table. The Excel sheet looks like this:

    Invoice Number  Payment date  Net         VAT       Total
    360             10/02/2014    £10,000.00  £2000.00  £12,000.00

And the file (before

Re: Using python for hive table quering gives error

2016-01-15 Thread Gopal Vijayaraghavan
> import pyhs2
> ...
> thrift.Thrift.TApplicationException: Required field 'sessionHandle' is
> unset! Struct:TExecuteStatementReq(sessionHandle:null, statement:USE
> default, confOverlay:{})

That's a version mismatch in the thrift protocol layer (JDBC to be precise). PyHS2 is deprecated and unmain

Fwd: Using python for hive table quering gives error

2016-01-15 Thread Karimkhan Pathan
I am trying to query a Hive table with basic example code:

    import pyhs2
    with pyhs2.connect(host='dmet-master05.inetu.net',
                       port=1,
                       authMechanism='PLAIN',
                       user='karim',
                       passwor

Group by and FROM_UNIXTIME function

2016-01-15 Thread PICQUENOT Samuel (i-BP - CGI)
Hello, Firstly, the FROM_UNIXTIME function's date pattern is case sensitive:

    * FROM_UNIXTIME(1451308548, 'yyyy-MM') --> 2015-12
    * FROM_UNIXTIME(1451308548, 'YYYY-MM') --> 2016-12

(because uppercase YYYY is the week-based-year pattern, not the calendar year, so the last days of December 2015 fall into week-year 2016). Consider the following que
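The safe form for the month-level GROUP BY the subject refers to, a sketch with an assumed events table and event_ts column of seconds-since-epoch:

    SELECT FROM_UNIXTIME(event_ts, 'yyyy-MM') AS month, COUNT(*) AS n
    FROM events
    GROUP BY FROM_UNIXTIME(event_ts, 'yyyy-MM');
    -- Lowercase 'yyyy' gives the calendar year; uppercase 'YYYY' would shift
    -- late-December rows into the next year's bucket.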