[jira] [Updated] (SQOOP-3240) Hive table Row count mismatch(more records in hadoop/hive than oracle) when importing data from oracle using sqoop import with direct mode

Suresh Venkatesan (JIRA) Wed, 27 Sep 2017 20:00:46 -0700

     [ 
https://issues.apache.org/jira/browse/SQOOP-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Suresh Venkatesan updated SQOOP-3240:
-------------------------------------
    Description: 
Hive/Hadoop row count was not matching on sqoop import of oracle table with 
direct mode, it's perfectly matching without direct mode.

See the below examples.

1. Row Count is varying based on fetch-size 10000 and direct mode.
2. Row count is not consistent, sometimes it is 630869 and sometimes 639924 
etc.,
3. We are pulling data with direct mode for 1 year with monthly partition, data 
mismatch happening for couple of months not for all months.
4. Hive/Hadoop table Count looks good, If we import the data from oracle 
without direct mode.

Actual row count in oracle table for year 2015 and month 12 is : 609924, we are 
getting more records on sqoop import with direct mode : 630869. hive table 
count is matching with oracle count on sqoop import *without direct* mode : 
609924

*1. sqoop import without direct mode, Row Count : 609924*

sqoop import -Dmapredue.job.queuename=queue 
-Doraoop.import.partitions=oracle_table_partition_201512 --connect 
jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table 
oracle_db.oracle_table --split-by split_column --hive-import --hive-table 
hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN 
TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" 
--fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4 

*2. sqoop import with direct mode, Row Count : 630869*

sqoop import -Dmapredue.job.queuename=queue 
-Doraoop.import.partitions=oracle_table_partition_201512 --connect 
jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table 
oracle_db.oracle_table --split-by split_column --hive-import --hive-table 
hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN 
TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" 
--fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4 
--direct



  was:
Hive/Hadoop row count was not matching on sqoop import of oracle table with 
direct mode, it's perfectly matching without direct mode.

See the below examples.

1. Row Count is varying based on fetch-size 10000 and direct mode.
2. Row count is not consistent, sometimes it is 630869 and sometimes 639924 
etc.,
3. We are pulling data with direct mode for 1 year with monthly partition, data 
mismatch happening for couple of months not for all months.
4. Hive/Hadoop table Count looks good, If we import the data from oracle 
without direct mode.

Actual row count in oracle table for year 2015 and month 12 is : 609924, we are 
getting more records on sqoop import with direct mode : 630869. hive table 
count is matching with oracle count on sqoop import *without direct* mode : 
609924

*1. sqoop import without direct mode, Row Count : 609924*

sqoop import -Dmapredue.job.queuename=queue 
-Doraoop.import.partitions=oracle_table_partition_201512 --connect 
jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table 
oracle_db.oracle_table --split-by split_column --hive-import --hive-table 
hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN 
TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" 
--fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4 

*2. sqoop import with direct mode, Row Count : 630869*

sqoop import -Dmapredue.job.queuename=queue 
-Doraoop.import.partitions=oracle_table_partition_201512 --connect 
jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table 
oracle_db.oracle_table --split-by split_column --hive-import --hive-table 
hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN 
TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" 
--fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4 
*--direct*




> Hive table Row count mismatch(more records in hadoop/hive than oracle) when 
> importing data from oracle using sqoop import with direct mode
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-3240
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3240
>             Project: Sqoop
>          Issue Type: Bug
>          Components: connectors/oracle
>    Affects Versions: 1.4.6
>         Environment: Sqoop 1.4.6-cdh5.8.2
> Oracle 11i
> oracle table Sqoop import with direct mode
>            Reporter: Suresh Venkatesan
>            Priority: Minor
>
> Hive/Hadoop row count was not matching on sqoop import of oracle table with 
> direct mode, it's perfectly matching without direct mode.
> See the below examples.
> 1. Row Count is varying based on fetch-size 10000 and direct mode.
> 2. Row count is not consistent, sometimes it is 630869 and sometimes 639924 
> etc.,
> 3. We are pulling data with direct mode for 1 year with monthly partition, 
> data mismatch happening for couple of months not for all months.
> 4. Hive/Hadoop table Count looks good, If we import the data from oracle 
> without direct mode.
> Actual row count in oracle table for year 2015 and month 12 is : 609924, we 
> are getting more records on sqoop import with direct mode : 630869. hive 
> table count is matching with oracle count on sqoop import *without direct* 
> mode : 609924
> *1. sqoop import without direct mode, Row Count : 609924*
> sqoop import -Dmapredue.job.queuename=queue 
> -Doraoop.import.partitions=oracle_table_partition_201512 --connect 
> jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table 
> oracle_db.oracle_table --split-by split_column --hive-import --hive-table 
> hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN 
> TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" 
> --fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4 
> *2. sqoop import with direct mode, Row Count : 630869*
> sqoop import -Dmapredue.job.queuename=queue 
> -Doraoop.import.partitions=oracle_table_partition_201512 --connect 
> jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table 
> oracle_db.oracle_table --split-by split_column --hive-import --hive-table 
> hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN 
> TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" 
> --fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4 
> --direct



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (SQOOP-3240) Hive table Row count mismatch(more records in hadoop/hive than oracle) when importing data from oracle using sqoop import with direct mode

Reply via email to