[ https://issues.apache.org/jira/browse/SQOOP-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640442#comment-16640442 ]
Devin G. Bost edited comment on SQOOP-1312 at 10/5/18 11:05 PM: ---------------------------------------------------------------- I was able to reproduce the problem. I just ran this code block (and yes, I know, it's a terribly designed unit test because I didn't use mocks or decouple dependencies, etc., but I just wanted to get somewhere on this because Java is not my primary language), and to my surprise, all of the assertions passed. {quote}@Test public void SplitsCorrectly(){ double minVal = 98.0; double maxVal = 2593466.0; // Use this as a hint. May need an extra task if the size doesn't // divide cleanly. int numSplits = 4; double splitSize = (maxVal - minVal) / (double) numSplits; // splitSize should equal 648342.0 if (splitSize < MIN_INCREMENT) { splitSize = MIN_INCREMENT; } String colName = "TESTCOLUMN"; String lowClausePrefix = colName + " >= "; String highClausePrefix = colName + " < "; double curLower = minVal; double curUpper = curLower + splitSize; List<Pair<String, String>> listOfBounds = new ArrayList<Pair<String, String>>(); while (curUpper < maxVal) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(curLower), highClausePrefix + Double.toString(curUpper))); //splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curLower), // highClausePrefix + Double.toString(curUpper))); curLower = curUpper; curUpper += splitSize; } // Catch any overage and create the closed interval for the last split. if (curLower <= maxVal || listOfBounds.size() == 1) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(curLower), highClausePrefix + Double.toString(curUpper))); //splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curUpper), // colName + " <= " + Double.toString(maxVal))); } Assert.assertEquals(listOfBounds.get(0).getLeft(), "TESTCOLUMN >= 98.0"); Assert.assertEquals(listOfBounds.get(1).getLeft(), "TESTCOLUMN >= 648440.0"); Assert.assertEquals(listOfBounds.get(2).getLeft(), "TESTCOLUMN >= 1296782.0"); Assert.assertEquals(listOfBounds.get(3).getLeft(), "TESTCOLUMN >= 1945124.0"); Assert.assertEquals(listOfBounds.get(0).getRight(), "TESTCOLUMN < 648440.0"); Assert.assertEquals(listOfBounds.get(1).getRight(), "TESTCOLUMN < 1296782.0"); Assert.assertEquals(listOfBounds.get(2).getRight(), "TESTCOLUMN < 1945124.0"); Assert.assertEquals(listOfBounds.get(3).getRight(), "TESTCOLUMN < 2593466.0"); }{quote} Then I noticed that I accidentally used this line twice: {quote}listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(*curLower*), highClausePrefix + Double.toString(*curUpper*))); {quote} so I replaced the bottom part with: {quote}listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(*curUpper*), highClausePrefix + Double.toString(*maxVal*))); {quote} (Notice the parameters in the second block should be *curUpper* and *maxVal* instead of *curLower* and *curUpper*.) So, I created a second test method (and yes, I know it needs to be refactored to make it cleaner) that matched the current Sqoop code: {quote}public void SplitsCorrectly2(){ double minVal = 98.0; double maxVal = 2593466.0; // Use this as a hint. May need an extra task if the size doesn't // divide cleanly. int numSplits = 4; double splitSize = (maxVal - minVal) / (double) numSplits; // splitSize should equal 648342.0 if (splitSize < MIN_INCREMENT) { splitSize = MIN_INCREMENT; } String colName = "TESTCOLUMN"; String lowClausePrefix = colName + " >= "; String highClausePrefix = colName + " < "; double curLower = minVal; double curUpper = curLower + splitSize; List<Pair<String, String>> listOfBounds = new ArrayList<Pair<String, String>>(); while (curUpper < maxVal) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(curLower), highClausePrefix + Double.toString(curUpper))); //splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curLower), // highClausePrefix + Double.toString(curUpper))); curLower = curUpper; curUpper += splitSize; } // Catch any overage and create the closed interval for the last split. if (curLower <= maxVal || listOfBounds.size() == 1) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(curUpper), highClausePrefix + Double.toString(maxVal))); //splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curUpper), // colName + " <= " + Double.toString(maxVal))); } Assert.assertEquals(listOfBounds.get(0).getLeft(), "TESTCOLUMN >= 98.0"); Assert.assertEquals(listOfBounds.get(1).getLeft(), "TESTCOLUMN >= 648440.0"); Assert.assertEquals(listOfBounds.get(2).getLeft(), "TESTCOLUMN >= 1296782.0"); Assert.assertEquals(listOfBounds.get(3).getLeft(), "TESTCOLUMN >= 1945124.0"); Assert.assertEquals(listOfBounds.get(0).getRight(), "TESTCOLUMN < 648440.0"); Assert.assertEquals(listOfBounds.get(1).getRight(), "TESTCOLUMN < 1296782.0"); Assert.assertEquals(listOfBounds.get(2).getRight(), "TESTCOLUMN < 1945124.0"); Assert.assertEquals(listOfBounds.get(3).getRight(), "TESTCOLUMN < 2593466.0"); }{quote} and sure enough, it failed with: {quote}java.lang.AssertionError: expected [TESTCOLUMN >= 1945124.0] but found [TESTCOLUMN >= 2593466.0] Expected :TESTCOLUMN >= 1945124.0 Actual :TESTCOLUMN >= 2593466.0 {quote} So, what is the intended purpose of the block (from the original code): {quote}// Catch any overage and create the closed interval for the last split. if (curLower <= maxVal || splits.size() == 1) Unknown macro: \{ splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( lowClausePrefix + Double.toString(*curUpper*), colName + " <= " + Double.toString(*maxVal*))); }{quote} ? was (Author: devin.bost): I was able to reproduce the problem. I just ran this code block (and yes, I know, it's a terribly designed unit test because I didn't use mocks or decouple dependencies, etc., but I just wanted to get somewhere on this because Java is not my primary language), and to my surprise, all of the assertions passed. {quote}@Test public void SplitsCorrectly(){ List<InputSplit> splits = new ArrayList<InputSplit>(); double minVal = 98.0; double maxVal = 2593466.0; // Use this as a hint. May need an extra task if the size doesn't // divide cleanly. int numSplits = 4; double splitSize = (maxVal - minVal) / (double) numSplits; // splitSize should equal 648342.0 if (splitSize < MIN_INCREMENT) { splitSize = MIN_INCREMENT; } String colName = "TESTCOLUMN"; String lowClausePrefix = colName + " >= "; String highClausePrefix = colName + " < "; double curLower = minVal; double curUpper = curLower + splitSize; List<Pair<String, String>> listOfBounds = new ArrayList<Pair<String, String>>(); while (curUpper < maxVal) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(curLower), highClausePrefix + Double.toString(curUpper))); // splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curLower), // highClausePrefix + Double.toString(curUpper))); curLower = curUpper; curUpper += splitSize; } // Catch any overage and create the closed interval for the last split. if (curLower <= maxVal || splits.size() == 1) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(curLower), highClausePrefix + Double.toString(curUpper))); // splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curUpper), // colName + " <= " + Double.toString(maxVal))); } Assert.assertEquals(listOfBounds.get(0).getLeft(), "TESTCOLUMN >= 98.0"); Assert.assertEquals(listOfBounds.get(1).getLeft(), "TESTCOLUMN >= 648440.0"); Assert.assertEquals(listOfBounds.get(2).getLeft(), "TESTCOLUMN >= 1296782.0"); Assert.assertEquals(listOfBounds.get(3).getLeft(), "TESTCOLUMN >= 1945124.0"); Assert.assertEquals(listOfBounds.get(0).getRight(), "TESTCOLUMN < 648440.0"); Assert.assertEquals(listOfBounds.get(1).getRight(), "TESTCOLUMN < 1296782.0"); Assert.assertEquals(listOfBounds.get(2).getRight(), "TESTCOLUMN < 1945124.0"); Assert.assertEquals(listOfBounds.get(3).getRight(), "TESTCOLUMN < 2593466.0"); } {quote} Then I noticed that I accidentally used this line twice: {quote}listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(*curLower*), highClausePrefix + Double.toString(*curUpper*))); {quote} so I replaced the bottom part with: {quote}listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(*curUpper*), highClausePrefix + Double.toString(*maxVal*))); {quote} (Notice the parameters in the second block should be *curUpper* and *maxVal* instead of *curLower* and *curUpper*.) So, I created a second test method (and yes, I know it needs to be refactored to make it cleaner) that matched the current Sqoop code: {quote}@Test public void SplitsCorrectly2(){ List<InputSplit> splits = new ArrayList<InputSplit>(); double minVal = 98.0; double maxVal = 2593466.0; // Use this as a hint. May need an extra task if the size doesn't // divide cleanly. int numSplits = 4; double splitSize = (maxVal - minVal) / (double) numSplits; // splitSize should equal 648342.0 if (splitSize < MIN_INCREMENT) { splitSize = MIN_INCREMENT; } String colName = "TESTCOLUMN"; String lowClausePrefix = colName + " >= "; String highClausePrefix = colName + " < "; double curLower = minVal; double curUpper = curLower + splitSize; List<Pair<String, String>> listOfBounds = new ArrayList<Pair<String, String>>(); while (curUpper < maxVal) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(curLower), highClausePrefix + Double.toString(curUpper))); //splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curLower), // highClausePrefix + Double.toString(curUpper))); curLower = curUpper; curUpper += splitSize; } // Catch any overage and create the closed interval for the last split. if (curLower <= maxVal || splits.size() == 1) { listOfBounds.add(new Pair<String, String>(lowClausePrefix + Double.toString(*curUpper*), highClausePrefix + Double.toString(*maxVal*))); //splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( // lowClausePrefix + Double.toString(curUpper), // colName + " <= " + Double.toString(maxVal))); } Assert.assertEquals(listOfBounds.get(0).getLeft(), "TESTCOLUMN >= 98.0"); Assert.assertEquals(listOfBounds.get(1).getLeft(), "TESTCOLUMN >= 648440.0"); Assert.assertEquals(listOfBounds.get(2).getLeft(), "TESTCOLUMN >= 1296782.0"); Assert.assertEquals(listOfBounds.get(3).getLeft(), "TESTCOLUMN >= 1945124.0"); Assert.assertEquals(listOfBounds.get(0).getRight(), "TESTCOLUMN < 648440.0"); Assert.assertEquals(listOfBounds.get(1).getRight(), "TESTCOLUMN < 1296782.0"); Assert.assertEquals(listOfBounds.get(2).getRight(), "TESTCOLUMN < 1945124.0"); Assert.assertEquals(listOfBounds.get(3).getRight(), "TESTCOLUMN < 2593466.0"); } {quote} and sure enough, it failed with: {quote}java.lang.AssertionError: expected [TESTCOLUMN >= 1945124.0] but found [TESTCOLUMN >= 2593466.0] Expected :TESTCOLUMN >= 1945124.0 Actual :TESTCOLUMN >= 2593466.0 {quote} So, what is the intended purpose of the block (from the original code): {quote}// Catch any overage and create the closed interval for the last split. if (curLower <= maxVal || splits.size() == 1) { splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( lowClausePrefix + Double.toString(*curUpper*), colName + " <= " + Double.toString(*maxVal*))); } {quote} ? > One of mappers does not load data from mySql if double column is used as > split key > ---------------------------------------------------------------------------------- > > Key: SQOOP-1312 > URL: https://issues.apache.org/jira/browse/SQOOP-1312 > Project: Sqoop > Issue Type: Bug > Affects Versions: 1.4.4, 1.4.6, 1.4.7 > Reporter: Jong Ho Lee > Assignee: Jong Ho Lee > Priority: Major > Attachments: splitter.patch, splitter.patch > > > When we used Sqoop to load data from mySQL using one double column as > split-key in Samsung SDS, > the last mapper did not load data from mySQL at all. > The number of mappers was sometimes increased by 1. > I think they were caused by some bugs in FloatSplitter.java > For the last split, lowClausePrefix + Double.toString(curUpper), may be > lowClausePrefix + Double.toString(curLower). > In while (curUpper < maxVal) loop, because of round-off error, > minVal + splitSize * numSplits can be smaller than maxVal. > Therefore, using for-loop would be better. > Attached is a proposed new FloatSplitter.java > {code} > /** > * Licensed to the Apache Software Foundation (ASF) under one > * or more contributor license agreements. See the NOTICE file > * distributed with this work for additional information > * regarding copyright ownership. The ASF licenses this file > * to you under the Apache License, Version 2.0 (the > * "License"); you may not use this file except in compliance > * with the License. You may obtain a copy of the License at > * > * http://www.apache.org/licenses/LICENSE-2.0 > * > * Unless required by applicable law or agreed to in writing, software > * distributed under the License is distributed on an "AS IS" BASIS, > * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > * See the License for the specific language governing permissions and > * limitations under the License. > */ > // modified by Jongho Lee at Samsung SDS. > package org.apache.sqoop.mapreduce.db; > import java.sql.ResultSet; > import java.sql.SQLException; > import java.util.ArrayList; > import java.util.List; > import org.apache.commons.logging.Log; > import org.apache.commons.logging.LogFactory; > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.mapreduce.InputSplit; > import com.cloudera.sqoop.config.ConfigurationHelper; > import com.cloudera.sqoop.mapreduce.db.DBSplitter; > import com.cloudera.sqoop.mapreduce.db.DataDrivenDBInputFormat; > /** > * Implement DBSplitter over floating-point values. > */ > public class FloatSplitter implements DBSplitter { > private static final Log LOG = LogFactory.getLog(FloatSplitter.class); > private static final double MIN_INCREMENT = 10000 * Double.MIN_VALUE; > public List<InputSplit> split(Configuration conf, ResultSet results, > String colName) throws SQLException { > LOG.warn("Generating splits for a floating-point index column. Due to > the"); > LOG.warn("imprecise representation of floating-point values in Java, > this"); > LOG.warn("may result in an incomplete import."); > LOG.warn("You are strongly encouraged to choose an integral split > column."); > List<InputSplit> splits = new ArrayList<InputSplit>(); > if (results.getString(1) == null && results.getString(2) == null) { > // Range is null to null. Return a null split accordingly. > splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( > colName + " IS NULL", colName + " IS NULL")); > return splits; > } > double minVal = results.getDouble(1); > double maxVal = results.getDouble(2); > // Use this as a hint. May need an extra task if the size doesn't > // divide cleanly. > int numSplits = ConfigurationHelper.getConfNumMaps(conf); > double splitSize = (maxVal - minVal) / (double) numSplits; > if (splitSize < MIN_INCREMENT) { > splitSize = MIN_INCREMENT; > } > String lowClausePrefix = colName + " >= "; > String highClausePrefix = colName + " < "; > double curLower = minVal; > double curUpper = curLower + splitSize; > for (int i = 0; i < numSplits - 1; i++) { > // while (curUpper < maxVal) { // changed to for loop > splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( > lowClausePrefix + Double.toString(curLower), > highClausePrefix + Double.toString(curUpper))); > curLower = curUpper; > curUpper += splitSize; > } > // Catch any overage and create the closed interval for the last split. > if (curLower <= maxVal || splits.size() == 1) { > splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( > lowClausePrefix + Double.toString(curLower), > colName + " <= " + Double.toString(maxVal))); > // curUpper -> curLower // changed > } > if (results.getString(1) == null || results.getString(2) == null) { > // At least one extrema is null; add a null split. > splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit( > colName + " IS NULL", colName + " IS NULL")); > } > return splits; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)