Hi, Dmitriy Lyubimov
OK, I have submitted a JIRA issue at https://issues.apache.org/jira/browse/MAHOUT-1700
I'm a newbie to Mahout, so what should I do next for this issue? Thank you!

At 2015-04-28 02:16:37, "Dmitriy Lyubimov" <[email protected]> wrote:
>Thank you for this analysis. I can't immediately confirm this since it's
>been a while, but this sounds credible.
>
>Would you mind filing a jira with all this information, and perhaps also
>opening a PR on github?
>
>thank you.
>
>On Mon, Apr 27, 2015 at 4:32 AM, lastarsenal <[email protected]> wrote:
>
>> Hi, All,
>>
>> Recently, I tried Mahout's Hadoop SSVD (mahout-0.9 or mahout-1.0)
>> job. There is a Java heap space OutOfMemory problem in ABtDenseOutJob.
>> I found the cause; the ABtDenseOutJob map code is as follows:
>>
>> protected void map(Writable key, VectorWritable value, Context context)
>>     throws IOException, InterruptedException {
>>
>>   Vector vec = value.get();
>>
>>   int vecSize = vec.size();
>>   if (aCols == null) {
>>     aCols = new Vector[vecSize];
>>   } else if (aCols.length < vecSize) {
>>     aCols = Arrays.copyOf(aCols, vecSize);
>>   }
>>
>>   if (vec.isDense()) {
>>     for (int i = 0; i < vecSize; i++) {
>>       extendAColIfNeeded(i, aRowCount + 1);
>>       aCols[i].setQuick(aRowCount, vec.getQuick(i));
>>     }
>>   } else if (vec.size() > 0) {
>>     for (Vector.Element vecEl : vec.nonZeroes()) {
>>       int i = vecEl.index();
>>       extendAColIfNeeded(i, aRowCount + 1);
>>       aCols[i].setQuick(aRowCount, vecEl.get());
>>     }
>>   }
>>   aRowCount++;
>> }
>>
>> If the input is a RandomAccessSparseVector, which is common with big
>> data, its vec.size() is Integer.MAX_VALUE (2^31 - 1), so aCols = new
>> Vector[vecSize] triggers the OutOfMemory problem. The usual remedy is
>> to enlarge every tasktracker's maximum memory:
>>
>> <property>
>>   <name>mapred.child.java.opts</name>
>>   <value>-Xmx1024m</value>
>> </property>
>>
>> However, if you are NOT the Hadoop administrator or on the ops team,
>> you have no permission to modify that config.
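[Editor's note: a back-of-the-envelope sketch, not part of the thread. Even the reference array alone for `new Vector[Integer.MAX_VALUE]` dwarfs the `-Xmx1024m` task heap quoted above, assuming 8-byte object references; the class name below is hypothetical.]

```java
public class ArraySizeDemo {
    public static void main(String[] args) {
        // A Vector[] of length Integer.MAX_VALUE needs one object reference
        // per slot; with 8-byte references that is just under 16 GiB before
        // a single Vector is even created -- versus a 1 GiB (-Xmx1024m) heap.
        long slots = Integer.MAX_VALUE;            // 2^31 - 1
        long bytes = slots * 8L;                   // reference size assumed 8 bytes
        System.out.println(bytes);                         // prints 17179869176
        System.out.println(bytes / (1024L * 1024 * 1024)); // prints 15 (just under 16 GiB)
    }
}
```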
>> So, I modified the ABtDenseOutJob map code to support the
>> RandomAccessSparseVector situation: I use a HashMap to represent aCols
>> instead of the original Vector[] aCols array. The modified code is as
>> follows:
>>
>> private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
>>
>> protected void map(Writable key, VectorWritable value, Context context)
>>     throws IOException, InterruptedException {
>>
>>   Vector vec = value.get();
>>   int vecSize = vec.size();
>>   if (vec.isDense()) {
>>     for (int i = 0; i < vecSize; i++) {
>>       //extendAColIfNeeded(i, aRowCount + 1);
>>       if (aColsMap.get(i) == null) {
>>         aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>>       }
>>       aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
>>       //aCols[i].setQuick(aRowCount, vec.getQuick(i));
>>     }
>>   } else if (vec.size() > 0) {
>>     for (Vector.Element vecEl : vec.nonZeroes()) {
>>       int i = vecEl.index();
>>       //extendAColIfNeeded(i, aRowCount + 1);
>>       if (aColsMap.get(i) == null) {
>>         aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>>       }
>>       aColsMap.get(i).setQuick(aRowCount, vecEl.get());
>>       //aCols[i].setQuick(aRowCount, vecEl.get());
>>     }
>>   }
>>   aRowCount++;
>> }
>>
>> With this change, the OutOfMemory problem goes away.
>>
>> Thank you!
>>
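[Editor's note: the lazy-allocation idea behind the patch can be illustrated without Mahout. In the sketch below (class and method names are hypothetical, not Mahout API), columns live in a HashMap and a column is only allocated on its first write, so the nominal dimension never drives an allocation; memory use tracks non-zeros only.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the aColsMap approach in the patched mapper.
public class SparseColumnStore {
    // Column index -> (row index -> value); allocated lazily.
    private final Map<Integer, Map<Integer, Double>> cols = new HashMap<>();

    // Record value at (col, row); create the column only on first touch.
    public void set(int col, int row, double value) {
        cols.computeIfAbsent(col, c -> new HashMap<>()).put(row, value);
    }

    // Missing columns/rows read as an implicit zero.
    public double get(int col, int row) {
        Map<Integer, Double> column = cols.get(col);
        return column == null ? 0.0 : column.getOrDefault(row, 0.0);
    }

    public int allocatedColumns() {
        return cols.size();
    }

    public static void main(String[] args) {
        SparseColumnStore store = new SparseColumnStore();
        // Column indices may be anywhere up to Integer.MAX_VALUE,
        // yet only the columns actually written ever exist.
        store.set(Integer.MAX_VALUE - 1, 0, 3.14);
        store.set(7, 2, 1.0);
        System.out.println(store.allocatedColumns());            // prints 2
        System.out.println(store.get(Integer.MAX_VALUE - 1, 0)); // prints 3.14
    }
}
```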
