(reposted to the list as I originally sent my reply to only Marvin by mistake)
From: Marvin Humphrey [mar...@rectangular.com]
> We use a hash set to perform uniquing, but when there are a lot of unique
> values, the memory requirements of the SortWriter go very high.
>
> Assume that we're sorting string data rather than date time values (so the
> memory occupied by all those unique values is significant). What algorithms
> would you consider for performing the uniquing?

We have a structure that is comparable to the above: an ordered list of
Strings, where updates to a Lucene index require updates to the list. We do
this by keeping a long[] of collator-sorted offsets into a file that holds
all the Strings in insertion order.

When a new String is considered, existence is checked in O(log n) with a
binary search over the offsets. If the String is new, it is appended to the
file and its offset is inserted into the offset array (arraycopy). That is
only usable for a relatively small number of inserts, because of the
arraycopy. SSDs help String-lookup performance a lot here. (A rough sketch of
the lookup/insert is at the end of this mail.)

For building the initial list, we use the terms for a given field, so we know
they are unique: we append them all to a file, keep track of the offsets, and
then sort the offset list by the Strings the offsets point at. With caching
and mergesort (heapsort _kills_ performance in this scenario, as there is no
locality), it performs fairly well. (I'm lying a little about our setup, for
the sake of brevity.)

Memory usage, excluding caching, is 8 bytes * #Strings plus housekeeping. By
adding another level of indirection, storing the offsets in a file and
sorting a list of int pointers to those offsets (argh), the requirement can
be dropped to 4 bytes * #Strings. That doubles the number of seeks, so I
would only recommend it with SSDs and in tight spots.

We also have some code for duplicate reduction (long story): when the list is
sorted, step through the offsets and compare the Strings for the entries at
positions x and x+1. If the Strings are equal, set offset[x+1] = offset[x].
When the iteration has finished, the offsets only point to unique Strings,
and the Strings file contains some unreferenced entries that can be cleaned
up by writing a new Strings file. (Again, a sketch is at the end of this
mail.)

Somewhat related, I'm experimenting with a sliding-window approach to sorting
Lucene terms, which might be usable in your scenario. It is a fairly clean
trade-off between memory and processing time. It is outlined in the thread
"heap memory issues when sorting by a string field", and it should be fairly
straightforward to adapt the algorithm to remove duplicates. The main
requirement is that it is possible to iterate over the Strings multiple
times.

In general, I find that the memory killer in these cases tends to be all the
wrapping, not the data themselves. A String has 38+ bytes of overhead, and if
you're mainly using ASCII, Java's 2-byte char means a lot of always-zero bits
in memory. I don't know if it is feasible, but making something like

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    class LittleString {
      private final byte[] utf8;

      LittleString(String s) {
        utf8 = s.getBytes(StandardCharsets.UTF_8);
      }

      public int hashCode() {
        return Arrays.hashCode(utf8);
      }

      public String toString() {
        return new String(utf8, StandardCharsets.UTF_8);
      }

      public boolean equals(Object o) {
        if (!(o instanceof LittleString)) return false;
        return Arrays.equals(utf8, ((LittleString) o).utf8);
      }
    }

and storing that in your HashSet instead of the Strings directly ought to cut
a fair amount of your memory usage. I don't know how much it would cost in
performance, though, and the HashSet structure itself isn't free.
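For completeness, a tiny (untested) usage sketch of the above, just to show
the intent of storing LittleString rather than String in the set:

    import java.util.HashSet;
    import java.util.Set;

    public class LittleStringDemo {
      public static void main(String[] args) {
        Set<LittleString> unique = new HashSet<LittleString>();
        unique.add(new LittleString("foo"));
        unique.add(new LittleString("foo"));  // duplicate; equals/hashCode on the utf8 bytes catch it
        unique.add(new LittleString("bar"));
        System.out.println(unique.size());    // prints 2
      }
    }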
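Coming back to the offset-array structure from the top of the mail: here is a
rough, untested sketch of the binary-search lookup and arraycopy insert. The
names (OffsetIndex, readStringAt) and the RandomAccessFile/Collator details
are placeholders for illustration, not our actual code.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    // Illustration only: offsets into a Strings file, kept in collator order.
    class OffsetIndex {
      private final Collator collator = Collator.getInstance(Locale.ROOT);
      private final RandomAccessFile strings;  // all Strings, appended in insertion order
      private long[] offsets = new long[16];   // collator-sorted offsets into the file
      private int size = 0;

      OffsetIndex(RandomAccessFile strings) {
        this.strings = strings;
      }

      /** Adds the String if it is not already present. O(log n) lookup, O(n) insert. */
      void addIfAbsent(String value) throws IOException {
        int lo = 0, hi = size - 1;
        while (lo <= hi) {                     // binary search over the sorted offsets
          int mid = (lo + hi) >>> 1;
          int cmp = collator.compare(readStringAt(offsets[mid]), value);
          if (cmp == 0) return;                // already known
          if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        long offset = strings.length();        // append the new String to the file
        strings.seek(offset);
        strings.writeUTF(value);
        if (size == offsets.length) {
          offsets = Arrays.copyOf(offsets, size * 2);
        }
        // make room at position lo; this arraycopy is what limits the insert rate
        System.arraycopy(offsets, lo, offsets, lo + 1, size - lo);
        offsets[lo] = offset;
        size++;
      }

      private String readStringAt(long offset) throws IOException {
        strings.seek(offset);                  // one seek per comparison; SSDs help a lot
        return strings.readUTF();
      }
    }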
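And a minimal sketch of the duplicate-reduction pass, written as another
method on the illustrative OffsetIndex class above:

      /** Collapses runs of equal Strings so every entry points at the first
       *  occurrence. Unreferenced entries in the Strings file can later be
       *  cleaned up by writing a new file. Assumes the offsets are already
       *  collator-sorted. */
      void reduceDuplicates() throws IOException {
        for (int x = 0; x < size - 1; x++) {
          if (readStringAt(offsets[x]).equals(readStringAt(offsets[x + 1]))) {
            offsets[x + 1] = offsets[x];  // both entries now point to the same String
          }
        }
      }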
Regards,
Toke Eskildsen