On Fri, 2006-03-17 at 09:24 +1000, Russell Stuart wrote:
> On Thu, 2006-03-16 at 08:51 -0500, jamal wrote:
> > > BTW, in this example, the new hash I suggested would be as good
> > > as the 2.6 case.
> >
> > Yes, if you used 256 buckets per hash table ;->
>
> No - you haven't understood what the new algorithm does.

Ok, so by "new" you were talking about the new hash which takes both.
You keep confusing me because i am no longer sure whether you are
still defending the 2.4.x algorithm or not. At some points you seem to
be, at others not. I take it you are no longer saying that for sets
which are <= 8 bits, correct? For the > 8 bit case, all i am asking is
that you show, across a wide range of parameters, that the old one is
better.
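So that we at least argue about the same code: here is how i remember
the two folds, written from memory as a simplification rather than the
exact cls_u32 source. The function names are mine, byte order is
ignored, and the final bucket pick is modelled here as taking the low
8 bits:

/*
 * Simplified reconstructions of the two folds under discussion.
 * key is the 32-bit word read from the packet, hmask the hash mask.
 */

/* 2.4-style fold: XOR the bytes of the masked key into the low 8
 * bits, so entropy anywhere in the masked word can reach the bucket
 * index - but the masked bits are never shifted down. */
static unsigned int hash_fold_24(unsigned int key, unsigned int hmask)
{
	unsigned int h = key & hmask;

	h ^= h >> 16;
	h ^= h >> 8;
	return h & 0xff;
}

/* 2.6-style fold: shift the masked bits down so they start at bit 0.
 * fshift is the number of trailing zero bits in hmask; only the low
 * 8 bits of the result can ever select a bucket. */
static unsigned int hash_fold_26(unsigned int key, unsigned int hmask,
				 unsigned int fshift)
{
	return ((key & hmask) >> fshift) & 0xff;
}

The whole disagreement is about what these two do when the mask is
narrower than 8 bits, sits at an odd offset within the word, or is
wider than 8 bits.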
> It will get the same performance as the 2.6 version with
> the same memory. In fact for all cases where the number
> of bits used is <= 8, it is _identical_ to the 2.6
> algorithm.

Yes, that's a given, but what is the point? You showed, for one piece
of data, that something was good for the old hash - does that justify
a few more instructions? i don't even remember what it was that you
showed worked better for the old algorithm.

> It is odd that you keep mentioning IPv6. The IPv6 header
> has no fields that are less than 8 bits, and there are
> none that are not aligned on an 8 bit boundary. In fact
> even within the IPv6 addresses, there are no sub-fields
> less than 8 bits. It would be a happy hunting ground for
> the 2.4 algorithm.

I talked about two octets in an IPv6 address. One octet had good
entropy in the first 6 bits, the other had good entropy in the next 5
bits. It has nothing to do with bit alignment.

> Well, this is the crux of the matter, isn't it? You are
> apparently used to looking up well known patterns - probably
> ones you determine. As such, you can arrange these in nice
> logical ranges.

It helps me to optimize for the common case. But i didn't invent that
technique - it comes from the rest of the world of people who put
classifiers together. I explained why earlier and asked you to go and
look at some of the classical papers.

> Guess what? The very thing that started this off was me
> looking up TCP & UDP port numbers. I didn't determine those
> port numbers. They are all over the shop - for me they are
> effectively random data. And they are sparse too, as in
> they occupy 16 bits.

And for that you are entitled to look at the whole 16 bit space.
Observing that the data is random is not going to help you. What
matters is ensuring that the whole 16 bit range is spread evenly
across the buckets.
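And since neither of us will convince the other with prose, here is
the kind of harness that would settle it. Feed it your port list (hex,
one value per line) and look at the spread. It reuses the simplified
folds from the sketch above and assumes a 256 bucket table - it is
illustrative, not the kernel code:

#include <stdio.h>

static unsigned int fold_24(unsigned int key, unsigned int hmask)
{
	unsigned int h = key & hmask;

	h ^= h >> 16;
	h ^= h >> 8;
	return h & 0xff;
}

static unsigned int fold_26(unsigned int key, unsigned int hmask,
			    unsigned int fshift)
{
	return ((key & hmask) >> fshift) & 0xff;
}

static void report(const char *name, const unsigned int *bucket)
{
	unsigned int used = 0, max = 0, total = 0;
	int i;

	for (i = 0; i < 256; i++) {
		total += bucket[i];
		if (bucket[i])
			used++;
		if (bucket[i] > max)
			max = bucket[i];
	}
	printf("%s: %u/256 buckets used, longest chain %u, "
	       "avg chain %.2f over used buckets\n",
	       name, used, max, used ? (double)total / used : 0.0);
}

int main(void)
{
	unsigned int b24[256] = { 0 }, b26[256] = { 0 };
	unsigned int key;

	while (scanf("%x", &key) == 1) {
		b24[fold_24(key, 0xffff)]++;	/* 16 bit mask, e.g. a port */
		b26[fold_26(key, 0xffff, 0)]++;
	}
	report("2.4 fold", b24);
	report("2.6 fold", b26);
	return 0;
}

For uniformly random 16 bit values i would expect both folds to fill
all 256 buckets eventually; the longest chain is what sets your worst
case lookups. What _your_ dataset prints is the only part that
matters.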
> The whole point of the rant that
> followed was to explain to you why in a case like that the
> 2.6 hash runs substantially slower than the 2.4 one. What's
> more, it can't be fixed by simply trading off more memory.
>
> But you seem to be burying your head in the sand and saying
> "such data sets don't exist". Well they do. And I gave
> you a real life one.

I never said they don't exist. I said in such cases you need to assume
the full range will show up - and then just ensure the buckets are
being used effectively.

> Here is another one: I have contemplated at times giving
> priority to Australian traffic. So I went and found myself
> a list of Australian IP addresses. Being allocated by some
> central authority, I expected some nice ordered data set.
> How naive. There are some 400 subnets, and after trying
> to find some pattern I gave up after an hour. Another
> random dataset. The 2.4 algorithm will handle that well.

I have been asking for some data for the last few emails now, and you
keep coming back making bold claims like these without backing them
up.

> When you changed the hash algorithm, you optimised it for
> your particular world - at the expense of people who have
> data sets like mine. Note that users of 2.4 with your
> dataset have a way out - waste a bit of memory and it will
> run just as fast. With a large dataset and 2.6 there is
> no way out. You are probably doubting this as you read it -
> I explain why below.

All i asked for was some data, and i would be more than happy to
support your scheme ;-> And yet all your emails refuse to provide said
data.

> > a) If i built a hash table with 256 buckets, i can guarantee that
> > i will find any host using either scheme in 4 lookups.
> > Except 75% of the buckets will not be used by either.
>
> You miss one point. While you can guarantee it will
> happen in 4 lookups, 2.4 averages slightly more than
> 1 lookup if it hashes the entire value in one hit.
> In 2.6, you are stuck with your 4 lookups, as you say.

i don't know if i followed what you said. If a mask covering the whole
8 bits was used, the schemes would be no different: both would result
in precisely 1 lookup. But this is about not using the whole 8 bits.

> > See the pattern? But this is not the worst part. The nasty part
> > is if i built a newer level of hash tables so i can reduce the
> > lookups, it totally reduces my effectiveness; i need to figure out
> > which buckets are being hit etc.
>
> The pattern is based on the (unstated) premise that you
> are dropping the least significant bits in the byte in
> your hashing function. Perhaps you are assuming you know
> where these random bits live in the address.

Let's please not argue about this. A good hash for something universal
will work for a wide range of masks. I can assure you this is valid.

> I wasn't
> so lucky with my dataset. However let's go with the original
> premise - the data is truly random, and you don't know
> where the bits are. When you drop the least significant bits,
> 2.4 behaves badly. But since the data is random, you are
> better off dropping the upper bits when using 2.4. If
> you do that then the 2.4 and 2.6 hashes perform
> identically when used as you describe above.

You must have missed the part where i defined which bits had entropy.

> In other words, the example didn't show what you intended
> it to show. It showed that if you do the optimal thing
> for 2.6 and use a tree of hash tables, then 2.4 and 2.6 can
> be made to perform identically. If you do the optimal
> thing for 2.4, then on average 2.4 can be made to run
> almost 4 times faster than 2.6.

Where do you pull a number like that from? I dare you to show me a
dataset with > 50K filters where the old algorithm would be faster and
use less memory. And for memory i am talking tens to hundreds of
megabytes. u32 is about making good use of trees of hash tables.
That's the secret sauce. i would never use it any other way. And you
seem to be claiming to not even bother with trees of hashes ;->

> As you can probably gather, all we seem to have done here
> is concluded that there are two sorts of data sets -
> those that are nicely behaved like yours, and those that
> aren't - like mine. You seem to be denying mine exist,
> which is a pity. 2.6 works well for yours. 2.4 works
> well for mine.

Sigh. Ok, i think this is getting excessive. i don't mean to sound
rude or unreasonable, Russell, but your emails are long and sometimes
i lose track of what you are arguing for.

- the 2.4 algorithm performs very poorly for < 8 bits with a standard
  mask, in ALL cases except when you use a lot of memory, most of
  which is _wasted_, in which case it performs equally. There are only
  8 contiguous masks in an 8 bit field, of lengths 1 to 8. That should
  not be hard to prove or disprove, and should be very easy to see
  when you plot it.

- You made the claim that the 2.6 algorithm performs poorly if you had
  a 16 bit mask. You showed one sample of data. i don't even remember
  your sample anymore because you keep bombarding me with a lot of
  claims. I gave you the parameters which will help convince me, and i
  showed you, with a sample using similar parameters, why the old one
  was buggy in the 8 bit case. There are only 16 possible masks for 16
  bits (of lengths 1-16). Why can't you just give me similar results?
  Why do we have to go back and forth with a lot of hand waving when
  we can settle this easily?
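To be concrete, here is a sketch of the experiment i mean: take every
contiguous mask of length 1-16 at every bit offset, push all 2^16 key
values through both folds, size the table to the mask (2^len buckets,
capped at 256) and count how many buckets actually get used. Same
simplified folds as before, with the bucket picked as
fold & (table size - 1) - my assumption, not the kernel's exact
indexing:

#include <stdio.h>
#include <string.h>

static unsigned int fold_24(unsigned int key, unsigned int hmask)
{
	unsigned int h = key & hmask;

	h ^= h >> 16;
	h ^= h >> 8;
	return h & 0xff;
}

static unsigned int fold_26(unsigned int key, unsigned int hmask,
			    unsigned int fshift)
{
	return ((key & hmask) >> fshift) & 0xff;
}

int main(void)
{
	int len, shift;

	printf("len shift used24 used26 (table size = 1<<len, max 256)\n");
	for (len = 1; len <= 16; len++) {
		for (shift = 0; shift + len <= 16; shift++) {
			unsigned int hmask = ((1u << len) - 1) << shift;
			unsigned int div = len < 8 ? 1u << len : 256;
			unsigned char hit24[256], hit26[256];
			unsigned int used24 = 0, used26 = 0, key;

			memset(hit24, 0, sizeof(hit24));
			memset(hit26, 0, sizeof(hit26));
			for (key = 0; key < 0x10000; key++) {
				hit24[fold_24(key, hmask) & (div - 1)] = 1;
				hit26[fold_26(key, hmask, shift) & (div - 1)] = 1;
			}
			for (key = 0; key < div; key++) {
				used24 += hit24[key];
				used26 += hit26[key];
			}
			printf("%3d %5d %6u %6u\n",
			       len, shift, used24, used26);
		}
	}
	return 0;
}

The rows worth staring at are the masks that do not start at the
bottom of a byte. If either of us is wrong it will show up in the
used24/used26 columns; plug in a different fold if you think mine
misrepresents either kernel.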
If you are unable to do this then I will - i just don't have time
until this Sunday. I will not respond to any further emails which do
not contain data; instead i am going to produce mine.

cheers,
jamal