On Tue, Nov 1, 2011 at 7:28 PM, Aaron Turner <synfina...@gmail.com> wrote:
> On Tue, Nov 1, 2011 at 9:17 AM, Tamas Marki <tma...@gmail.com> wrote: > > Hello, > > > > I'm new to the list and also to Cassandra. I found it when I was > searching > > for something to replace our busy mysql server. > > > > One of the things we use the server for is filtering IPs based on a list > of > > IP ranges. These ranges can be small and big, and there are about 50k of > > them in the database. > > > > In mysql this is pretty quick: they are stored as integers, and the query > > basically looks like (say ip is the ip we want to find the all the ranges > > for): > > > > select range from rangelist where ip_start<=ip and ip_end>=ip; > > > > I tried to move this schema to Cassandra, but it turned out to be very > slow, > > even with indexes on both columns. Since I also had to have an EQ > expression > > in the query, I added an indexed text field which was the same for all > rows, > > so the query in cassandra was something like this: > > > > select range from rangelist where type='ip' and ip_start<=ip and > ip_end>=ip; > > > > This was very slow, and I imagine it is because it has to scan through > all > > the rows, making the index useless. > > > > The second thing I tried was to just expand the ranges and store > individual > > IPs as the keys to a column family. This is very fast to query, but the > > problem is that I now have over 2.7 million rows, because some of the > ranges > > are quite large. > > > > As the number of ranges could change, this method could be a problem - > > imagine we add a whole A-class range, it would explode into millions of > > rows. > > > > My question is, is there a more sane way to store this information, while > > still being able to find all the IP ranges that have the given IP in > them? > > > > I've been only dealing with Cassandra for a week or two, so I don't know > > about the inner details of what can be done, but I do have programming > > experience and am not afraid to get my hands dirty, in case it can be > solved > > by writing some extension to Cassandra. > > So, this is just off the top of my head, and I'm not an expert, but > perhaps this will give you some ideas: > > I'm assuming you're talking IPv4. If it's IPv6, you'll need to do some > tweaking > > Create a CF with RowKey of AsciiType and Column Names of LongType. > Values are BytesType. Basically you create one Row for each /8 and > name the row with the network address of that row. So you'd have: > > 1.0.0.0 > 2.0.0.0 > 3.0.0.0 > > etc as row keys > > Then for every /16 under that, store a single name/value pair where > the name is the inet_aton encoded value of the IP address of the /16 > network and the value is a bitmask representing the IP's of the /16. > Set 1 = filter, 0 = don't filter. Basically the first bit would be > 0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc. > > So basically you'll have: > 255 rows > each row with 255 columns > each column would store 8K bytes (since 2^16/8 = 8K) > > Alternatively, you could store 16K columns per row (each column is a > /24) and each column would have 8 bytes. Off the top of my head I'm > not sure which would be faster, but the first solution would be more > disk space efficient. If you need to update your bitmasks regularly, > you're probably better off with the second solution. > > Wrap a little API around this and you have fast and direct access to > know if a given IP should be filtered or not. > > Thanks, these are also good suggestions, but I'll go with Brandon's suggestion first (as it seems to be more suitable for my scenario - I not only need to know if the IP needs to be filtered, but also what range was it found in. -- Tamas Marki