huge dictionary -> bsddb/pickle question
Hi, I have a dictionary something like this:

    key1 => {key11 => [1, 2], key12 => [6, 7], ...}

For lack of better wording, I will call the outer dictionary dict1 and its value (the inner dictionary) dict2, which is a dictionary of small fixed-size lists (2 items). The key of the outer dictionary is a string and the value is another dictionary (let's say dict2); dict2 has a string key and a list of 2 integers.

I'm processing HUGE data (~100M inserts into the dictionary). I tried 2 options; both seem to be slow, and I'm seeking suggestions to improve the speed. The code is in bits and pieces, so I'm just giving the idea.

1) Use bsddb. When an insert is done, the db will have key1 as the key, and the value (i.e. db[key1]) will be the pickled value of dict2. After 1000 inserts I close and open the db, in order to flush the contents to disk. Also, when I try to insert a key that is already present, I unpickle the value, change something in dict2, and then pickle it back into the bsddb.

2) Instead of pickling the value (dict2) and storing it in bsddb immediately, I keep dict1 (the outer dictionary) in memory, and when it reaches 1000 inserts, I store it to bsddb as before, pickling each individual value. The advantage of this is that when an insert is done for a key that is already in memory, I can adjust the value without unpickling and pickling it back. If it's not present in memory, I still need to look it up in bsddb.

This is not getting up to speed even with option 2. Before inserting, I do some processing on the line, so the bottleneck is not clear to me (i.e. whether it's in the processing or in inserting into the db), but I guess it's mainly the pickling and unpickling. Any suggestions will be appreciated :)

--
http://mail.python.org/mailman/listinfo/python-list
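The batching idea in option 2 can be sketched roughly as follows. This is a minimal illustration, not the original code: it uses Python 3's `dbm` module as a stand-in for `bsddb` (which is no longer in the standard library), the `BATCH` size and the `flush()` helper are made-up names, and the merge-on-flush logic is one guess at what "adjust the value" means:

```python
import dbm
import os
import pickle
import tempfile

BATCH = 1000  # flush the in-memory buffer every BATCH inserts

def flush(buffer, db):
    """Merge the in-memory outer dict (dict1) into the on-disk store,
    pickling each inner dict (dict2) individually."""
    for key, inner in buffer.items():
        try:
            stored = pickle.loads(db[key])  # merge with what is already on disk
        except KeyError:
            stored = {}
        stored.update(inner)
        db[key] = pickle.dumps(stored, protocol=pickle.HIGHEST_PROTOCOL)
    buffer.clear()

path = os.path.join(tempfile.mkdtemp(), "store")
db = dbm.open(path, "c")

buffer = {}
# Stand-in for the real input stream of (key, subkey, [a, b]) records:
records = [("k1", "k11", [1, 2]), ("k1", "k12", [6, 7]), ("k2", "k21", [3, 4])]
for i, (key, subkey, pair) in enumerate(records):
    buffer.setdefault(key, {})[subkey] = pair
    if (i + 1) % BATCH == 0:
        flush(buffer, db)

flush(buffer, db)  # flush whatever is left in the buffer
db.close()
```

One design note: flushing in batches only pays off if repeated keys tend to arrive close together; otherwise most flushes still hit the unpickle/repickle path on disk.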
Newbie question about string (passing by ref)
Hi, I want to pass a string by reference. I understand that strings are immutable, and I'm not going to change the string in the function; I just want to avoid the overhead of copying (as with pass-by-value), because the strings are long and this function will be called over and over again. I initially thought of breaking the string into a list and passing the list instead, but I think there should be a more efficient way. So is there a way to pass a const reference to a string? Thanks
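For what it's worth, the no-copy behaviour is easy to check with `id()`, which returns an object's identity; a small sketch (the names are made up):

```python
# A 10 MB string; if Python copied it on every call, this would be wasteful.
big = "x" * (10 * 1024 * 1024)

def takes_a_string(s):
    # 's' is just a new local name bound to the caller's object;
    # nothing is copied when the argument is passed.
    return id(s)

# The object seen inside the function is the very same one the caller holds.
assert takes_a_string(big) == id(big)
```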
Re: Newbie question about string (passing by ref)
Thanks all.

> the function you pass it to assigns some other value to the variable,
> that's all it's doing: reassigning a local name to point to somewhere
> else in memory.

So, just to make sure: even if I return a value, there is no copy done. Is that correct? For example:

    def blah():
        long_str = ""
        return long_str

    my_str = blah()   # <== no copy is done here; my_str points to the
                      #     same memory where long_str was created

Thanks again.

On May 10, 2:57 pm, Adam Atlas <[EMAIL PROTECTED]> wrote:
> On May 10, 5:47 pm, Adam Atlas <[EMAIL PROTECTED]> wrote:
> > On May 10, 5:43 pm, lazy <[EMAIL PROTECTED]> wrote:
> > > I want to pass a string by reference.
> >
> > Don't worry, all function parameters in Python are passed by reference.
>
> Actually, just to clarify a little bit, if you're understanding "pass
> by reference" in the sense used in PHP, sort of C, etc.: In Python,
> you have objects and names. When I say all function parameters are
> passed by reference, I don't mean you're actually passing a reference
> to the *variable*. (Like in PHP, where you pass a variable like &$foo
> and the function can change that variable's value in the caller's
> scope.) You can never pass a reference to a variable in Python. But on
> the other hand, passing a parameter never causes the value to be
> copied. Basically, you have a variable that's a reference to an object
> somewhere in memory, and passing it to a function gives that
> function's scope a new pointer to that same location in memory. So if
> the function you pass it to assigns some other value to the variable,
> that's all it's doing: reassigning a local name to point to somewhere
> else in memory.
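A quick way to confirm this is to have the function return the `id()` it sees internally and compare it with the caller's; a minimal sketch along the lines of the example above:

```python
def blah():
    long_str = "some very long string " * 1000
    # Return both the string and its identity as seen inside the function.
    return long_str, id(long_str)

my_str, inner_id = blah()
# The caller's name is bound to the very object created inside blah();
# returning does not copy any more than passing does.
assert id(my_str) == inner_id
```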
url question - extracting (2 types of) domains
Hi, I'm trying to extract the domain name from a URL. Let's say I call the two parts full_domain and significant_domain (the homepage domain).

Eg: url=http://en.wikipedia.org/wiki/IPod, full_domain=en.wikipedia.org, significant_domain=wikipedia.org

Using urlsplit (from the urlparse module), I am able to get the full_domain, but I'm wondering how to get significant_domain. I will not be able to use something like counting the number of dots, etc. Some domains may be like foo.bar.co.in (where significant_domain=bar.co.in).

I have a list of around 40M URLs. It's OK if I get it wrong in a few (< 1%) cases, although I agree that measuring this error rate itself is not clear-cut; maybe it's just based on intuition.

Does anybody have clues about existing URL parsers in Python to do this? Searching online couldn't help me much beyond the urlparse/urllib modules.

The worst case is to build a table of domain categories (like .com, .co.il, etc.), look for one of them in the suffix rather than counting dots, and just extract the part up to the preceding dot; but I'm afraid that if I do this, I might miss some domain category.
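The "worst case" suffix-table approach sketched above can be written roughly like this. Python 3's `urllib.parse` is shown in place of the Python 2 `urlparse` module, and the tiny `MULTI_LABEL_SUFFIXES` table is a made-up placeholder; a real run would need a much larger list (e.g. the Mozilla public-suffix list):

```python
from urllib.parse import urlsplit  # the 'urlparse' module in Python 2

# Deliberately tiny and incomplete: a stand-in for a real public-suffix table.
MULTI_LABEL_SUFFIXES = {"co.in", "co.il", "co.uk", "com.au"}

def significant_domain(url):
    host = urlsplit(url).hostname       # full_domain, e.g. en.wikipedia.org
    labels = host.split(".")
    # If the last two labels form a known multi-label suffix (like co.in),
    # the significant domain keeps three labels; otherwise two.
    if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

assert significant_domain("http://en.wikipedia.org/wiki/IPod") == "wikipedia.org"
assert significant_domain("http://foo.bar.co.in/page") == "bar.co.in"
```

The error rate then comes down entirely to how complete the suffix table is, which matches the worry in the post: any multi-label suffix missing from the table silently degrades to the two-label guess.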
Re: url question - extracting (2 types of) domains
Thanks. Hmm, the URL list is quite huge (40M); I think a whois lookup for each will take a lot of time, I guess. But yeah, that seems to be a good way. Probably I will try it with a smaller set (10K) and see the time it takes. If it's too slow, I guess I will just build a table of known domains (.com, .org, .co.il, etc.) and then I can find the root domain (significant_domain) at least for those, and I hope the majority of them fall into this category :)

On May 16, 12:32 am, Michael Bentley <[EMAIL PROTECTED]> wrote:
> On May 15, 2007, at 9:04 PM, lazy wrote:
>
> > Hi,
> > Im trying to extract the domain name from an url. lets say I call
> > it full_domain and significant_domain(which is the homepage domain)
> >
> > Eg: url=http://en.wikipedia.org/wiki/IPod,
> > full_domain=en.wikipedia.org ,significant_domain=wikipedia.org
> >
> > Using urlsplit (of urlparse module), I will be able to get the
> > full_domain, but Im wondering how to get significant_domain. I will
> > not be able to use like counting the number of dots. etc
> >
> > Some domains maybe like foo.bar.co.in (where significant_domain=
> > bar.co.in)
> > I have around 40M url list. Its ok, if I fallout in few(< 1%) cases.
> > Although I agree that measuring this error rate itself is not clear,
> > maybe just based on ituition.
> >
> > Anybody have clues about existing url parsers in python to do this.
> > Searching online couldnt help me much other than
> > the urlparse/urllib module.
> >
> > Worst case is to try to build a table of domain
> > categories(like .com, .co.il etc and look for it in the suffix rather
> > than counting dots and just extract the part till the preceding dot),
> > but Im afraid if I do this, I might miss some domain category.
>
> The best way I know to get an *authoritative* answer is to start with
> the full_domain and try a whois lookup. If it returns no records,
> drop everything before the first dot and try again. Repeat until you
> get a good answer -- this is the significant_domain.
> hth,
> Michael
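Michael's trim-until-whois-answers loop might look roughly like this. The `has_whois_record` predicate is hypothetical (a real one could shell out to the system `whois` tool and check whether any records came back); here a fake in-memory registry stands in for it so the sketch is self-contained and doesn't touch the network:

```python
def significant_domain(full_domain, has_whois_record):
    """Trim leading labels until the remainder has a whois record.

    'has_whois_record' is injected rather than hard-coded: in real use it
    would be a (slow) network lookup, which is exactly the 40M-URL concern
    raised in the thread.
    """
    labels = full_domain.split(".")
    # Stop before reaching the bare TLD: keep at least two labels.
    while len(labels) > 2 and not has_whois_record(".".join(labels)):
        labels.pop(0)  # drop everything before the first dot and retry
    return ".".join(labels)

# Fake registry standing in for whois: pretend only these are registered.
registered = {"wikipedia.org", "bar.co.in"}
lookup = lambda domain: domain in registered

assert significant_domain("en.wikipedia.org", lookup) == "wikipedia.org"
assert significant_domain("foo.bar.co.in", lookup) == "bar.co.in"
```

Note the fallback: for a domain the registry knows nothing about, the loop bottoms out at the last two labels, which is the same two-label guess a suffix table would make.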
Berkeley DB. How to iterate over a large number of keys "quickly"
I have a Berkeley DB and I'm using the bsddb module to access it. The DB is quite huge (anywhere from 2-30GB). I want to iterate over the keys serially.

I tried using something basic like

    for key in db.keys()

but this takes a lot of time. I guess Python is trying to get the list of all keys first and probably keeps it in memory. Is there a way to avoid this, since I just want to access the keys serially? I mean, is there a way I can tell Python to not load all keys, but to access them as the loop progresses (like in a linked list)? I could find any accessor methonds on bsddb to this with my initial search.

I am guessing a BTree might be a good choice here, but since the DBs were opened with hashopen when they were written, I'm not able to use btopen when I want to iterate over the db.
Re: Berkeley DB. How to iterate over a large number of keys "quickly"
Sorry, just a small correction:

> a way I can tell Python to not load all keys, but try to access it as
> the loop progresses(like in a linked list). I could find any accessor
> methonds on bsddb to this with my initial search.

I meant, "I couldn't find any accessor methods on bsddb to do this (i.e. accessing keys like in a linked list) with my initial search."

> I am guessing BTree might be a good choice here, but since while the
> Dbs were written it was opened using hashopen, Im not able to use
> btopen when I want to iterate over the db.
Re: Berkeley DB. How to iterate over a large number of keys "quickly"
On Aug 2, 1:42 pm, Ian Clark <[EMAIL PROTECTED]> wrote:
> lazy wrote:
> > I have a berkely db and Im using the bsddb module to access it. The Db
> > is quite huge (anywhere from 2-30GB). I want to iterate over the keys
> > serially.
> > I tried using something basic like
> >
> > for key in db.keys()
> >
> > but this takes lot of time. I guess Python is trying to get the list
> > of all keys first and probbaly keep it in memory. Is there a way to
> > avoid this, since I just want to access keys serially. I mean is there
> > a way I can tell Python to not load all keys, but try to access it as
> > the loop progresses(like in a linked list). I could find any accessor
> > methonds on bsddb to this with my initial search.
> > I am guessing BTree might be a good choice here, but since while the
> > Dbs were written it was opened using hashopen, Im not able to use
> > btopen when I want to iterate over the db.
>
> db.iterkeys()
>
> Looking at the doc for bsddb objects[1] it mentions that "Once
> instantiated, hash, btree and record objects support the same methods as
> dictionaries." Then looking at the dict documentation[2] you'll find the
> dict.iterkeys() method that should do what you're asking.
>
> Ian
>
> [1] http://docs.python.org/lib/bsddb-objects.html
> [2] http://docs.python.org/lib/typesmapping.html

Thanks. I tried using db.first() and then db.next() for the subsequent keys; that seems to be faster. Thanks for the pointers.
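The first()/next() pattern can be wrapped in a generator so the rest of the code just iterates over keys. This is a sketch against a made-up FakeDB exposing the same cursor-style interface, so it runs without a real bsddb handle; with real bsddb you would catch its DBNotFoundError (which in later versions derives from KeyError) when the cursor passes the last key:

```python
def iter_keys(db):
    """Yield keys one at a time via first()/next(), never materializing
    the whole key list in memory the way db.keys() does."""
    try:
        key, _value = db.first()    # first() returns a (key, value) pair
    except KeyError:                # empty database
        return
    while True:
        yield key
        try:
            key, _value = db.next()
        except KeyError:            # cursor moved past the last key
            return

# Made-up stand-in with the same first()/next() interface, for illustration:
class FakeDB:
    def __init__(self, items):
        self._items = list(items)
        self._pos = 0
    def first(self):
        if not self._items:
            raise KeyError("empty")
        self._pos = 1
        return self._items[0]
    def next(self):
        if self._pos >= len(self._items):
            raise KeyError("exhausted")
        item = self._items[self._pos]
        self._pos += 1
        return item

db = FakeDB([("a", 1), ("b", 2), ("c", 3)])
assert list(iter_keys(db)) == ["a", "b", "c"]
```

With a hash-opened DB the keys come back in hash order rather than sorted order, but for a purely serial pass that usually doesn't matter.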