huge dictionary -> bsddb/pickle question

2007-06-15 Thread lazy
Hi,

I have a dictionary that looks something like this:

key1 => {key11 => [1,2], key12 => [6,7], ...}

For lack of better wording, I will call the outer dictionary dict1 and
its value (the inner dictionary) dict2, which is a dictionary of small
fixed-size lists (2 items each).

The key of dict1 is a string and its value is another dictionary
(dict2); dict2 has string keys and values that are lists of 2
integers.

I'm processing a HUGE amount of data (~100M inserts into the
dictionary). I tried 2 options; both seem to be slow, and I'm seeking
suggestions to improve the speed. The code is in bits and pieces, so
I'm just giving the idea.

1) Use bsddb: when an insert is done, the db will have key1 as its
key, and db[key1] will be the pickled value of dict2. After every 1000
inserts I close and reopen the db in order to flush the contents to
disk. Also, when I try to insert a key that is already present, I
unpickle the value, change something in dict2, and then pickle it back
into the bsddb.

2) Instead of pickling the value (dict2) and storing it in bsddb
immediately, I keep dict1 (the outer dictionary) in memory, and when
it reaches 1000 inserts I store it to bsddb as before, pickling each
individual value. The advantage is that when an insert hits a key
that is already present in memory, I just adjust the value and don't
need to unpickle and re-pickle it. If it's not present in memory, I
still need to look it up in bsddb.

This is not getting up to speed even with option 2. Before inserting,
I do some processing on each line, so the bottleneck is not clear to
me (i.e. whether it's in the processing or in inserting into the db),
but I guess it's mainly the pickling and unpickling.
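For reference, the read-modify-write cycle of option 1 can be sketched
like this. This is a minimal sketch, not the poster's actual code: it
uses the standard dbm module as a stand-in (bsddb has since left the
Python standard library), and the upsert helper and file name are
invented for illustration.

```python
import dbm
import os
import pickle
import tempfile

# Illustrative path only; the original post never names its db file.
path = os.path.join(tempfile.mkdtemp(), "dict1.db")

def upsert(db, key, subkey, pair):
    # Read-modify-write of one outer key: unpickle dict2, update it,
    # and pickle the whole inner dictionary back. This per-insert
    # (un)pickling of the entire inner dict is the cost the post is
    # worried about.
    k = key.encode()
    dict2 = pickle.loads(db[k]) if k in db else {}
    dict2[subkey] = pair
    db[k] = pickle.dumps(dict2)

with dbm.open(path, "c") as db:
    upsert(db, "key1", "key11", [1, 2])
    upsert(db, "key1", "key12", [6, 7])
    stored = pickle.loads(db[b"key1"])
```

Note that every update to any subkey of key1 re-serializes the whole
inner dictionary, so the cost grows with the size of dict2, not with
the size of the change.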

Any suggestions will be appreciated :)

-- 
http://mail.python.org/mailman/listinfo/python-list


Newbie question about string(passing by ref)

2007-05-10 Thread lazy
Hi,

I want to pass a string by reference. I understand that strings are
immutable, but I'm not going to change the string in the function; I
just want to avoid the overhead of copying (as with pass-by-value),
because the strings are long and this function will be called over
and over again.
I initially thought of breaking the string into a list and passing
the list instead, but I think there should be a more efficient way.

So is there a way to pass a const reference to a string?

Thanks



Re: Newbie question about string(passing by ref)

2007-05-10 Thread lazy
Thanks all.

> the function you pass it to assigns some other value to the variable,
> that's all it's doing: reassigning a local name to point to somewhere
> else in memory.

So, just to make sure: even if I return a value, no copy is done.
Is that correct?
For example:

def blah():
    long_str = ""
    return long_str

my_str = blah()  # <== no copy is done here; my_str refers to the
same object that long_str was bound to inside the function.

Thanks again.

On May 10, 2:57 pm, Adam Atlas <[EMAIL PROTECTED]> wrote:
> On May 10, 5:47 pm, Adam Atlas <[EMAIL PROTECTED]> wrote:
>
> > On May 10, 5:43 pm, lazy <[EMAIL PROTECTED]> wrote:
>
> > > I want to pass a string by reference.
>
> > Don't worry, all function parameters in Python are passed by reference.
>
> Actually, just to clarify a little bit if you're understanding "pass
> by reference" in the sense used in PHP, sort of C, etc.: In Python,
> you have objects and names. When I say all function parameters are
> passed by reference, I don't mean you're actually passing a reference
> to the *variable*. (Like in PHP where you pass a variable like &$foo
> and the function can change that variable's value in the caller's
> scope.) You can never pass a reference to a variable in Python. But on
> the other hand, passing a parameter never causes the value to be
> copied. Basically, you have a variable that's a reference to an object
> somewhere in memory, and passing it to a function gives that
> function's scope a new pointer to that same location in memory. So if
> the function you pass it to assigns some other value to the variable,
> that's all it's doing: reassigning a local name to point to somewhere
> else in memory.
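A quick way to convince yourself of this (a minimal check of my own,
not from the thread): `is` compares object identity, so if passing or
returning the string copied it, the identities would differ.

```python
def passthrough(s):
    # "s" is just a new name bound to the caller's object;
    # nothing is copied when the argument is passed in.
    return s

big = "x" * 10_000_000  # a long string; building it once is the only cost
result = passthrough(big)

# Passing in and returning never copied the string: both names
# refer to the same object in memory.
assert result is big
```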




url question - extracting (2 types of) domains

2007-05-15 Thread lazy
Hi,
I'm trying to extract the domain name from a URL. Let's say I call
the two forms full_domain and significant_domain (the homepage
domain).

E.g.: url=http://en.wikipedia.org/wiki/IPod ,
full_domain=en.wikipedia.org , significant_domain=wikipedia.org

Using urlsplit (from the urlparse module) I can get full_domain, but
I'm wondering how to get significant_domain. I can't use something
like counting the number of dots, etc.

Some domains may be like foo.bar.co.in (where
significant_domain=bar.co.in).
I have a list of around 40M URLs. It's OK if I fall out in a few
(< 1%) cases, although I agree that even measuring this error rate is
not clear-cut; maybe it's just based on intuition.

Does anybody have clues about existing URL parsers in Python that can
do this? Searching online didn't help me much beyond the
urlparse/urllib modules.

The worst case is to build a table of domain categories (like .com,
.co.il, etc.), look for one of them as a suffix, and extract the part
up to the preceding dot (rather than counting dots), but I'm afraid
that if I do this I might miss some domain category.
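The suffix-table fallback described above can be sketched as follows.
This uses Python 3's urllib.parse (the successor of the urlparse
module mentioned in the post), and the tiny SUFFIXES set is purely
illustrative; a real table, such as Mozilla's Public Suffix List, has
thousands of entries.

```python
from urllib.parse import urlsplit

# Illustrative only: a real public-suffix table is far larger and is
# exactly where the "missing a domain category" risk comes from.
SUFFIXES = {"com", "org", "net", "co.in", "co.il", "co.uk"}

def significant_domain(url):
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Try the longest candidate suffix first ("co.in" before "in"),
    # then keep exactly one label in front of the matched suffix.
    for i in range(len(labels) - 1):
        suffix = ".".join(labels[i + 1:])
        if suffix in SUFFIXES:
            return ".".join(labels[i:])
    return host  # no known suffix matched; fall back to full_domain
```

With this table, "http://en.wikipedia.org/wiki/IPod" yields
wikipedia.org and "http://foo.bar.co.in/" yields bar.co.in, matching
the examples in the post.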



Re: url question - extracting (2 types of) domains

2007-05-16 Thread lazy
Thanks.
Hmm, the URL list is quite huge (40M), so I think a whois lookup per
URL will take a lot of time. But yeah, that seems to be a good way.
I will probably try it with a smaller set (10K) and see how long it
takes. If that doesn't work, I guess I will just build a table of
known domains (.com, .org, .co.il, etc.); then I can find the root
domain (significant_domain) at least for those, and I hope the
majority of them fall into this category :)


On May 16, 12:32 am, Michael Bentley <[EMAIL PROTECTED]>
wrote:
> On May 15, 2007, at 9:04 PM, lazy wrote:
>
>
>
> > Hi,
> > Im trying to extract the domain name from an url. lets say I call
> > it full_domain and significant_domain(which is the homepage domain)
>
> > Eg: url=http://en.wikipedia.org/wiki/IPod,
> > full_domain=en.wikipedia.org ,significant_domain=wikipedia.org
>
> > Using urlsplit (of urlparse module), I will be able to get the
> > full_domain, but Im wondering how to get significant_domain. I will
> > not be able to use like counting the number of dots. etc
>
> > Some domains maybe like foo.bar.co.in (where significant_domain=
> > bar.co.in)
> > I have around 40M url list. Its ok, if I fallout in few(< 1%) cases.
> > Although I agree that measuring this error rate itself is not clear,
> > maybe just based on ituition.
>
> > Anybody have clues about existing url parsers in python to do this.
> > Searching online couldnt help me much other than
> > the urlparse/urllib module.
>
> > Worst case is to try to build a table of domain
> > categories(like .com, .co.il etc and look for it in the suffix rather
> > than counting dots and just extract the part till the preceding dot),
> > but Im afraid if I do this, I might miss some domain category.
>
> The best way I know to get an *authoritive* answer is to start with
> the full_domain and try a whois lookup.  If it returns no records,
> drop everything before the first dot and try again.  Repeat until you
> get a good answer -- this is the significant_domain.
>
> hth,
> Michael




Berkeley Db. How to iterate over a large number of keys "quickly"

2007-08-02 Thread lazy
I have a Berkeley DB and I'm using the bsddb module to access it. The
DB is quite huge (anywhere from 2-30GB). I want to iterate over the
keys serially.
I tried something basic like

for key in db.keys()

but this takes a lot of time. I guess Python is trying to get the
list of all keys first and probably keeps it in memory. Is there a
way to avoid this, since I just want to access the keys serially? I
mean, is there a way I can tell Python to not load all keys, but try
to access it as the loop progresses (like in a linked list). I could
find any accessor methonds on bsddb to this with my initial search.
I am guessing a BTree might be a good choice here, but since the DBs
were opened with hashopen when they were written, I'm not able to use
btopen when I want to iterate over the DB.



Re: Berkeley Db. How to iterate over a large number of keys "quickly"

2007-08-02 Thread lazy
Sorry, just a small correction:

> a way I can tell Python to not load allkeys, but try to access it as
> the loop progresses(like in a linked list). I could find any accessor
> methonds on bsddb to this with my initial search.

I meant, "I couldn't find any accessor methods on bsddb to do
this (i.e. accessing it like in a linked list) with my initial search"

> I am guessing BTree might be a good choice here, but since while the
> Dbs were written it was opened using hashopen, Im not able to use
> btopen when I want to iterate over the db.




Re: Berkeley Db. How to iterate over a large number of keys "quickly"

2007-08-02 Thread lazy
On Aug 2, 1:42 pm, Ian Clark <[EMAIL PROTECTED]> wrote:
> lazy wrote:
> > I have a berkely db and Im using the bsddb module to access it. The Db
> > is quite huge (anywhere from 2-30GB). I want to iterate over the keys
> > serially.
> > I tried using something basic like
>
> > for key in db.keys()
>
> > but this takes lot of time. I guess Python is trying to get the list
> > of all keys first and probbaly keep it in memory. Is there a way to
> > avoid this, since I just want to access keys serially. I mean is there
> > a way I can tell Python to not load all keys, but try to access it as
> > the loop progresses(like in a linked list). I could find any accessor
> > methonds on bsddb to this with my initial search.
> > I am guessing BTree might be a good choice here, but since while the
> > Dbs were written it was opened using hashopen, Im not able to use
> > btopen when I want to iterate over the db.
>
> db.iterkeys()
>
> Looking at the doc for bsddb objects[1] it mentions that "Once
> instantiated, hash, btree and record objects support the same methods as
> dictionaries." Then looking at the dict documentation[2] you'll find the
> dict.iterkeys() method that should do what you're asking.
>
> Ian
>
> [1]http://docs.python.org/lib/bsddb-objects.html
> [2]http://docs.python.org/lib/typesmapping.html


Thanks. I tried using db.first() and then db.next() for subsequent
keys; that seems to be faster. Thanks for the pointers.
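The first/next cursor walk generalizes into a generator. Since the
old bsddb module is no longer in modern Python, the sketch below
models its cursor API with a small in-memory stand-in; the
iter_records wrapper itself is the pattern that avoids ever
materializing the full db.keys() list.

```python
class HashDBStandIn:
    # Minimal stand-in for bsddb's cursor API: first() positions at
    # the start of the table, next() advances, and both return a
    # (key, value) pair. The real bsddb raises an exception at the
    # end of the table; we model that with StopIteration here.
    def __init__(self, data):
        self._items = sorted(data.items())
        self._pos = -1

    def first(self):
        self._pos = 0
        return self._items[0]

    def next(self):
        self._pos += 1
        if self._pos >= len(self._items):
            raise StopIteration
        return self._items[self._pos]

def iter_records(db):
    # Walk the table one record at a time through the cursor, so the
    # full key list is never built in memory.
    try:
        record = db.first()
    except (KeyError, IndexError):
        return  # empty table
    while True:
        yield record
        try:
            record = db.next()
        except StopIteration:
            return

db = HashDBStandIn({"a": 1, "b": 2, "c": 3})
keys = [k for k, _ in iter_records(db)]
```

The same shape works for any store exposing a first/next (or cursor)
interface; memory use stays constant no matter how large the table
is.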
