Re: shuffle the lines of a large file

2005-03-12 Thread paul koelle
Joerg Schuster wrote: Thanks to all. This thread shows again that Python's best feature is comp.lang.python. from comp.lang import python ;) Paul -- http://mail.python.org/mailman/listinfo/python-list

Re: shuffle the lines of a large file

2005-03-11 Thread Peter Otten
Simon Brunning wrote: > I couldn't resist. ;-) Me neither... > import random > > def randomLines(filename, lines=1): > selected_lines = list(None for line_no in xrange(lines)) > > for line_index, line in enumerate(open(filename)): > for selected_line_index in xrange(lines): >
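
The quoted function is cut off above, so the following is only a minimal Python 3 sketch of the idea it appears to implement: one pass over the file, with each slot in the result kept as its own one-line reservoir. The name `random_lines` and the replacement test are assumptions, not the poster's code. Because the slots are independent, the same line can be picked more than once.

```python
import random


def random_lines(path, count=1):
    """Pick `count` lines from `path` in one pass; slots are chosen independently."""
    selected = [None] * count
    with open(path) as handle:
        for index, line in enumerate(handle):
            for slot in range(count):
                # Replace the slot with probability 1/(index + 1), so once the
                # whole file is read every line is equally likely to occupy it.
                if random.randrange(index + 1) == 0:
                    selected[slot] = line
    return selected
```

Only `count` lines are held in memory at any time, which is the point made elsewhere in the thread about not storing the whole list.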

Re: shuffle the lines of a large file

2005-03-11 Thread Simon Brunning
On Fri, 11 Mar 2005 06:59:33 +0100, Heiko Wundram <[EMAIL PROTECTED]> wrote: > On Tuesday 08 March 2005 15:55, Simon Brunning wrote: > > Ah, but that's the clever bit; it *doesn't* store the whole list - > > only the selected lines. > > But that means that it'll only read several lines from the fi

Re: shuffle the lines of a large file

2005-03-10 Thread Heiko Wundram
On Tuesday 08 March 2005 15:55, Simon Brunning wrote: > Ah, but that's the clever bit; it *doesn't* store the whole list - > only the selected lines. But that means that it'll only read several lines from the file, never do a shuffle of the whole file content... When you'd want to shuffle the fil

Re: shuffle the lines of a large file

2005-03-10 Thread Simon Brunning
On Thu, 10 Mar 2005 14:37:25 +0100, Stefan Behnel <[EMAIL PROTECTED]> > There. Factor 10. That's what I call optimization... The simplest approach is even faster: C:\>python -m timeit -s "from itertools import repeat" "[None for i in range(1)]" 100 loops, best of 3: 2.53 msec per loop C:\>p

Re: shuffle the lines of a large file

2005-03-10 Thread Stefan Behnel
Simon Brunning wrote: On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning wrote: selected_lines = list(None for line_no in xrange(lines)) Just a short note on this line. If lines is really large, it's much faster to use from itertools import repeat selected_lines = list(repeat(None, lines))
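
For comparison, here is a small Python 3 sketch of three ways to preallocate the list, assuming `lines` is an integer count as in the quoted function (the value 100000 is arbitrary). It only shows that the results are identical; relative speed is best measured with `timeit` on the machine in question.

```python
from itertools import repeat

lines = 100_000  # hypothetical slot count, chosen only for illustration

from_genexp = list(None for _ in range(lines))   # the original spelling
from_repeat = list(repeat(None, lines))          # the itertools suggestion
from_multiply = [None] * lines                   # plain sequence repetition
```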

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning
On Tue, 8 Mar 2005 15:49:35 +0100, Heiko Wundram <[EMAIL PROTECTED]> wrote: > Problem being: if the file the OP is talking about really is 80GB in size, and > you consider a sentence to have 80 bytes on average (it's likely to have less > than that), that makes 10^9 sentences in the file. Now, mult
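
The size estimate in the quoted paragraph can be worked through directly. The 80 GB and 80 bytes figures are from the post; the 8 bytes per index entry is an assumption added here to show what a per-line index would cost, since the rest of the calculation is cut off.

```python
file_size = 80 * 10**9       # 80 GB, the stated size of corpus.uniq
avg_sentence = 80            # bytes per sentence, the estimate in the post
sentences = file_size // avg_sentence    # 1_000_000_000, i.e. 10^9

offset_bytes = 8             # assumption: one 64-bit byte offset per line
index_size = sentences * offset_bytes    # 8_000_000_000 bytes, about 8 GB
```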

Re: shuffle the lines of a large file

2005-03-08 Thread Heiko Wundram
On Tuesday 08 March 2005 15:28, Simon Brunning wrote: > This has the advantage that every line had the same chance of being > picked regardless of its length. There is the chance that it'll pick > the same line more than once, though. Problem being: if the file the OP is talking about really is 80

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning
On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning <[EMAIL PROTECTED]> wrote: > On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu wrote: > > As far as I can tell, what you ultimately want is to be able to extract > > a random ("representative?") subset of sentences. > > If this is what's wanted, then p

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning
On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu wrote: > As far as I can tell, what you ultimately want is to be able to extract > a random ("representative?") subset of sentences. If this is what's wanted, then perhaps some variation on this cookbook recipe might do the trick: http://aspn.activest

Re: shuffle the lines of a large file

2005-03-08 Thread Nick Craig-Wood
Raymond Hettinger <[EMAIL PROTECTED]> wrote: > >>> from random import random > >>> out = open('corpus.decorated', 'w') > >>> for line in open('corpus.uniq'): > print >> out, '%.14f %s' % (random(), line), > > >>> out.close() > > sort corpus.decorated | cut -c 18- > corpus.randomized Ve
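
A Python 3 rendering of the decorate, sort, undecorate pipeline quoted above might look like this. It is a sketch for small inputs: `sorted()` stands in for the external `sort` command, and the function names are made up for illustration. The slice at 17 mirrors `cut -c 18-`, since `'%.14f '` always produces a 16-character key plus one space.

```python
from random import random


def decorate(lines):
    """Prefix each line with a fixed-width random key, as in the quoted loop."""
    return ['%.14f %s' % (random(), line) for line in lines]


def undecorate(decorated):
    """Drop the 17-character prefix (16-character key plus one space)."""
    return [line[17:] for line in decorated]


def shuffle_lines(lines):
    # sorted() replaces the sort(1) step; for an 80G corpus the sort would
    # have to run on disk, which is what the shell pipeline does.
    return undecorate(sorted(decorate(lines)))
```

The fixed key width matters: it lets plain text sorting order the keys correctly and lets the prefix be cut off by column.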

Re: shuffle the lines of a large file

2005-03-07 Thread Raymond Hettinger
[Joerg Schuster] > I am looking for a method to "shuffle" the lines of a large file. > > I have a corpus of sorted and "uniqed" English sentences that has been > produced with (1): > > (1) sort corpus | uniq > corpus.uniq > > corpus.uniq is 80G large. Since the corpus is huge, the python portion s

Re: shuffle the lines of a large file

2005-03-07 Thread François Pinard
[Heiko Wundram] > Replying to oneself is bad, [...] Not necessarily. :-) -- François Pinard http://pinard.progiciels-bpi.ca

Re: shuffle the lines of a large file

2005-03-07 Thread François Pinard
[Joerg Schuster] > I am looking for a method to "shuffle" the lines of a large file. If speed and space are not a concern, I would be tempted to presume that this can be organised without too much difficulty. However, looking for speed handling a big file, while keeping equiprobability of all po

RE: shuffle the lines of a large file

2005-03-07 Thread Batista, Facundo
[Joerg Schuster] #- Thanks to all. This thread shows again that Python's best feature is #- comp.lang.python. QOTW! QOTW! Facundo Bitácora De Vuelo: http://www.taniquetil.com.ar/plog PyAr - Python Argentina: http://pyar.decode.c

Re: shuffle the lines of a large file

2005-03-07 Thread Steven Bethard
Joerg Schuster wrote: Thanks to all. This thread shows again that Python's best feature is comp.lang.python. +1 QOTW STeVe

Re: shuffle the lines of a large file

2005-03-07 Thread Joerg Schuster
Thanks to all. This thread shows again that Python's best feature is comp.lang.python. Jörg

Re: shuffle the lines of a large file - filelist.py (0/1)

2005-03-07 Thread TZOTZIOY
On 7 Mar 2005 05:36:32 -0800, rumours say that "Joerg Schuster" <[EMAIL PROTECTED]> might have written: >Hello, > >I am looking for a method to "shuffle" the lines of a large file. [snip] >So, it would be very useful to do one of the following things: > >- produce corpus.uniq in such a way tha

Re: shuffle the lines of a large file

2005-03-07 Thread Warren Postma
Joerg Schuster wrote: Unfortunately, none of the machines that I may use has 80G RAM. So, using a dictionary will not help. Any ideas? Why don't you index the file? I would store the byte-offsets of the beginning of each line into an index file. Then you can generate a random number from 1 to Wh
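
One way to picture the index-file idea is a flat file of fixed-width offsets, so that entry `i` of the index sits at byte `8 * i`. The sketch below is Python 3 and rests on assumptions of its own: the 64-bit little-endian layout, the 0-based line numbers, and all function names are illustrative and not taken from the post.

```python
import random
import struct

OFFSET = struct.Struct('<Q')  # assumption: one unsigned 64-bit offset per line


def build_index(data_path, index_path):
    """Write the starting byte offset of every line; return the line count."""
    count = 0
    offset = 0
    with open(data_path, 'rb') as data, open(index_path, 'wb') as index:
        for line in data:
            index.write(OFFSET.pack(offset))
            offset += len(line)
            count += 1
    return count


def read_line(data, index, line_no):
    """Fetch line `line_no` (0-based) by looking up its offset in the index."""
    index.seek(line_no * OFFSET.size)
    (offset,) = OFFSET.unpack(index.read(OFFSET.size))
    data.seek(offset)
    return data.readline()


def random_line(data_path, index_path, count):
    """Return one uniformly chosen line, given the line count from build_index."""
    with open(data_path, 'rb') as data, open(index_path, 'rb') as index:
        return read_line(data, index, random.randrange(count))
```

Every line is equally likely regardless of its length, at the cost of one sequential pass to build the index and two seeks per lookup.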

Re: shuffle the lines of a large file

2005-03-07 Thread gry
As far as I can tell, what you ultimately want is to be able to extract a random ("representative?") subset of sentences. Given the huge size of data, I would suggest not randomizing the file, but randomizing accesses to the file. E.g. (sorry for off-the-cuff pseudo python): [adjust 8192 == 2**13
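
Since the pseudo-code is cut off, here is a separate minimal Python 3 sketch of what randomizing accesses could look like: seek to a random byte and take the next complete line. The function name and the wrap-around at end of file are assumptions, and the file is assumed to be non-empty.

```python
import os
import random


def line_near_random_offset(path):
    """Return the first complete line after a random byte position."""
    size = os.path.getsize(path)
    with open(path, 'rb') as handle:
        handle.seek(random.randrange(size))
        handle.readline()            # discard the (probably partial) line
        line = handle.readline()
        if not line:                 # ran off the end: wrap to the first line
            handle.seek(0)
            line = handle.readline()
    return line
```

No index and no full pass are needed, but the choice is not uniform: a line is picked with probability proportional to the length of the line before it.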

Re: shuffle the lines of a large file

2005-03-07 Thread Richard Brodie
"Joerg Schuster" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > I am looking for a method to "shuffle" the lines of a large file. Of the top of my head: decorate, randomize, undecorate. Prepend a suitable large random number or hash to each line and then use sort. You could prepen

Re: shuffle the lines of a large file

2005-03-07 Thread Heiko Wundram
Replying to oneself is bad, but although the program works, I never intended to use a shelve to store the data. Better to use anydbm. So, just replace: import shelve by import anydbm and lineindex = shelve.open("test.idx") by lineindex = anydbm.open("test.idx","c") Keep the rest as is.
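
In Python 3 the `anydbm` module is named `dbm`, so the same swap would be spelled as below. This is only a sketch of storing a line index that way; the function name, the key scheme (line number as text), and the stored value (byte offset as text) are assumptions, because the rest of the program is not shown in this entry.

```python
import dbm


def build_line_index(data_path, index_path):
    """Store line number -> byte offset in a dbm file; return the line count."""
    count = 0
    offset = 0
    with open(data_path, 'rb') as data, dbm.open(index_path, 'c') as index:
        for line in data:
            # dbm stores bytes/str only, so both numbers are written as text.
            index[str(count)] = str(offset)
            offset += len(line)
            count += 1
    return count
```

Values come back as bytes, so a lookup would read `int(index['42'])` before seeking in the data file.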

RE: shuffle the lines of a large file

2005-03-07 Thread Alex Stapleton
-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Alex Stapleton Sent: 07 March 2005 14:17 To: Joerg Schuster; python-list@python.org Subject: RE: shuffle the lines of a large file Not tested this, run it (or some derivation thereof) over the output to get increasing randomness

Re: shuffle the lines of a large file

2005-03-07 Thread Eddie Corns
"Joerg Schuster" <[EMAIL PROTECTED]> writes: >Hello, >I am looking for a method to "shuffle" the lines of a large file. >I have a corpus of sorted and "uniqed" English sentences that has been >produced with (1): >(1) sort corpus | uniq > corpus.uniq >corpus.uniq is 80G large. The fact that eve

Re: shuffle the lines of a large file

2005-03-07 Thread Heiko Wundram
On Monday 07 March 2005 14:36, Joerg Schuster wrote: > Any ideas? The following program should do the trick (filenames are hardcoded, look at top of file): ### shuffle.py import random import shelve # Open external files needed for data storage. lines = open("test.dat","r") lineindex = shelve.

RE: shuffle the lines of a large file

2005-03-07 Thread Alex Stapleton
Not tested this, run it (or some derivation thereof) over the output to get increasing randomness. You will want to keep max_buffered_lines as high as possible really I imagine. If shuffle() is too intensive you could iterate over the buffer several times randomly removing and printing lines unti
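
The code this message refers to is not shown here, so the following is a guess at its shape in Python 3: read up to `max_buffered_lines` at a time, shuffle that block in memory, and emit it. Only the name `max_buffered_lines` comes from the post; everything else is an assumption.

```python
import random


def buffered_shuffle(lines, max_buffered_lines):
    """Yield `lines` shuffled within consecutive blocks of `max_buffered_lines`."""
    buffer = []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_buffered_lines:
            random.shuffle(buffer)
            yield from buffer
            buffer = []
    random.shuffle(buffer)   # flush the final, possibly short, block
    yield from buffer
```

In this form a line never leaves its block, so a sorted corpus stays roughly sorted after one pass. Repeated passes would probably need a different block size or starting offset each time to mix lines across block boundaries.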

Re: shuffle the lines of a large file

2005-03-07 Thread Kent Johnson
Joerg Schuster wrote: Hello, I am looking for a method to "shuffle" the lines of a large file. I have a corpus of sorted and "uniqed" English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only once in c