Joerg Schuster wrote:
Thanks to all. This thread shows again that Python's best feature is
comp.lang.python.
from comp.lang import python ;)
Paul
--
http://mail.python.org/mailman/listinfo/python-list
Simon Brunning wrote:
> I couldn't resist. ;-)
Me neither...
> import random
>
> def randomLines(filename, lines=1):
> selected_lines = list(None for line_no in xrange(lines))
>
> for line_index, line in enumerate(open(filename)):
> for selected_line_index in xrange(lines):
>
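The quoted function is cut off above; for reference, a complete Python 3 sketch of the same multi-slot reservoir-sampling idea (function name and details are illustrative, not necessarily Simon's original):

```python
import random

def random_lines(filename, k=1):
    """Pick k lines from a file of unknown length in a single pass.

    Each slot independently keeps line i with probability 1/(i+1), so
    every line is equally likely to end up in any slot. As noted later
    in the thread, the same line can be picked for more than one slot.
    """
    selected = [None] * k
    with open(filename) as f:
        for line_index, line in enumerate(f):
            for slot in range(k):
                # Replace this slot's line with probability
                # 1 / (number of lines seen so far).
                if random.randint(0, line_index) == 0:
                    selected[slot] = line
    return selected
```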
On Fri, 11 Mar 2005 06:59:33 +0100, Heiko Wundram <[EMAIL PROTECTED]> wrote:
> On Tuesday 08 March 2005 15:55, Simon Brunning wrote:
> > Ah, but that's the clever bit; it *doesn't* store the whole list -
> > only the selected lines.
>
> But that means that it'll only read several lines from the fi
On Tuesday 08 March 2005 15:55, Simon Brunning wrote:
> Ah, but that's the clever bit; it *doesn't* store the whole list -
> only the selected lines.
But that means that it'll only read several lines from the file, never do a
shuffle of the whole file content... When you'd want to shuffle the fil
On Thu, 10 Mar 2005 14:37:25 +0100, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> There. Factor 10. That's what I call optimization...
The simplest approach is even faster:
C:\>python -m timeit -s "from itertools import repeat" "[None for i in
range(1)]"
100 loops, best of 3: 2.53 msec per loop
C:\>p
Simon Brunning wrote:
On Tue, 8 Mar 2005 14:13:01 +0000, Simon Brunning wrote:
selected_lines = list(None for line_no in xrange(lines))
Just a short note on this line. If lines is really large, its much faster
to use
from itertools import repeat
selected_lines = list(repeat(None, lines))
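For what it's worth, all three spellings build the same list; in modern Python the multiplication form is usually the shortest and fastest for immutable fillers (a small illustrative sketch, not a benchmark):

```python
from itertools import repeat

n = 4
via_comprehension = [None for _ in range(n)]  # the original spelling
via_repeat = list(repeat(None, n))            # the itertools suggestion
via_multiply = [None] * n                     # usually fastest for immutables

assert via_comprehension == via_repeat == via_multiply == [None] * 4
```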
On Tue, 8 Mar 2005 15:49:35 +0100, Heiko Wundram <[EMAIL PROTECTED]> wrote:
> Problem being: if the file the OP is talking about really is 80GB in size, and
> you consider a sentence to have 80 bytes on average (it's likely to have less
> than that), that makes 10^9 sentences in the file. Now, mult
On Tuesday 08 March 2005 15:28, Simon Brunning wrote:
> This has the advantage that every line had the same chance of being
> picked regardless of its length. There is the chance that it'll pick
> the same line more than once, though.
Problem being: if the file the OP is talking about really is 80
On Tue, 8 Mar 2005 14:13:01 +0000, Simon Brunning
<[EMAIL PROTECTED]> wrote:
> On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu wrote:
> > As far as I can tell, what you ultimately want is to be able to extract
> > a random ("representative?") subset of sentences.
>
> If this is what's wanted, then p
On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu wrote:
> As far as I can tell, what you ultimately want is to be able to extract
> a random ("representative?") subset of sentences.
If this is what's wanted, then perhaps some variation on this cookbook
recipe might do the trick:
http://aspn.activest
Raymond Hettinger <[EMAIL PROTECTED]> wrote:
> >>> from random import random
> >>> out = open('corpus.decorated', 'w')
> >>> for line in open('corpus.uniq'):
> print >> out, '%.14f %s' % (random(), line),
>
> >>> out.close()
>
> sort corpus.decorated | cut -c 18- > corpus.randomized
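The same decorate / sort / undecorate shuffle can be sketched entirely in Python 3 for inputs that fit in memory (function name and paths are illustrative; for an 80G corpus, handing the sort to the external `sort` as above is the whole point):

```python
import random

def shuffle_lines(in_path, out_path):
    # Decorate: pair each line with a random key, like the
    # '%.14f %s' prefix written to corpus.decorated.
    with open(in_path) as f:
        decorated = [(random.random(), line) for line in f]
    # Sort on the random key -- the in-memory analogue of `sort`.
    decorated.sort()
    # Undecorate: drop the keys, like `cut -c 18-`.
    with open(out_path, 'w') as out:
        out.writelines(line for _, line in decorated)
```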
Ve
[Joerg Schuster]
> I am looking for a method to "shuffle" the lines of a large file.
>
> I have a corpus of sorted and "uniqed" English sentences that has been
> produced with (1):
>
> (1) sort corpus | uniq > corpus.uniq
>
> corpus.uniq is 80G large.
Since the corpus is huge, the python portion s
[Heiko Wundram]
> Replying to oneself is bad, [...]
Not necessarily. :-)
--
François Pinard http://pinard.progiciels-bpi.ca
[Joerg Schuster]
> I am looking for a method to "shuffle" the lines of a large file.
If speed and space are not a concern, I would be tempted to presume that
this can be organised without too much difficulty. However, looking for
speed handling a big file, while keeping equiprobability of all po
Title: RE: shuffle the lines of a large file
[Joerg Schuster]
#- Thanks to all. This thread shows again that Python's best feature is
#- comp.lang.python.
QOTW! QOTW!
. Facundo
Bitácora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.c
Joerg Schuster wrote:
Thanks to all. This thread shows again that Python's best feature is
comp.lang.python.
+1 QOTW
STeVe
Thanks to all. This thread shows again that Python's best feature is
comp.lang.python.
Jörg
On 7 Mar 2005 05:36:32 -0800, rumours say that "Joerg Schuster"
<[EMAIL PROTECTED]> might have written:
>Hello,
>
>I am looking for a method to "shuffle" the lines of a large file.
[snip]
>So, it would be very useful to do one of the following things:
>
>- produce corpus.uniq in a such a way tha
Joerg Schuster wrote:
Unfortunately, none of the machines that I may use has 80G RAM.
So, using a dictionary will not help.
Any ideas?
Why don't you index the file? I would store the byte-offsets of the
beginning of each line into an index file. Then you can generate a
random number from 1 to Wh
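That indexing idea can be sketched as follows (a minimal in-memory version; the poster's point is that the byte offsets, unlike the 80G of text, are small enough to keep around):

```python
import random

def build_line_index(path):
    """Record the byte offset at which each line starts."""
    offsets = []
    with open(path, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def random_line(path, offsets):
    """Seek straight to a uniformly chosen line and read it."""
    with open(path, 'rb') as f:
        f.seek(random.choice(offsets))
        return f.readline()
```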
As far as I can tell, what you ultimately want is to be able to extract
a random ("representative?") subset of sentences. Given the huge size
of data, I would suggest not randomizing the file, but randomizing
accesses to the file. E.g. (sorry for off-the-cuff pseudo python):
[adjust 8196 == 2**13
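The off-the-cuff pseudocode is cut short above (and note 2**13 is 8192, not 8196). One way to flesh out the random-access idea is to seek to a random byte, discard the partial line landed in, and return the next complete one — a hedged sketch of that approach:

```python
import os
import random

def random_sentence(path):
    """Jump to a random byte offset and return the next complete line.

    Caveat: this slightly favours lines that follow long lines; for a
    corpus of short, similar-length sentences the bias is small.
    """
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(random.randrange(size))
        f.readline()          # discard the (probably partial) line
        line = f.readline()   # take the next full line
        if not line:          # fell off the end: wrap to the start
            f.seek(0)
            line = f.readline()
    return line
```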
"Joerg Schuster" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> I am looking for a method to "shuffle" the lines of a large file.
Of the top of my head: decorate, randomize, undecorate.
Prepend a suitable large random number or hash to each
line and then use sort. You could prepen
Replying to oneself is bad, but although the program works, I never intended
to use a shelve to store the data. Better to use anydbm.
So, just replace:
import shelve
by
import anydbm
and
lineindex = shelve.open("test.idx")
by
lineindex = anydbm.open("test.idx","c")
Keep the rest as is.
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Alex
Stapleton
Sent: 07 March 2005 14:17
To: Joerg Schuster; python-list@python.org
Subject: RE: shuffle the lines of a large file
Not tested this, run it (or some derivation thereof) over the output to get
increasing randomness
"Joerg Schuster" <[EMAIL PROTECTED]> writes:
>Hello,
>I am looking for a method to "shuffle" the lines of a large file.
>I have a corpus of sorted and "uniqed" English sentences that has been
>produced with (1):
>(1) sort corpus | uniq > corpus.uniq
>corpus.uniq is 80G large. The fact that eve
On Monday 07 March 2005 14:36, Joerg Schuster wrote:
> Any ideas?
The following program should do the trick (filenames are hardcoded, look at
top of file):
### shuffle.py
import random
import shelve
# Open external files needed for data storage.
lines = open("test.dat","r")
lineindex = shelve.open("test.idx")
Not tested this, run it (or some derivation thereof) over the output to get
increasing randomness.
You will want to keep max_buffered_lines as high as possible really I
imagine. If shuffle() is too intensive
you could iterate over the buffer several times randomly removing and
printing lines unti
Joerg Schuster wrote:
Hello,
I am looking for a method to "shuffle" the lines of a large file.
I have a corpus of sorted and "uniqed" English sentences that has been
produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only
once in c