On Sunday, January 31, 2016 at 12:13:31 AM UTC-6, Scotty C wrote:
> that's what i did. so new performance data. this is with bytes instead of
> strings for data on the hard drive but bignums in the hash still.
>
> as a single large file and a hash with 203 buckets for 26.6 million
> records the data rate is 98408/sec.
>
> when i split and go with 11 small
just found a small mistake in the documentation: can you find it?
(numerator q) → integer?
q : rational?
Coerces q to an exact number, finds the numerator of the number expressed in
its simplest fractional form, and returns this number coerced to the exactness
of q.
(den
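For anyone checking the quoted wording against actual behavior, here is a quick REPL comparison of numerator and denominator on exact and inexact input; this is ordinary racket/base behavior, not something taken from the original post:

(numerator 6/4)                         ; => 3
(denominator 6/4)                       ; => 2
(numerator 2.25)                        ; => 9.0, inexact in, inexact out
(denominator 2.25)                      ; => 4.0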
> Yes. You probably do need to convert the files. Your original
> coding likely is not [easily] compatible with binary I/O.
that's what i did. so new performance data. this is with bytes instead of
strings for data on the hard drive but bignums in the hash still.
as a single large file and a hash with 203 buckets for 26.6 million records the data rate is 98408/sec.
> i get the feeling that i will need to read the entire file as i used to read
> it taking each record and doing the following:
> convert the string record to a bignum record
> convert the bignum record into a byte string
> write the byte string to a new data file
>
> does that seem right?
never
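A minimal sketch of those three steps, assuming the text records are decimal digit strings, one per line, and that each key fits in 16 bytes; the function names and the fixed width are assumptions, not code from this thread:

#lang racket

;; string record -> bignum -> 16-byte big-endian byte string
(define (record-string->key-bytes s)
  (define n (string->number s))          ; assumes decimal text records
  (define bs (make-bytes 16 0))
  (for ([i (in-range 16)])
    (bytes-set! bs (- 15 i)
                (bitwise-and (arithmetic-shift n (* -8 i)) 255)))
  bs)

;; read the old text file a line at a time and write the binary file
(define (convert-file in-path out-path)
  (call-with-output-file out-path
    (lambda (out)
      (call-with-input-file in-path
        (lambda (in)
          (for ([line (in-lines in)])
            (unless (string=? line "")   ; skip a trailing blank line, if any
              (write-bytes (record-string->key-bytes line) out))))))
    #:exists 'replace))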
> However, if you have implemented your own, you can still call
> `equal-hash-code`
yes, my own hash.
i think the equal-hash-code will work.
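A minimal sketch of what that could look like for a hand-rolled table keyed on byte strings; the bucket count here is only a placeholder:

#lang racket

(define bucket-count 6000000)            ; placeholder

;; equal-hash-code accepts byte strings; modulo with a positive divisor
;; always gives a non-negative index
(define (key->index key)                 ; key : bytes?
  (modulo (equal-hash-code key) bucket-count))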
> my plan right now is to rework my current hash so that it runs byte strings
> instead of bignums.
i have a new issue. i wrote my data as char and end records with 'return. i use
(read-line x 'return) and the first record is 15 chars. when i use
(read-bytes-line x 'return) i get 23 bytes. i have
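That gap is what you would expect if some of the characters in the record sit above code point 127, since those take more than one byte when the text port encodes them as UTF-8. A quick check with a made-up two-character string:

(define s (string (integer->char 955) #\a))   ; a lambda character plus "a"
(string-length s)                             ; => 2
(bytes-length (string->bytes/utf-8 s))        ; => 3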
> > question for you all. right now i use modulo on my bignums. i know i
> > can't do that to a byte string. i'll figure something out. if any of
> > you know how to do this, can you post a method?
> >
>
> I'm not sure what you're asking exactly.
i'm talking about getting the hash index of a key.
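One way to get that index without rebuilding a bignum at all is to reduce modulo the bucket count byte by byte. A minimal sketch, assuming the key is a byte string and n is an exact positive bucket count:

#lang racket

;; gives the same answer as converting the whole key to one big integer
;; and then taking modulo n, because the running value is reduced mod n
;; after every byte
(define (bytes-modulo key n)
  (for/fold ([h 0]) ([b (in-bytes key)])
    (modulo (+ (* h 256) b) n)))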
ok, had time to run my hash on my one test file
'(611 1 1 19 24783208 4.19)
this means
# buckets
% buckets empty
non empty bucket # keys least
non empty bucket # keys most
total number of keys
average number of keys per non empty bucket
it took 377 sec.
original # records is 26570359, so 6.7% duplicates.
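For readers following along, here is a sketch of how a summary in that shape could be produced, assuming the table is a vector of buckets and each bucket is a list of keys; that representation is an assumption, not Scotty's actual code:

#lang racket

(define (bucket-stats table)
  (define sizes (for/list ([b (in-vector table)]) (length b)))
  (define non-empty (filter positive? sizes))
  (define total (apply + sizes))
  (list (vector-length table)                                ; # buckets
        (exact->inexact
         (* 100 (/ (- (vector-length table) (length non-empty))
                   (vector-length table))))                  ; % buckets empty
        (apply min non-empty)                                ; fewest keys in a non-empty bucket
        (apply max non-empty)                                ; most keys in a non-empty bucket
        total                                                ; total keys
        (exact->inexact (/ total (length non-empty)))))      ; average per non-empty bucket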
On Thursday, January 28, 2016 at 11:36:50 PM UTC-6, Brandon Thomas wrote:
> On Thu, 2016-01-28 at 20:32 -0800, Scotty C wrote:
> > > I think you understand perfectly.
> > i'm coming around
> >
> > > You said the keys are 128-bit (16 byte) values. You can store one key directly in a byte string of length 16.
> Way back in this thread you implied that you had extremely large FILES
> containing FIXED SIZE RECORDS, from which you needed
> to FILTER DUPLICATE records based on the value of a FIXED SIZE KEY
> field.
this is mostly correct. the data is state and state associated data on the
fringe. hence th
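For concreteness, a minimal sketch of that filtering step for one piece that does fit in memory, assuming 32-byte records whose first 16 bytes are the key, and using a built-in mutable hash in place of the custom table; the sizes and names are assumptions:

#lang racket

(define RECORD-SIZE 32)
(define KEY-SIZE 16)

(define (filter-duplicates in-path out-path)
  (define seen (make-hash))                     ; equal?-based, so byte-string keys work
  (call-with-output-file out-path
    (lambda (out)
      (call-with-input-file in-path
        (lambda (in)
          (let loop ()
            (define rec (read-bytes RECORD-SIZE in))
            (unless (eof-object? rec)
              (define key (subbytes rec 0 KEY-SIZE))
              (unless (hash-ref seen key #f)    ; keep the first record seen for each key
                (hash-set! seen key #t)
                (write-bytes rec out))
              (loop))))))
    #:exists 'replace))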
> I think you understand perfectly.
i'm coming around
> You said the keys are 128-bit (16 byte) values. You can store one key
> directly in a byte string of length 16.
yup
> So instead of using a vector of pointers to individual byte strings,
> you would allocate a single byte string of length
what's been bothering me is trying to get the data into 16 bytes in a byte
string of that length. i couldn't get that to work, so i gave up and just shoved
the data into 25 bytes. here's a bit of code. i think it's faster than my
bignum stuff.
(define p (bytes 16 5 1 12 6 24 17 9 2 22 4 10 13 18
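A minimal sketch of the flat layout the quoted advice describes: one large byte string holding n fixed-size keys side by side, indexed by arithmetic instead of pointer chasing; the names are placeholders:

#lang racket

(define KEY-SIZE 16)

(define (make-key-store n)
  (make-bytes (* n KEY-SIZE) 0))

(define (store-key! store i key)         ; key must be a KEY-SIZE byte string
  (bytes-copy! store (* i KEY-SIZE) key))

(define (key-ref store i)
  (subbytes store (* i KEY-SIZE) (* (add1 i) KEY-SIZE)))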
> You claim you want filtering to be as fast as possible. If that were
> so, you would not pack multiple keys (or features thereof) into a
> bignum but rather would store the keys individually.
chasing pointers? no, you're thinking about doing some sort of byte-append and
subbytes type of thing.
> Is it important to retain that sorting? Or is it just informational?
it's important
> Then you're not using the hash in a conventional manner ... else the
> filter entries would be unique ... and we really have no clue what
> you're actually doing. So any suggestions we give you are shots in
> the dark.
On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:
> What is this other field on which the file is sorted?
this field is the cost in operators to arrive at the key value
> WRT a set of duplicates: are you throwing away all duplicates? Keeping
> the 1st one encountered? Something else?
ok brandon, that's a thought. build the hash on the hard drive at the time of
data creation. you mention collision resolution. so let me build my hash on the
hard drive using my 6 million buckets but increase the size of each bucket from
5 slots to 20. right? i can't exactly recreate my vector/b
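A minimal sketch of the seek arithmetic for an on-disk table with fixed-size buckets, using the figures mentioned here (6 million buckets, 20 slots, 25-byte records) only as placeholders:

#lang racket

(define BUCKETS 6000000)
(define SLOTS 20)
(define RECORD-SIZE 25)
(define BUCKET-BYTES (* SLOTS RECORD-SIZE))

(define (bucket-offset idx)
  (* idx BUCKET-BYTES))

;; read one whole bucket from an already-open file input port
(define (read-bucket in idx)
  (file-position in (bucket-offset idx))
  (read-bytes BUCKET-BYTES in))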
alright george, i'm open to new ideas. here's what i've got going. running 64
bit linux mint OS on a 2 core laptop with 2 gb of ram. my key is 128 bits with
~256 bits per record. so my 1 gb file contains ~63 million records and ~32
million keys. about 8% will be dupes, leaving me with ~30 million unique keys.
gneuner2 (george), you are overthinking this thing. my test data of 1 gb is
but a small sample file. i can't even hash that small 1 gb at the time of data
creation. the hashed data won't fit in ram. at the time i put the redundant
data on the hard drive, i do some constant time sorting so that
neil van dyke, i have used the system function before but had forgotten what it
was called and couldn't find it as a result in the documentation. my problem
with using the system function is that i need 2 versions of it: windoz and
linux. the copy-port function is a write once, use across multiple platforms solution.
robby findler, you the man. i like the copy-port idea. i incorporated it and it
is nice and fast and easily fit into the existing code.
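A minimal sketch of the portable copy that copy-port allows, with no OS-specific shelling out; the function name is a placeholder:

#lang racket   ; copy-port comes from racket/port, which this language includes

(define (copy-one-file from to)
  (call-with-output-file to
    (lambda (out)
      (call-with-input-file from
        (lambda (in)
          (copy-port in out))))
    #:exists 'replace))

The same shape appends several pieces into one file if the output is opened with #:exists 'append instead.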
here's what i'm doing. i make a large, say 1 gb file with small records and
there is some redundancy in the records. i will use a hash to identify
duplicates by reading the file back in a record at a time but the file is too
large to hash so i split it. the resultant files (10) are about 100 mb each.
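A minimal sketch of one way to do that split so that any two duplicate records are guaranteed to land in the same piece: route each record by a hash of its key modulo the number of pieces. The record layout, sizes, and file names below are assumptions, not details from the post:

#lang racket

(define PIECES 10)
(define RECORD-SIZE 32)
(define KEY-SIZE 16)

(define (split-file in-path)
  (define outs
    (for/vector ([i (in-range PIECES)])
      (open-output-file (format "piece-~a.dat" i) #:exists 'replace)))
  (call-with-input-file in-path
    (lambda (in)
      (let loop ()
        (define rec (read-bytes RECORD-SIZE in))
        (unless (eof-object? rec)
          (define idx (modulo (equal-hash-code (subbytes rec 0 KEY-SIZE)) PIECES))
          (write-bytes rec (vector-ref outs idx))
          (loop)))))
  (for ([o (in-vector outs)]) (close-output-port o)))

Each piece can then be deduplicated on its own with an in-memory table, because a duplicate key can never end up in a different piece from its twin.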