Hello,

I'm trying to understand how memory is allocated and collected when working with streams. I recently asked a question on Stack Overflow about how to limit memory use when using streams
and got two good answers:
http://stackoverflow.com/questions/18629188/how-to-limit-memory-use-when-using-a-stream

However, I'm seeking a better understanding than the SO format really allows. I have too much data to fit in memory, so I want to use streams to bring the data in from files and databases sequentially, as needed. However, I'm finding that the GC does not collect as I had hoped, so streams are not quite as straightforward a solution as I expected. The following code demonstrates the sort of problem I am experiencing:

  #lang racket
  (require rackunit)

  ; This program fails with out-of-memory errors when the memory limit is set to 128 MB.
  ; It always fails when it comes to testing filtered-nums, regardless of how test-nums?
  ; and test-gen-filtered-nums? have been set. However, test-for/sum-gen-filtered-nums?
  ; also fails if set.

  (define max-num 10000000)
  (define test-nums? #f)
  (define test-gen-filtered-nums? #f)
  (define test-for/sum-gen-filtered-nums? #f)

  (define nums (in-range max-num))
  (define filtered-nums
    (stream-filter (λ (i) (values #t)) nums))

  (define (gen-filtered-nums)
    (stream-filter (λ (i) (values #t)) nums))

  (when test-nums?
    (displayln "Testing nums")
    (check-equal? max-num (stream-length nums)))

  (when test-gen-filtered-nums?
    (displayln "Testing gen-filtered-nums")
    (check-equal? max-num (stream-length (gen-filtered-nums))))

  (when test-for/sum-gen-filtered-nums?
    (displayln "Testing with for/sum-gen-filtered-nums ")
    (check-equal? max-num (for/sum ([i (gen-filtered-nums)]) 1)))


  (displayln "Testing filtered-nums")
  (check-equal? max-num (stream-length filtered-nums))
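
To see the retention directly rather than waiting for an out-of-memory failure, memory use can be sampled around each traversal. The following is only a minimal sketch: `report-memory` and `small-max` are hypothetical names that are not part of the program above, and the figures it prints will vary between runs and Racket versions.

```racket
#lang racket

;; Hypothetical helper (not part of the program above): run a thunk and
;; report the change in memory use, measured after a major collection
;; both before and after the call.
(define (report-memory label thunk)
  (collect-garbage)
  (define before (current-memory-use))
  (define result (thunk))
  (collect-garbage)
  (define after (current-memory-use))
  (printf "~a: result ~a, change in memory use ~a bytes\n"
          label result (- after before))
  result)

;; Assumption: a much smaller size than max-num, so the sketch runs
;; comfortably under a low memory limit.
(define small-max 100000)
(define nums (in-range small-max))

;; Stream bound at module level, as in the failing test.
(define filtered-nums (stream-filter (λ (i) #t) nums))

;; Stream created inside the thunk, as in gen-filtered-nums.
(report-memory "fresh stream"
               (λ () (stream-length (stream-filter (λ (i) #t) nums))))

(report-memory "module-level stream"
               (λ () (stream-length filtered-nums)))
```

Comparing the two reported figures for a few values of `small-max` should show whether the module-level binding keeps the traversed stream reachable.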


I understand that making multiple passes through a big data set is inefficient,
but here I am trying to gain a better understanding of the GC. So this leads
me to a few related questions:

  i.   Why does the GC seem to collect more effectively when the stream is
       created in a function as opposed to in a straight definition? That is,
       test-gen-filtered-nums? passes, although I note that
       test-for/sum-gen-filtered-nums? doesn't.
  ii.  Is stream-filter inappropriate to use with big data sets?
  iii. Is there a better choice than streams within Racket for dealing with
       big data sets coming from disparate sources such as files, databases, etc.?
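
As a point of comparison for question iii, the sketch below makes a single sequential pass over line-oriented input with a plain sequence consumed by a `for` loop, rather than a stream built with stream-filter. It is only an illustration under assumptions: `count-non-empty-lines` is a hypothetical name, and a string port stands in for a real file or database cursor.

```racket
#lang racket

;; Hypothetical example: count the non-empty lines from an input port.
;; in-lines yields one line at a time, and the loop keeps only the
;; running sum, so no collection of lines is built up by this code.
(define (count-non-empty-lines in)
  (for/sum ([line (in-lines in)]
            #:unless (string=? line ""))
    1))

;; A string port stands in for a large file here.
(count-non-empty-lines (open-input-string "a\n\nb\nc\n"))
```

The filtering happens inside the loop via `#:unless`, so no intermediate filtered stream is created; whether this behaves better under a memory limit than the stream version is exactly what I would like to understand.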


Thanks



Lorry


--
vLife Systems Ltd
Registered Office: The Meridian, 4 Copthall House, Station Square, Coventry, 
CV1 2FL
Registered in England and Wales No. 06477649
http://vlifesystems.com

____________________
  Racket Users list:
  http://lists.racket-lang.org/users
