On Tue, Mar 22, 2011 at 4:00 PM, Avram <aav...@me.com> wrote:
> Hi,
>
> I (still) consider myself new to clojure.  I am trying to read a 37Mb
> file that will grow 500k every 2 days. I don't consider this to be
> a large enough input file to merit using Hadoop and I'd like to process
> it in Clojure in an efficient, speedy, and idiomatic way.
>
> I simply want something akin to a transpose, where the input looks
> like this:
> ( [ a1 b1 c1 d1 ] [ a2 b2 c2 d2 ] [ a3 b3 c3 d3 ])
>
> …and the output looks like this:
>
> [ [ a1 a2 a3 ] [ b1 b2 b3 ] [ c1 c2 c3 ] [ d1 d2 d3 ] ]
>
> Gleaning what I can from various sources and cobbling them together, I
> have the following below, which works for small input but not for the
> intended file sizes (and larger) I'd like it to be able to handle.

You'll need to avoid holding onto the head of your line-seq, which
means making multiple passes over the data: one for the as, one for
the bs, and so on. The output would then be a lazy seq of lazy seqs,
one per column.
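
A rough sketch of the multi-pass idea, assuming tab-delimited lines and
a column count known up front. The names column and transpose-file are
made up for illustration. Each column is realized inside its own
with-open (so the reader can be closed safely), while the outer
sequence stays lazy and reads one column per step:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn column
  "Hypothetical helper: makes one full pass over filename and returns
  the idx-th tab-separated field of every line (nil if a line is short)."
  [filename idx]
  (with-open [rdr (io/reader filename)]
    ;; doall forces the pass to finish before with-open closes the reader
    (doall (map #(nth (str/split % #"\t" -1) idx nil)
                (line-seq rdr)))))

(defn transpose-file
  "Lazy seq of columns. Uses explicit lazy-seq recursion rather than
  (map ... (range n)) so that chunking does not read 32 columns at once."
  [filename n-cols]
  (letfn [(cols [i]
            (lazy-seq
              (when (< i n-cols)
                (cons (column filename i) (cols (inc i))))))]
    (cols 0)))
```

Only one column is built per step, but anything that holds the head of
the result still retains every column realized so far, so consume or
write out each column as you go.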

> (defn data-lines
>    "Returns data lines in file (i.e. all lines that do not start with
> '#')
>      Returns: sequence containing data lines"
>    [filename]
>    (drop-while is-comment? (line-seq (reader filename))))

The doc string doesn't match the function: drop-while only skips the
comment lines at the top of the file, so the two agree only if no line
starts with # after the first line that doesn't. You may want to use
remove instead of drop-while here, or to change the doc string.
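
To make the difference concrete, here is a small comparison on
in-memory lines. The is-comment? below is only a guess at what the
original helper does (the post doesn't show it), and sample-lines is
made-up data:

```clojure
(defn is-comment?
  "Assumed stand-in for the helper in the original post:
  true when the line starts with #."
  [line]
  (.startsWith ^String line "#"))

(def sample-lines ["# header" "a1\tb1" "# stray comment" "a2\tb2"])

;; drop-while stops at the first non-comment line, so the later
;; comment survives:
(drop-while is-comment? sample-lines)
;; => ("a1\tb1" "# stray comment" "a2\tb2")

;; remove filters out every comment line, wherever it appears:
(remove is-comment? sample-lines)
;; => ("a1\tb1" "a2\tb2")
```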

> Also, I'd prefer to read in gzip'd tab-delimited files instead of
> uncompressed tab-delimited files.  What is the idiomatic clojure way
> to do this?

The Java standard library's java.util.zip package handles gzip as well
as pkzip: wrapping the file's input stream in a
java.util.zip.GZIPInputStream should be all you need. Even in the worst
case, with no library you could use, it could be done in at least two
ways.

1. Use Runtime/exec to call shell tools to gunzip the file to a
   temporary file for processing.

2. Read up on the gzip format at Wikipedia and implement gunzip in
   Clojure, using byte arrays and whatever other tools you'd need to
   work with binary data at a low level, and/or Java's ByteBuffer and
   related classes.
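
Before reaching for either fallback, it's worth trying
java.util.zip.GZIPInputStream, which ships with the JDK. A minimal
sketch, where gzip-reader and count-data-lines are illustrative names
and comment lines are assumed to start with #:

```clojure
(require '[clojure.java.io :as io])
(import 'java.util.zip.GZIPInputStream)

(defn gzip-reader
  "Returns a BufferedReader over the decompressed contents of a .gz
  file. The caller must close it, e.g. with with-open."
  [filename]
  (io/reader (GZIPInputStream. (io/input-stream filename))))

(defn count-data-lines
  "Example consumer: counts the lines that do not start with #."
  [filename]
  (with-open [rdr (gzip-reader filename)]
    ;; count realizes the whole seq before the reader is closed
    (count (remove #(.startsWith ^String % "#") (line-seq rdr)))))
```

Anything that already works on (line-seq (reader filename)) should
work unchanged on (line-seq (gzip-reader filename)), as long as the
lines are consumed while the reader is still open.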
