On Mon, Feb 23, 2009 at 1:57 AM, Etaoin Shrdlu <shr...@unlimitedmail.org> wrote:
> On Monday 23 February 2009, 00:31, Mark Knecht wrote:
>
>> Yeah, that's probably almost usable as it is. I tried it with n=3 and
>> n=10. Worked both times just fine. The initial issue might be (as with
>> Willie's sed code) that the first line wasn't quite right and required
>> some hand editing. I'd prefer not to have to hand edit anything as the
>> files are large and that step will be slow. I can work on that.
>
> But then could you paste an example of such a line, so we can see it? The
> first line was not special in the sample you posted...
>
>> As per the message to Willie it would be nice to be able to drop
>> columns out, but technically I suppose it's not really required. All of
>> this is going into another program which must at some level understand
>> what the columns are. If I have extra dates and don't use them, that's
>> probably workable.
>
> Anyway, it's not difficult to add that feature:
>
> BEGIN { FS=OFS="," }
> {
>   r=$NF; NF--
>   for(i=1;i<n;i++){
>     s[i]=s[i+1]
>     dt[i]=dt[i+1]
>     if((NR>=n)&&(i==1)) printf "%s%s", dt[1], OFS
>     if(NR>=n) printf "%s%s", s[i], OFS
>   }
>   sep=dt[n]=""; for(i=1;i<=dropcol;i++){ dt[n]=dt[n] sep $i; sep=OFS }
>   sub("^([^,]*,){"dropcol"}","")
>   s[n]=$0
>   if(NR>=n) printf "%s,%s\n", s[n], r
> }
>
> There is a new variable "dropcol" which contains the number of columns
> to drop. Also, for the above to work, you must add the --re-interval
> command-line switch to awk, e.g.
>
> awk --re-interval -v n=4 -v dropcol=2 -f program.awk datafile.csv
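If I'm reading the dropcol change right, the sub() line is what strips
the first dropcol comma-separated fields off each row (the {dropcol}
repeat count in that pattern is why --re-interval is needed), and the
dropped fields from the oldest row in the window come back out at the
front through the dt[] array. For a plain CSV like mine, with no commas
inside the fields, I'm guessing the same dropping could also be done
ahead of time with cut, something like this (output name made up) to
throw away the first two columns:

  cut -d, -f3- awkDataIn.csv > awkDataTrimmed.csv

though then the dropped columns are gone for good rather than carried
along the way dt[] carries the dates.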
Thanks. I'll give that a try later today. I also like Willie's idea
about using cut. That seems pretty flexible without any programming.

>> The down side is the output file is 10x larger than the input file -
>> roughly - and my current input files are 40-60MB so the output files
>> will be 600MB. Not huge, but if they grew too much more I might get
>> beyond what a single file can be on ext3, right? Isn't that 2GB or so?
>
> That is strange, the output file could be bigger but not by that
> factor... if you don't mind, again could you paste a sample input file
> (maybe just some lines, to get an idea...)?

I'm attaching a small (100-line) data file out of TradeStation. Zipped
it's about 2K. It should expand to about 10K. When I run the command to
put 10 lines together it works correctly and gives me a file with 91
lines and about 100K in size. (I.e., roughly 10x on my disk.)

awk -v n=10 -f awkScript1.awk awkDataIn.csv > awkDataOut.csv

No mangling of the first line - that must have been something earlier,
I guess. Sorry for the confusion on that front.

One other item has come up as I start to play with this farther down
the tool chain. I want to use this data in either R or RapidMiner to
data mine for patterns. Both of those tools are easier to use if the
first line in the file has column titles. I had originally asked
TradeStation not to output the column titles, but if I do include them,
then for the first line of our new file I should actually copy the
first line of the input file N times. Something like: for i=1, read the
line and write it N times followed by \n; then for i>=2 do what we're
doing right now (rough sketch after my sig). After that I could run the
result through cut and drop whatever columns I need to drop, I
think... ;-)

This is great help from you all. As someone who doesn't really program
or use the command line too much it's a big advantage. Thanks!

Cheers,
Mark
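P.S. A rough, untested sketch of the header idea above - this rule
would go in front of the main block in awkScript1.awk (the names are
just the ones we've been using), and the exact title alignment for the
date and last columns would still need checking:

NR == 1 {
    # Line 1 holds the column titles: write them out n times across
    # so they roughly line up with the widened data rows, then skip
    # straight to the data. Assumes the script's BEGIN block sets
    # OFS="," like the version quoted above.
    hdr = $0
    for (i = 2; i <= n; i++) hdr = hdr OFS $0
    print hdr
    next
}

With that in place the same invocation as before (awk -v n=10 -f
awkScript1.awk awkDataIn.csv > awkDataOut.csv) should put the titles on
the first output line, and cut can still drop columns afterwards.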
awkDataIn.csv.bz2
Description: BZip2 compressed data
awkScript1.awk
Description: Binary data