Reading by positions plain text files

2010-11-30 Thread javivd
Hi all,

Sorry, newbie question:

I have a database in a plain text file (could be .txt or .dat, it's the
same) that I need to read in Python in order to do some data
validation. For other files of this kind I have used the split()
method, reading line by line. But split() relies on a separator
character (I think... all I know is that it works OK).

I now have a case in which another file has been provided (besides the
database) that tells me in which column of the file each variable is,
because there isn't any blank or tab character separating the
variables; they are stuck together. This second file specifies each
variable name and its position:


VARIABLE NAME   POSITION (COLUMN) IN FILE
var_name_1  123-123
var_name_2  124-125
var_name_3  126-126
..
..
var_name_N  512-513 (last positions)

How can I read this so that each position in the file is associated with
its variable name?

Thanks a lot!!

Javier

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Reading by positions plain text files

2010-11-30 Thread javivd
On Nov 30, 11:43 pm, Tim Harig  wrote:
> On 2010-11-30, javivd  wrote:
>
> > I now have a case in which another file has been provided (besides the
> > database) that tells me in which column of the file each variable is,
> > because there isn't any blank or tab character separating the
> > variables; they are stuck together. This second file specifies each
> > variable name and its position:
>
> > VARIABLE NAME      POSITION (COLUMN) IN FILE
> > var_name_1                 123-123
> > var_name_2                 124-125
> > var_name_3                 126-126
> > ..
> > ..
> > var_name_N                 512-513 (last positions)
>
> I am unclear on the format of these positions.  They do not look like
> what I would expect from absolute references in the data.  For instance,
> 123-123 may only contain one byte??? which could change for different
> encodings and how you mark line endings.  Frankly, the use of the
> word "columns" in the header suggests that the data *is* separated by
> line endings rather than absolute position and the position refers to
> the line number. In which case, you can use splitlines() to break up
> the data and then address the proper line by index.  Nevertheless,
> you can use file.seek() to move to an absolute offset in the file,
> if that really is what you are looking for.

I work in a survey research firm. The data I'm talking about has a lot
of 0-1 variables, meaning yes or no answers to a lot of questions. So only
one character position (not byte) is needed, which explains the 123-123
kind of positions for a lot of variables.
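
A minimal sketch of how such a position range could map onto a Python
slice, assuming the positions in the layout file are 1-based and
inclusive (the thread does not say so; it is only a common convention in
survey codebooks). The helper name span_to_slice and the sample record
are hypothetical, and the code is written for Python 3.

```python
def span_to_slice(span):
    """Turn a 'start-end' range (1-based, inclusive; an assumption) into a slice."""
    start, _, end = span.partition("-")
    start = int(start)
    end = int(end) if end else start  # a bare "123" means a single column
    return slice(start - 1, end)      # 1-based inclusive -> 0-based half-open


# Hypothetical record: columns 1-3 hold an ID, column 4 a yes/no answer.
record = "0421"
print(record[span_to_slice("1-3")])  # 042
print(record[span_to_slice("4-4")])  # 1
```

A range such as 123-123 then selects exactly one character, which matches
the 0-1 variables described above.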

And no, MRAB, it's not the same problem (at least as I understood
it). I have to associate the positions this file gives me with the
variable names this file gives me for those positions.

Thank you both, and sorry for my English!

J


Re: Reading by positions plain text files

2010-12-03 Thread javivd
On Dec 1, 3:15 am, Tim Harig  wrote:
> On 2010-12-01, javivd  wrote:
>
> > I work in a survey research firm. The data I'm talking about has a lot
> > of 0-1 variables, meaning yes or no answers to a lot of questions. So only
> > one character position (not byte) is needed, which explains the 123-123
> > kind of positions for a lot of variables.
>
> Then file.seek() is what you are looking for; but, you need to be aware of
> line endings and encodings as indicated.  Make sure that you open the file
> using whatever encoding was used when it was generated or you could have
> problems with multibyte characters affecting the offsets.
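
A small sketch of why the encoding matters, assuming Python 3 and a
hypothetical UTF-8 record: character positions and byte offsets stop
agreeing as soon as a multibyte character appears, so slicing the decoded
text is safer than seeking by byte offset.

```python
# Hypothetical record: a 4-character name followed by a 1-character answer.
record = "Pe\u00f1a1"          # "Pe", n with tilde, "a", then the answer "1"
raw = record.encode("utf-8")   # the accented letter occupies two bytes

print(len(record), len(raw))   # 5 6

# Character position 4 (0-based) in the decoded text is the answer...
print(record[4])               # 1

# ...but byte offset 4 in the raw data is still inside the name.
print(raw[4:5])                # b'a'
```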

Ok, I will try it and let you know. Thanks all!!


Re: Reading by positions plain text files

2010-12-12 Thread javivd
On Dec 1, 7:15 am, Tim Harig  wrote:
> On 2010-12-01, javivd  wrote:
>
> > I work in a survey research firm. The data I'm talking about has a lot
> > of 0-1 variables, meaning yes or no answers to a lot of questions. So only
> > one character position (not byte) is needed, which explains the 123-123
> > kind of positions for a lot of variables.
>
> Then file.seek() is what you are looking for; but, you need to be aware of
> line endings and encodings as indicated.  Make sure that you open the file
> using whatever encoding was used when it was generated or you could have
> problems with multibyte characters affecting the offsets.

I've tried your advice, but something is wrong. Here is my code:



f = open(r'c:\somefile.txt', 'w')
f.write('0123456789\n0123456789\n0123456789')
f.close()

f = open(r'c:\somefile.txt', 'r')

for line in f:
    f.seek(3,0)
    print f.read(1) #just to know if it's printing the right column

I used .seek() in this manner, but it is not working.

Let me put the problem another way. I have a .txt file with NO
headers and NO blanks between any columns. But I know that columns
13 to 15, say, hold variable VARNAME_1 (a three-digit variable, of
course). How can I extract that column into a list called VARNAME_1?

Obviously, this should extend to all the positions and variables I
have to extract from the file.
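
One way this could be sketched, assuming Python 3, 1-based inclusive
column numbers, and a layout in which VARNAME_1 occupies columns 13 to
15. The second variable, the sample records, and the function name
read_columns are made up for illustration.

```python
import io


def read_columns(lines, layout):
    """Collect each variable's values, one list per variable name.

    layout maps a name to (first, last) column numbers, assumed to be
    1-based and inclusive.
    """
    columns = {name: [] for name in layout}
    for line in lines:
        line = line.rstrip("\r\n")
        for name, (first, last) in layout.items():
            columns[name].append(line[first - 1:last])
    return columns


# Hypothetical layout and data; a real file would come from open(path).
layout = {"VARNAME_1": (13, 15), "VARNAME_2": (16, 16)}
data = io.StringIO("A" * 12 + "1230\n" + "B" * 12 + "4561\n")

columns = read_columns(data, layout)
print(columns["VARNAME_1"])  # ['123', '456']
print(columns["VARNAME_2"])  # ['0', '1']
```

Keeping the lists in a dictionary keyed by variable name avoids creating
one Python variable per survey question.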

Thanks!

J


Re: Reading by positions plain text files

2010-12-13 Thread javivd
On Dec 12, 11:21 pm, Dennis Lee Bieber  wrote:
> On Sun, 12 Dec 2010 07:02:13 -0800 (PST), javivd
>  declaimed the following in
> gmane.comp.python.general:
>
>
>
> > f = open(r'c:\somefile.txt', 'w')
>
> > f.write('0123456789\n0123456789\n0123456789')
>
>         Not the most explanatory sample data... It would be better if the
> records had different contents.
>
> > f.close()
>
> > f = open(r'c:\somefile.txt', 'r')
>
> > for line in f:
>
>         Here you extract one "line" from the file
>
> >     f.seek(3,0)
> >     print f.read(1) #just to know if it's printing the right column
>
>         And here you ignored the entire line you read, seeking to the fourth
> byte from the beginning of the file, and reading just one byte from it.
>
>         I have no idea of how seek()/read() behaves relative to line
> iteration in the for loop... Given the small size of the test data set
> it is quite likely that the first "for line in f" resulted in the entire
> file being read into a buffer, and that buffer scanned to find the line
> ending and return the data preceding it; then the buffer position is set
> to after that line ending so the next "for line" continues from that
> point.
>
>         But in a situation with a large data set, or an unbuffered I/O
> system, the seek()/read() could easily result in resetting the file
> position used by the "for line", so that the second call returns
> "456789\n"... And all subsequent calls too, resulting in an infinite
> loop.
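
The difference can be sketched as follows, assuming Python 3 and an
in-memory stand-in for the file (io.StringIO, with different contents per
record so the result is easier to check). Rather than seeking on the file
object inside the loop, the sketch indexes into the line that the loop
has already read.

```python
import io

# In-memory stand-in for the data file; each record differs.
f = io.StringIO("0123456789\nabcdefghij\nABCDEFGHIJ\n")

fourth_chars = []
for line in f:
    # Index the line already read instead of calling f.seek(3, 0), which
    # moves to an absolute offset from the start of the file.
    fourth_chars.append(line[3])

print(fourth_chars)  # ['3', 'd', 'D']
```
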
>
>         Presuming the assignment requires pulling multiple selected fields
> from individual records, where each record is of the same
> format/spacing, AND that the field selection can not be preprogrammed...
>
> Sample data file (use fixed width font to view):
> -=-=-=-=-=-
> Wulfraed       09Ranger  1915
> Bask Euren     13Cleric  1511
> Aethelwulf     07Mage    0908
> Cwiculf        08Mage    1008
> -=-=-=-=-=-
>
> Sample format definition file:
> -=-=-=-=-=-
> Name    0-14
> Level   15-16
> Class   17-24
> THAC0   25-26
> Armor   27-28
> -=-=-=-=-=-
>
> Code to process (Python 2.5, with minimal error handling):
> -=-=-=-=-=-
>
> class Extractor(object):
>     def __init__(self, formatFile):
>         ff = open(formatFile, "r")
>         self._format = {}
>         self._length = 0
>         for line in ff:
>             form = line.split("\t") #file must be tab separated
>             if len(form) != 2:
>                 print "Invalid file format definition: %s" % line
>                 continue
>             name = form[0]
>             columns = form[1].split("-")
>             if len(columns) == 1:   #single column definition
>                 start = int(columns[0])
>                 end = start
>             elif len(columns) == 2:
>                 start = int(columns[0])
>                 end = int(columns[1])
>             else:
>                 print "Invalid column definition: %s" % form[1]
>                 continue
>             self._format[name] = (start, end)
>             self._length = max(self._length, end + 1)  #end is inclusive
>         ff.close()
>
>     def __call__(self, line):
>         data = {}
>         if len(line) < self._length:
>             print "Data line is too short for required format: ignored"
>         else:
>             for (name, (start, end)) in self._format.items():
>                 data[name] = line[start:end+1]
>         return data
>
> if __name__ == "__main__":
>     FORMATFILE = "SampleFormat.tsv"
>     DATAFILE = "SampleData.txt"
>
>     characterExtractor = Extractor(FORMATFILE)
>
>     df = open(DATAFILE, "r")
>     for line in df:
>         fields = characterExtractor(line)
>         for (name, value) in fields.items():
>             print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
>         print
>
>     df.close()
> -=-=-=-=-=-
>
> Output from running above code:
> -=-=-=-=-=-
> Field name: 'Armor'             value: '15'
> Field name: 'THAC0'             value: '19'
> Field name: 'Level'             value: '09'
> Field name: 'Class'             value: 'Ranger  '
> Field name: 'Name'              value: 'Wulfraed       '
>
> Field name: 'Armor'             value: '11'
> Field name: 'THAC0'             value: '15'
> Field name: