On 2019-07-10 08:57:29 -0400, kamaraju kusumanchi wrote:
> Given a csv file with the following contents
>
> 20180701, A
> 20180702, A, B
> 20180703, A, B, C
> 20180704, B, C
> 20180705, C
>
> I would like to transform the underlying data into a dataframe such as
>
> date, A, B, C
> 20180701, True, False, False
> 20180702, True, True, False
> 20180703, True, True, True
> 20180704, False, True, True
> 20180705, False, False, True
>
> the idea is that the first field in each line of the csv is the row
> index of the dataframe. The subsequent fields will be its column names
> and the values in the dataframe tell whether that element is present
> or not in the line.
>
> Is there a name for this transformation?

This type of output is usually called a cross table, but I don't know
whether this specific transformation has a name (if you had only one of
A, B, and C per line it would be a kind of pivot operation).

> Any existing code/library
> that can transform data back and forth between the two formats? I can
> write one myself if there is none but trying to avoid reinventing the
> wheel if possible.

I need to produce cross tables frequently, but I never bothered to turn
the code into a library, because the part that is common (maintaining
two hashes and dumping them) is so much smaller than the parts that
differ (data source and format, what information to extract, output
format).

The basic idea is that you use a dict of dict of (whatever) to represent
your output matrix: row keys are the first level, column keys are the
second level. The cell type in your case is bool, so you could use a
set for each row instead of a dict of bool. Often you want to keep
information about each row (e.g. order of appearance, or a count), so
you'll use a second dict for that.

For output, you get a list of columns in the right order and then
iterate over the first-level keys of your dict and the list of columns
to access each cell.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at        | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
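
The dict-of-sets approach described in the reply can be sketched roughly
as follows. This is only an illustration under some assumptions, not the
poster's actual code: the function names (read_presence, to_cross_table,
from_cross_table) and the SAMPLE constant are made up for this example,
the header label "date" is taken from the question, columns are simply
sorted alphabetically, and only the standard csv module is used.

```python
import csv
import io

# Sample input copied from the question; SAMPLE is a hypothetical name.
SAMPLE = """\
20180701, A
20180702, A, B
20180703, A, B, C
20180704, B, C
20180705, C
"""


def read_presence(lines):
    """Parse 'key, item, item, ...' lines into ({key: set(items)}, columns)."""
    rows = {}        # first level: row key -> set of column keys present
    columns = set()  # every column key seen anywhere in the input
    for record in csv.reader(lines, skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        key, items = record[0], record[1:]
        rows.setdefault(key, set()).update(items)
        columns.update(items)
    # Assumption: alphabetical column order is what the caller wants.
    return rows, sorted(columns)


def to_cross_table(rows, columns):
    """Render the dict of sets as CSV text with one True/False cell per column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", *columns])
    for key, present in rows.items():  # dicts keep insertion order
        writer.writerow([key, *(col in present for col in columns)])
    return out.getvalue()


def from_cross_table(text):
    """Inverse direction: read the True/False table back into a dict of sets."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    columns = next(reader)[1:]  # drop the row-key header cell
    rows = {}
    for record in reader:
        if not record:
            continue
        rows[record[0]] = {
            col for col, cell in zip(columns, record[1:]) if cell == "True"
        }
    return rows, columns


if __name__ == "__main__":
    rows, columns = read_presence(SAMPLE.splitlines())
    print(to_cross_table(rows, columns), end="")
```

Keeping the rows in a plain dict preserves their order of appearance, so
no second dict is needed here; a count or other per-row information
would go into a separate dict keyed by the same row keys.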
-- https://mail.python.org/mailman/listinfo/python-list