A data transformation framework. A presentation inviting commentary.

F.R. Wed, 21 Aug 2013 09:34:08 -0700

Hi all,

In an effort to do some serious cleaning up of a hopelessly clutteredworking environment, I developed a modular data transformation systemthat pretty much stands. I am very pleased with it. I expect huge timesavings. I would share it, if had a sense that there is an interest outthere and would appreciate comments. Here's a description. I named themodule TX:

The nucleus of the TX system is a Transformer class, a wrapper for anykind of transformation functionality. The Transformer takes input ascalling argument and returns it transformed. This design allows theassembly of transformation chains, either nesting calls or better, usingthe class Chain, derived from 'Transformer' and 'list'. A Chain consistsof a sequence of Transformers and is functionally equivalent to anindividual Transformer. A high degree of modularity results: Chainsnest. Another consequence is that many transformation tasks can behandled with a relatively modest library of a few basic prefabricatedTransformers from which many different Chains can be assembled on thefly. A custom Transformer to bridge an eventual gap is quickly writtenand tested, because the task likely is trivial.A good analogy of the TX methodology is a road map with townsscattered all over it and highways connecting them. To get from any townto any other one is a simple matter of hopping the towns in between. TheTX equivalent of the towns are data formats, the equivalent of thehighways are TX Transformers. They are not so much thought of in termsof what they do than in terms of the formats they take and give.Designing a library of Transformers is essentially a matter ofestablishing a collection of standard data formats. First the towns,then the highways.A feature of the TX Transformer is that it retains both its inputand output. This makes a Chain a breeze to build progressively, link bylink, and also makes debugging easy: If a Chain doesn't work, Chain.show() reveals the failing link as the first one that has no output. It canbe replaced with a corrected instance, as one would replace a blownfuse. Running the Chain again without input makes it have another try.Parameter passing runs on a track that is completely separate fromthe payload track. Parameters can be set in order to configure a Chainprior to running it, or can be sent at runtime by individualTransformers to its siblings and their progeny. Parameters are keyed andget picked up by those Chain links whose set of pre-defined keysincludes the parameter's key. Unintended pick-ups with coincidentallyshared keys for unrelated parameters can be prevented by addressingparameters to individual Translators.

Below an application example. Five custom classes at the end exemplifythe pattern. I join the post also as attachment, in case someauto-line-wrap messes up this text.


Commentary welcome

Frederic



----------------------------------------------------------------------------------------------------

An example of use: Download historic stock quotes from Yahoo Finance fora given range of dates and a list of symbols, delete a column and addthree, insert the data in a MySQL table. Also write them to temporaryfiles in tabular form for verification."make_quotes_writer ()" returns a custom transformation tree."run_quotes ()" makes such a tree, sets it on a given time range andruns it on a list of symbols.(Since Yahoo publishes the data for downloading, I presume it'sokay to do it this way. This is a demo of TX, however, and should not bemisconstrued as an encouragement to violate any publisher's terms ofservice.)



import TX, yahoo_historic_quotes as yhq

def make_quotes_writer ():

     Visualizer = TX.Chain (
          yhq.percent (),
          TX.Table_Maker (has_header = True),
          TX.Table_Writer (),
          name = 'Visualizer'
     )

To_DB = TX.Chain (yhq.header_stripper(), TX.DB_Writer(table_name ='quotes'), name = 'To DB')


     To_File = TX.Chain (Visualizer, TX.File_Writer (), name = 'To File')

     Splitter = TX.Splitter (To_DB, To_File, name = 'Splitter')

     Quotes = TX.Chain (
          yhq.yahoo_quotes (),
          TX.CSV_To_List (delimiter = ','),
          TX.Numerizer (),
          yhq.wiggle_and_trend (),
          yhq.symbol (),
          Splitter,
          name = 'Quotes'
     )

     return Quotes


>>> Quotes = make_quotes_writer ()
>>> Quotes.show_tree()

Quotes
Quotes[0] - Yahoo Quotes
Quotes[1] - CSV To List
Quotes[2] - Numerizer
Quotes[3] - Wiggle and Trend
Quotes[4] - Symbol
Quotes[5] - Splitter
Quotes[5][0] - To DB
Quotes[5][0][0] - Header Stripper
Quotes[5][0][1] - DB Writer
Quotes[5][1] - To File
Quotes[5][1][0] - Visualizer
Quotes[5][1][0][0] - Percent
Quotes[5][1][0][1] - Table Maker
Quotes[5][1][0][2] - Table Writer
Quotes[5][1][1] - File Writer


def run_quotes (symbols, from_date = '1970-01-01', to_date = '2099-12-31'):
    '''Downloads historic stock quotes from Yahoo Finance.

Inserts data into MySQL table "quotes" relying on index to rejectrepeat insertions.Also writes the data in a table format to a temporary file, one foreach symbol.

    '''
    Quotes = make_quotes_writer ()
    Quotes.set (from_date = from_date, to_date = to_date)
    FW = Quotes [5][1][1]  # File Writer
    for symbol in symbols:
         outfilename = '/tmp/%s' % symbol
         FW.set (terminal = outfilename)
         Quotes.set (symbol = symbol)
         Quotes ()


>>> run_quotes (('MO', 'SU'), '2013-08-10')

Checking:

>>> DBRC = TX.DB_Run_Command ()

>>> for record in DBRC ("select * from quotes where symbol in('MO','SU') and date >= 20130810;"):

        print record

(datetime.date(2013, 8, 16), 'MO', 34.51, 34.63, 34.25, 34.29, 8364900L,0.0110337, -0.00638792)(datetime.date(2013, 8, 15), 'MO', 34.78, 34.93, 34.5, 34.57, 15144400L,0.0123866, -0.00604926)(datetime.date(2013, 8, 14), 'MO', 35.23, 35.25, 34.81, 35.06, 5871900L,0.0125607, -0.00485298)(datetime.date(2013, 8, 13), 'MO', 35.17, 35.27, 35.01, 35.22, 5919700L,0.00739898, 0.00142288)(datetime.date(2013, 8, 12), 'MO', 35.25, 35.28, 35.05, 35.16, 4884300L,0.00654059, -0.00255936)(datetime.date(2013, 8, 16), 'SU', 33.84, 34.63, 33.81, 34.18, 7396600L,0.0239626, 0.00993571)(datetime.date(2013, 8, 15), 'SU', 32.84, 33.99, 32.7, 33.94, 7747100L,0.0386865, 0.0329885)(datetime.date(2013, 8, 14), 'SU', 32.06, 32.91, 32.02, 32.72, 4690200L,0.0274141, 0.0203296)(datetime.date(2013, 8, 13), 'SU', 32.04, 32.18, 31.94, 32.01, 2051900L,0.00748596, -0.000935745)(datetime.date(2013, 8, 12), 'SU', 32.15, 32.35, 31.99, 32.15, 2870500L,0.0111906, 0.0)


>>> FR = TX.File_Reader ()
>>> print FR ('/tmp/SU')

2013-08-16 | SU | 33.84 | 34.63 | 33.81 | 34.18 | 7396600.00 |2.40% | +0.99% |2013-08-15 | SU | 32.84 | 33.99 | 32.70 | 33.94 | 7747100.00 |3.87% | +3.30% |2013-08-14 | SU | 32.06 | 32.91 | 32.02 | 32.72 | 4690200.00 |2.74% | +2.03% |2013-08-13 | SU | 32.04 | 32.18 | 31.94 | 32.01 | 2051900.00 |0.75% | -0.09% |2013-08-12 | SU | 32.15 | 32.35 | 31.99 | 32.15 | 2870500.00 |1.12% | +0.00% |



----------------------------------------------------------------------------------------------------

Each Transformer retains input and output, freezing the state of acomposite after each run. This makes developing step by step anddebugging a breeze. If a Chain fails, the failing link is the first oneto have no output. Chain.show () displays them all in sequence. Thedeficient Transformer can be replaced with a fixed instance and theChain run again without input for another try.


>>> Quotes.replace (Fixed_Numerizer, 2)
>>> catch = Quotes ()  # catch removes the risk of flooding the display


----------------------------------------------------------------------------------------------------

These are the five custom classes. The decorator "setup" updatesparameters and prevents needless reruns comparing time stamps.



class wiggle_and_trend (TX.Transformer):

'''Deletes column "Adj Close" and adds columns "Wiggle" and"Trend". The values areratios: (day's high - day's low) / mean, and (day's close - day'sopen) / mean.

    '''
    name = 'Wiggle and Trend'
    @TX.setup
    def transform (self):
        input = self.Input.data
        output = []
        output.append (tuple (input[0][:-1]) + ('Wiggle', 'Trend'))
        for i in range (1, len (input)):
            date, open, high, low, close, vol, adj = input [i]
            wiggle = high - low
            mean = (high + low) / 2.0
            wiggle_ratio = wiggle / mean
            trend = close - open
            trend_ratio = trend / mean

output.append ((date, open, high, low, close, vol,wiggle_ratio, trend_ratio))

        self.Output.take (output)


class symbol (TX.Transformer):
    'Adds a column Symbol'
    name = 'Symbol'
    def __init__ (self):
        TX.Transformer.__init__ (self, symbol = None)
    @TX.setup
    def transform (self):
        symbol = self.get ('symbol')
        if not symbol:
            self.log ('No symbol!')
        else:
            input = self.Input.data
            output = []
            output.append ((input[0][0], 'Symbol') + input [0][1:])
            for i in range (1, len (input)):
                output.append ((input[i][0], symbol) + input [i][1:])
            self.Output.take (output)


class percent (TX.Transformer):
    'Converts float ratio to percent for better legibility'
    name = 'Percent'
    @TX.setup
    def transform (self):
        input = self.Input.data
        output = [input [0]]
        for i in range (1, len (input)):
            wiggle = '%5.2f%%' % (input [i][7] * 100.0)
            trend =  '%+5.2f%%' % (input [i][8] * 100.0)
            output.append (input [i][:7] + (wiggle, trend))
        self.Output.take (output)


class header_stripper (TX.Transformer):
    'Header names are not meant for insertion into data base table'
    name = 'Header Stripper'
    @TX.setup
    def transform (self):
        self.Output.take (self.Input.data [1:])



class yahoo_quotes (TX.WWW_Reader):

    'Gets historic stock quotes from Yahoo Finance'

    import urllib
    name = 'Yahoo Quotes'
    URL_TRUNK = 'http://ichart.finance.yahoo.com/table.csv'

def __init__ (self, from_date = '1970-01-01', to_date ='2099-12-31', **keywords):

      private_keys = {
        'symbol' : None,
        'from_date' : from_date,
        'to_date' : to_date,
        'url_trunk' : self.URL_TRUNK,
      }
      private_keys.update (keywords)
      TX.WWW_Reader.__init__ (self, **private_keys)

    def make_url (self):
        fy, fm, fd = self.get ('from_date').split ('-')
        ty, tm, td = self.get ('to_date').split ('-')
        fm = '%0d' % (int (fm) - 1)
        tm = '%0d' % (int (tm) - 1)
        symbol = self.get ('symbol')
        qs = (
            ('s',symbol),
            ('a',fm),('b',fd),('c',fy),
            ('d',tm),('e',td),('f',ty),
            ('g','d'),
            ('ignore', '.csv'),
        )
        self.set (url_parameters = qs)
        TX.WWW_Reader.make_url (self)


----------------------------------------------------------------------------------------------------

Let me know what you think.

Hi all,

In an effort to do some serious cleaning up of a hopelessly cluttered working 
environment, I developed a modular data transformation system that pretty much 
stands. I am very pleased with it. I expect huge time savings. I would share 
it, if had a sense that there is an interest out there and would appreciate 
comments. Here's a description. I named the module TX:

The nucleus of the TX system is a Transformer class, a wrapper for any kind of 
transformation functionality. The Transformer takes input as calling argument 
and returns it transformed. This design allows the assembly of transformation 
chains, either nesting calls or better, using the class Chain, derived from 
'Transformer' and 'list'. A Chain consists of a sequence of Transformers and is 
functionally equivalent to an individual Transformer. A high degree of 
modularity results: Chains nest. Another consequence is that many 
transformation tasks can be handled with a relatively modest library of a few 
basic prefabricated Transformers from which many different Chains can be 
assembled on the fly. A custom Transformer to bridge an eventual gap is quickly 
written and tested, because the task likely is trivial.
    A good analogy of the TX methodology is a road map with towns scattered all 
over it and highways connecting them. To get from any town to any other one is 
a simple matter of hopping the towns in between. The TX equivalent of the towns 
are data formats, the equivalent of the highways are TX Transformers. They are 
not so much thought of in terms of what they do than in terms of the formats 
they take and give. Designing a library of Transformers is essentially a matter 
of establishing a collection of standard data formats. First the towns, then 
the highways.
    A feature of the TX Transformer is that it retains both its input and 
output. This makes a Chain a breeze to build progressively, link by link, and 
also makes debugging easy: If a Chain doesn't work, Chain.show () reveals the 
failing link as the first one that has no output. It can be replaced with a 
corrected instance, as one would replace a blown fuse. Running the Chain again 
without input makes it have another try.
    Parameter passing runs on a track that is completely separate from the 
payload track. Parameters can be set in order to configure a Chain prior to 
running it, or can be sent at runtime by individual Transformers to its 
siblings and their progeny. Parameters are keyed and get picked up by those 
Chain links whose set of pre-defined keys includes the parameter's key. 
Unintended pick-ups with coincidentally shared keys for unrelated parameters 
can be prevented by addressing parameters to individual Translators.

Below an application example. Five custom classes at the end exemplify the 
pattern. I join the post also as attachment, in case some auto-line-wrap messes 
up this text. 

Commentary welcome

Frederic



----------------------------------------------------------------------------------------------------

An example of use: Download historic stock quotes from Yahoo Finance for a 
given range of dates and a list of symbols, delete a column and add three, 
insert the data in a MySQL table. Also write them to temporary files in tabular 
form for verification.
    "make_quotes_writer ()" returns a custom transformation tree. "run_quotes 
()" makes such a tree, sets it on a given time range and runs it on a list of 
symbols.
    (Since Yahoo publishes the data for downloading, I presume it's okay to do 
it this way. This is a demo of TX, however, and should not be misconstrued as 
an encouragement to violate any publisher's terms of service.)


import TX, yahoo_historic_quotes as yhq

def make_quotes_writer ():                                 

     Visualizer = TX.Chain (
          yhq.percent (),
          TX.Table_Maker (has_header = True),
          TX.Table_Writer (),       
          name = 'Visualizer'
     )
   
     To_DB = TX.Chain (yhq.header_stripper(), TX.DB_Writer(table_name = 
'quotes'), name = 'To DB')
   
     To_File = TX.Chain (Visualizer, TX.File_Writer (), name = 'To File')
   
     Splitter = TX.Splitter (To_DB, To_File, name = 'Splitter')
   
     Quotes = TX.Chain (
          yhq.yahoo_quotes (),
          TX.CSV_To_List (delimiter = ','),
          TX.Numerizer (),
          yhq.wiggle_and_trend (),
          yhq.symbol (),
          Splitter,
          name = 'Quotes'
     )   
   
     return Quotes


>>> Quotes = make_quotes_writer ()
>>> Quotes.show_tree()

Quotes
Quotes[0] - Yahoo Quotes
Quotes[1] - CSV To List
Quotes[2] - Numerizer
Quotes[3] - Wiggle and Trend
Quotes[4] - Symbol
Quotes[5] - Splitter
Quotes[5][0] - To DB
Quotes[5][0][0] - Header Stripper
Quotes[5][0][1] - DB Writer
Quotes[5][1] - To File
Quotes[5][1][0] - Visualizer
Quotes[5][1][0][0] - Percent
Quotes[5][1][0][1] - Table Maker
Quotes[5][1][0][2] - Table Writer
Quotes[5][1][1] - File Writer


def run_quotes (symbols, from_date = '1970-01-01', to_date = '2099-12-31'):
    '''Downloads historic stock quotes from Yahoo Finance.
    Inserts data into MySQL table "quotes" relying on index to reject repeat 
insertions.
    Also writes the data in a table format to a temporary file, one for each 
symbol.
    '''
    Quotes = make_quotes_writer ()
    Quotes.set (from_date = from_date, to_date = to_date)
    FW = Quotes [5][1][1]  # File Writer
    for symbol in symbols:
         outfilename = '/tmp/%s' % symbol
         FW.set (terminal = outfilename)
         Quotes.set (symbol = symbol)
         Quotes ()


>>> run_quotes (('MO', 'SU'), '2013-08-10')

Checking:

>>> DBRC = TX.DB_Run_Command ()
>>> for record in DBRC ("select * from quotes where symbol in ('MO','SU') and 
>>> date >= 20130810;"):
        print record

(datetime.date(2013, 8, 16), 'MO', 34.51, 34.63, 34.25, 34.29, 8364900L, 
0.0110337, -0.00638792)
(datetime.date(2013, 8, 15), 'MO', 34.78, 34.93, 34.5, 34.57, 15144400L, 
0.0123866, -0.00604926)
(datetime.date(2013, 8, 14), 'MO', 35.23, 35.25, 34.81, 35.06, 5871900L, 
0.0125607, -0.00485298)
(datetime.date(2013, 8, 13), 'MO', 35.17, 35.27, 35.01, 35.22, 5919700L, 
0.00739898, 0.00142288)
(datetime.date(2013, 8, 12), 'MO', 35.25, 35.28, 35.05, 35.16, 4884300L, 
0.00654059, -0.00255936)
(datetime.date(2013, 8, 16), 'SU', 33.84, 34.63, 33.81, 34.18, 7396600L, 
0.0239626, 0.00993571)
(datetime.date(2013, 8, 15), 'SU', 32.84, 33.99, 32.7, 33.94, 7747100L, 
0.0386865, 0.0329885)
(datetime.date(2013, 8, 14), 'SU', 32.06, 32.91, 32.02, 32.72, 4690200L, 
0.0274141, 0.0203296)
(datetime.date(2013, 8, 13), 'SU', 32.04, 32.18, 31.94, 32.01, 2051900L, 
0.00748596, -0.000935745)
(datetime.date(2013, 8, 12), 'SU', 32.15, 32.35, 31.99, 32.15, 2870500L, 
0.0111906, 0.0)

>>> FR = TX.File_Reader ()
>>> print FR ('/tmp/SU')

 Date       | Symbol | Open  | High  | Low   | Close | Volume     | Wiggle | 
Trend  |

 2013-08-16 | SU     | 33.84 | 34.63 | 33.81 | 34.18 | 7396600.00 | 2.40%  | 
+0.99% |
 2013-08-15 | SU     | 32.84 | 33.99 | 32.70 | 33.94 | 7747100.00 | 3.87%  | 
+3.30% |
 2013-08-14 | SU     | 32.06 | 32.91 | 32.02 | 32.72 | 4690200.00 | 2.74%  | 
+2.03% |
 2013-08-13 | SU     | 32.04 | 32.18 | 31.94 | 32.01 | 2051900.00 | 0.75%  | 
-0.09% |
 2013-08-12 | SU     | 32.15 | 32.35 | 31.99 | 32.15 | 2870500.00 | 1.12%  | 
+0.00% |


----------------------------------------------------------------------------------------------------


Each Transformer retains input and output, freezing the state of a composite 
after each run. This makes developing step by step and debugging a breeze. If a 
Chain fails, the failing link is the first one to have no output. Chain.show () 
displays them all in sequence. The deficient Transformer can be replaced with a 
fixed instance and the Chain run again without input for another try. 

>>> Quotes.replace (Fixed_Numerizer, 2)
>>> catch = Quotes ()  # catch removes the risk of flooding the display
 

----------------------------------------------------------------------------------------------------

These are the five custom classes. The decorator "setup" updates parameters and 
prevents needless reruns comparing time stamps.


class wiggle_and_trend (TX.Transformer):
    '''Deletes column "Adj Close" and adds columns "Wiggle" and "Trend". The 
values are
    ratios: (day's high - day's low) / mean, and (day's close - day's open) / 
mean.
    '''
    name = 'Wiggle and Trend'
    @TX.setup
    def transform (self):
        input = self.Input.data
        output = []
        output.append (tuple (input[0][:-1]) + ('Wiggle', 'Trend'))
        for i in range (1, len (input)):
            date, open, high, low, close, vol, adj = input [i]
            wiggle = high - low
            mean = (high + low) / 2.0
            wiggle_ratio = wiggle / mean
            trend = close - open
            trend_ratio = trend / mean
            output.append ((date, open, high, low, close, vol, wiggle_ratio, 
trend_ratio))
        self.Output.take (output)


class symbol (TX.Transformer):
    'Adds a column Symbol'
    name = 'Symbol'
    def __init__ (self):
        TX.Transformer.__init__ (self, symbol = None)
    @TX.setup
    def transform (self):
        symbol = self.get ('symbol')
        if not symbol:
            self.log ('No symbol!')
        else:
            input = self.Input.data
            output = []
            output.append ((input[0][0], 'Symbol') + input [0][1:])
            for i in range (1, len (input)):
                output.append ((input[i][0], symbol) + input [i][1:])
            self.Output.take (output)


class percent (TX.Transformer):
    'Converts float ratio to percent for better legibility'
    name = 'Percent'
    @TX.setup
    def transform (self):       
        input = self.Input.data
        output = [input [0]]
        for i in range (1, len (input)):
            wiggle = '%5.2f%%' % (input [i][7] * 100.0)
            trend =  '%+5.2f%%' % (input [i][8] * 100.0)
            output.append (input [i][:7] + (wiggle, trend))
        self.Output.take (output)


class header_stripper (TX.Transformer):
    'Header names are not meant for insertion into data base table'
    name = 'Header Stripper'
    @TX.setup
    def transform (self):
        self.Output.take (self.Input.data [1:])



class yahoo_quotes (TX.WWW_Reader):

    'Gets historic stock quotes from Yahoo Finance'

    import urllib
    name = 'Yahoo Quotes'
    URL_TRUNK = 'http://ichart.finance.yahoo.com/table.csv'

    def __init__ (self, from_date = '1970-01-01', to_date = '2099-12-31', 
**keywords):
      private_keys = {
        'symbol' : None,
        'from_date' : from_date,
        'to_date' : to_date,
        'url_trunk' : self.URL_TRUNK,
      }
      private_keys.update (keywords)
      TX.WWW_Reader.__init__ (self, **private_keys)

    def make_url (self):
        fy, fm, fd = self.get ('from_date').split ('-')
        ty, tm, td = self.get ('to_date').split ('-')
        fm = '%0d' % (int (fm) - 1)
        tm = '%0d' % (int (tm) - 1)
        symbol = self.get ('symbol')
        qs = (
            ('s',symbol),
            ('a',fm),('b',fd),('c',fy),
            ('d',tm),('e',td),('f',ty),
            ('g','d'),
            ('ignore', '.csv'),
        )
        self.set (url_parameters = qs)  
        TX.WWW_Reader.make_url (self)

   
----------------------------------------------------------------------------------------------------

Let me know what you think.

-- 
http://mail.python.org/mailman/listinfo/python-list

A data transformation framework. A presentation inviting commentary.

Reply via email to