On 10/27/2013 01:31 AM, Nick the Gr33k wrote:
> On 27/10/2013 6:00 AM, ru...@yahoo.com wrote:
>[...]
[following quote lightly edited for clarity]
> I almost understand your code, but this part is not so clear to me:
>
>   key = host, city, useros, browser
>   if key not in seen:
>       newdata.append( [host, city, useros, browser, [ref], hits, [visit]] )
>       seen[key] = len( newdata ) - 1   # Save index (for 'newdata') of this row.
>   else:   # This row is a duplicate row with a different referrer & visit time.
>       rowindex = seen[key]
>       newdata[rowindex][4].append( ref )
>       newdata[rowindex][5] += hits
>       newdata[rowindex][6].append( visit )

I'm not sure exactly which part is not clear to you, so I'll give you a very long-winded explanation and you can ignore any parts that are already obvious to you.

The code above is inside a loop that looks at each row in <data>. In <data> there can be several rows for the same visitor, where you define a visitor as a unique combination of <host>, <city>, <useros> and <browser>. What you want to do is combine all of the rows that are for the same visitor into one row. That one row, instead of having a single value for <ref> and <lastvisit>, will have lists of all the <ref>s and <lastvisit>s from all the rows that have the same visitor value.

So first, for each row, we set <key> to a tuple that identifies the visitor. (Actually, I should have named that variable "visitor" instead of "key".) Then we use an ordinary Python dictionary <seen> to record each visitor as we see them. Remember that a dictionary can use a tuple as a key (unlike Perl, where a hash key has to be a string).

For each row we look in the dictionary <seen> to see if this visitor is a new one that we haven't seen before. If we haven't seen them before, we create a new row in <newdata> for them that is a copy of the row in <data>, except that we change the <ref> and <lastvisit> fields from single values to lists. We also add an entry to <seen> whose key is the visitor and whose value is the index of the visitor's row in <newdata>.

If the visitor *was* seen before (because we find an entry for the visitor in <seen>), then the value of that entry tells us the index of that visitor's row in <newdata>, and instead of adding a new row to <newdata> we update the visitor's row that is already there.

Maybe it's easier to see what is happening by looking at how the code actually runs.
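
Before walking through it, here is a quick check of the one thing the whole approach depends on: a tuple used as a dictionary key. This is only a small sketch; the host, city, OS and browser values in it are made up for illustration and are not from your data.

```python
# Hypothetical sample values, only for illustration.
seen = {}
key = ('example.com', 'Athens', 'Linux', 'Firefox')   # a 4-tuple identifying a visitor

print(key not in seen)     # True: the dictionary is still empty
seen[key] = 0              # remember that this visitor's row is at index 0
print(key not in seen)     # False: the visitor has now been recorded

# An equal tuple built separately finds the same entry, because
# dictionary lookup compares tuples by value, not by identity.
same_visitor = ('example.com', 'Athens', 'Linux', 'Firefox')
print(seen[same_visitor])  # 0
```

So two rows with the same four values always map to the same entry in <seen>, even though each row builds its own tuple.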

Suppose the data you get from your database is

    data = [
        ('mail14.ess.barracuda.com', 'Άγνωστη Πόλη', 'Windows', 'Explorer',
         'Direct Hit', 1, 'Σάββατο 26 Οκτ, 18:49'),
        ('209.133.77.165.T01713-01.above.net', 'Άγνωστη Πόλη', 'Windows', 'Explorer',
         'Direct Hit', 1, 'Σάββατο 26 Οκτ, 18:59'),
        ('mail14.ess.barracuda.com', 'Άγνωστη Πόλη', 'Windows', 'Explorer',
         'http://superhost.gr/', 1, 'Σάββατο 26 Οκτ, 18:48'),
        ]

(I am assuming here that hits is a number. If it comes back from the database as a string like '1', then "+=" would join the strings instead of adding them.)

When the first row of <data> is processed, <key> will be set to the 4-tuple:

    ('mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer')

Then "if key not in seen" is executed. This will look in the dictionary <seen> and see if there is an entry in it with a key that matches the tuple above. Since <seen> is still an empty dictionary, <key> is not in it and "if key not in seen" is true. So the first branch of the if statement runs:

    newdata.append( [host, city, useros, browser, [ref], hits, [visit]] )
    seen[key] = len( newdata ) - 1   # Save index (for 'newdata') of this row.

Now, <newdata> contains 1 row:

    [
      ['mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer',
       ['Direct Hit'], 1, ['Σάββατο 26 Οκτ, 18:49']],
    ]

And <seen> contains:

    { ('mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer'): 0 }

Note that the 0 value in the <seen> dictionary is the index of the corresponding row in <newdata>.

When the second row of <data> is processed, the same thing happens. <key> is the tuple

    ('209.133.77.165.T01713-01.above.net','Άγνωστη Πόλη','Windows','Explorer')

but since the only key in <seen> is

    ('mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer')

again the "not in" branch is executed.
When it runs this time it adds another row to <newdata>, so <newdata> now looks like:

    [
      ['mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer',
       ['Direct Hit'], 1, ['Σάββατο 26 Οκτ, 18:49']],
      ['209.133.77.165.T01713-01.above.net','Άγνωστη Πόλη','Windows','Explorer',
       ['Direct Hit'], 1, ['Σάββατο 26 Οκτ, 18:59']],
    ]

and adds another entry to <seen>, so that <seen> is now:

    { ('mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer'): 0,
      ('209.133.77.165.T01713-01.above.net','Άγνωστη Πόλη','Windows','Explorer'): 1 }

Again, the 1 is the index of the corresponding row in <newdata>.

Now the third row of <data> is processed. <key> is set to

    ('mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer')

This time when "if key not in seen" is executed, it is false because that key *is* in <seen>; it was added when the first <data> row was processed (look at <seen> above). So the statements

    rowindex = seen[key]
    newdata[rowindex][4].append( ref )
    newdata[rowindex][5] += hits
    newdata[rowindex][6].append( visit )

are executed. These statements will update the existing row for visitor <key> in <newdata>. The value of <key> is

    ('mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer')

so seen[key] is 0, and <rowindex> is set to that. newdata[0] is the row in <newdata> for the same visitor. The next three lines just update that row: the current <data> row's <ref> is appended to newdata[0]'s list of referrers, the hits field of newdata[0] is incremented by the current <data> row's hits, and the current <data> row's <lastvisit> time is appended to newdata[0]'s list of visits.

Afterwards, <newdata> looks like:

    [
      ['mail14.ess.barracuda.com','Άγνωστη Πόλη','Windows','Explorer',
       ['Direct Hit', 'http://superhost.gr/'], 2,
       ['Σάββατο 26 Οκτ, 18:49', 'Σάββατο 26 Οκτ, 18:48']],
      ['209.133.77.165.T01713-01.above.net','Άγνωστη Πόλη','Windows','Explorer',
       ['Direct Hit'], 1, ['Σάββατο 26 Οκτ, 18:59']],
    ]

And so on for the rest of the data.
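
If you want to try the walkthrough yourself, here is a self-contained sketch you can run. Some of it is assumption on my part: the function name merge_visits is made up, the "for" line that unpacks the seven fields of each row is a guess at what the surrounding loop looks like (the quoted code does not show it), and hits is taken to be an integer.

```python
def merge_visits(data):
    """Combine rows that share (host, city, useros, browser) into one row.

    merge_visits is a hypothetical name; the loop header is an assumption
    about the code surrounding the quoted fragment.
    """
    newdata = []   # merged rows, one per visitor
    seen = {}      # visitor tuple -> index of that visitor's row in newdata
    for host, city, useros, browser, ref, hits, visit in data:
        key = host, city, useros, browser
        if key not in seen:
            # New visitor: copy the row, turning ref and visit into lists.
            newdata.append([host, city, useros, browser, [ref], hits, [visit]])
            seen[key] = len(newdata) - 1
        else:
            # Known visitor: update the row that is already in newdata.
            rowindex = seen[key]
            newdata[rowindex][4].append(ref)
            newdata[rowindex][5] += hits
            newdata[rowindex][6].append(visit)
    return newdata, seen


# The three sample rows from the walkthrough (hits assumed to be integers).
data = [
    ('mail14.ess.barracuda.com', 'Άγνωστη Πόλη', 'Windows', 'Explorer',
     'Direct Hit', 1, 'Σάββατο 26 Οκτ, 18:49'),
    ('209.133.77.165.T01713-01.above.net', 'Άγνωστη Πόλη', 'Windows', 'Explorer',
     'Direct Hit', 1, 'Σάββατο 26 Οκτ, 18:59'),
    ('mail14.ess.barracuda.com', 'Άγνωστη Πόλη', 'Windows', 'Explorer',
     'http://superhost.gr/', 1, 'Σάββατο 26 Οκτ, 18:48'),
]

newdata, seen = merge_visits(data)
for row in newdata:
    print(row)
```

With these three rows it should print two rows, the first with two referrers, a hit count of 2 and two visit times, matching the final <newdata> shown above.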

When a new visitor is seen, a row is added to <newdata> and the visitor (identified by the tuple <key>) is saved in <seen> along with the index of that visitor's row in <newdata>. If the same visitor is seen again later in <data>, the corresponding row in <newdata> is updated rather than adding a new row to <newdata>.

Did that help make it clearer? It is a lot easier to write code than to explain it :-)