One way to go is to use Pandas, as mentioned before, and Seaborn for plotting (built on top of matplotlib).

I would prototype this first with a single file, not with the ~1000 files that you have.

Using the code that you have for parsing, add the values to a Pandas DataFrame (aka, a table).

# load pandas and create a 'date' object to represent the file date
# You'll have to "pip install pandas" to use it
import pandas as pd

file_date = pd.to_datetime('20210527')

# the data that you parsed, as a list of lists, with one inner
# list per line in your file
data = [
    ["alice", 123, file_date],
    ["bob", 4, file_date],
    ["zebedee", 9999999, file_date],
]

# then, load it as a pd.DataFrame
df = pd.DataFrame(data, columns=['name', 'kb', 'date'])

# print it
print(df)
      name       kb       date
0    alice      123 2021-05-27
1      bob        4 2021-05-27
2  zebedee  9999999 2021-05-27

Now, this is the key point: you can save the DataFrame to a file
so you don't have to parse the same input file over and over.

Pandas supports several on-disk formats; some are more suitable than others.

# I'm going to use the "parquet" format, which compresses really well
# and is quite fast. You'll have to "pip install pyarrow" to use it
df.to_parquet('df-20210527.pq')

Now you repeat this for all your files so you will end up with ~1000 parquet files.
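One small detail when looping over the files: the date is in the filename (home.20210527), so you can derive the 'date' column from it. A quick sketch — date_from_filename is just a helper name I made up:

```python
import pandas as pd

def date_from_filename(filename):
    # take the part after the last dot, e.g. 'home.20210527' -> '20210527',
    # and parse it as a date
    return pd.to_datetime(filename.split('.')[-1], format='%Y%m%d')

print(date_from_filename('home.20210527'))
```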

So, let's say that you want to plot some lines. You'll need to load those dataframes from disk.

You read each file, get a Pandas DataFrame for each, and then
"concatenate" them into a single Pandas DataFrame:

import glob

# read every saved dataframe back and stack them into one
# (assuming the 'df-*.pq' naming used above)
all_dfs = [pd.read_parquet(filename) for filename in glob.glob('df-*.pq')]
df = pd.concat(all_dfs, ignore_index=True)
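To see what the concatenation does without touching the disk, here is a minimal sketch with two tiny hand-made frames standing in for two of your daily files:

```python
import pandas as pd

# two tiny frames standing in for two daily files
df_a = pd.DataFrame([["alice", 123], ["bob", 4]], columns=["name", "kb"])
df_a["date"] = pd.to_datetime("20210526")

df_b = pd.DataFrame([["alice", 130], ["zebedee", 9999999]],
                    columns=["name", "kb"])
df_b["date"] = pd.to_datetime("20210527")

# stack them; ignore_index renumbers the rows 0..n-1
df = pd.concat([df_a, df_b], ignore_index=True)
print(df)
```

Note that the users don't have to match between files (bob only appears in one of them); concat just stacks the rows.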

Now, the plotting part. You said that you wanted to use matplotlib. I'll go one step further and use Seaborn (which is implemented on top of matplotlib).

import matplotlib.pyplot as plt
import seaborn as sns

# plot the mean of 'kb' per date as a point. For each point,
# plot a vertical line showing the "spread" of the values, and connect
# the points with lines to show the slope (changes) between days
sns.pointplot(data=df, x="date", y="kb")
plt.show()

# plot the distribution of the 'kb' values for each user 'name'
sns.violinplot(data=df, x="name", y="kb")
plt.show()

# plot the 'kb' per day for the 'alice' user
sns.lineplot(data=df.query('name == "alice"'), x="date", y="kb")
plt.show()
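If you want the numbers behind those plots (the per-date mean that pointplot draws, or the per-user series from the lineplot), you can compute them directly with groupby and query. A sketch with toy data standing in for your concatenated df:

```python
import pandas as pd

# toy data standing in for the concatenated df
df = pd.DataFrame({
    "name": ["alice", "bob", "alice", "bob"],
    "kb":   [100, 4, 200, 6],
    "date": pd.to_datetime(["20210526", "20210526",
                            "20210527", "20210527"]),
})

# mean 'kb' per date -- the same quantity pointplot estimates
per_date = df.groupby("date")["kb"].mean()
print(per_date)

# the per-day series for a single user, as in the lineplot above
alice = df.query('name == "alice"').set_index("date")["kb"]
print(alice)
```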

That's all, a very quick intro to Pandas and Seaborn.

Enjoy the hacking.

Thanks,
Martin.


On Thu, May 27, 2021 at 08:55:11AM -0700, Edmondo Giovannozzi wrote:
On Thursday, 27 May 2021 at 11:28:31 UTC+2, Loris Bennett wrote:
Hi,

I currently have around 3 years' worth of files like

home.20210527
home.20210526
home.20210525
...

so around 1000 files, each of which contains information about data
usage in lines like

name kb
alice 123
bob 4
...
zebedee 9999999

(there are actually more columns). I have about 400 users and the
individual files are around 70 KB in size.

Once a month I want to plot the historical usage as a line graph for the
whole period for which I have data for each user.

I already have some code to extract the current usage for a single user
from the most recent file:

for line in open(file, "r"):
    columns = line.split()
    if len(columns) < data_column:
        logging.debug("no. of cols.: %i less than data col", len(columns))
        continue
    regex = re.compile(user)
    if regex.match(columns[user_column]):
        usage = columns[data_column]
        logging.info(usage)
        return usage
logging.error("unable to find %s in %s", user, file)
return "none"

Obviously I will want to extract all the data for all users from a file
once I have opened it. After looping over all files I would naively end
up with, say, a nested dict like

{"20210527": { "alice" : 123, ..., "zebedee": 9999999},
"20210526": { "alice" : 123, "bob" : 3, ..., "zebedee": 9},
"20210525": { "alice" : 123, "bob" : 1, ..., "zebedee": 9999999},
"20210524": { "alice" : 123, ..., "zebedee": 9},
"20210523": { "alice" : 123, ..., "zebedee": 9999999},
...}

where the user keys would vary over time as accounts, such as 'bob', are
added and later deleted.

Is creating a potentially rather large structure like this the best way
to go (I obviously could limit the size by, say, only considering the
last 5 years)? Or is there some better approach for this kind of
problem? For plotting I would probably use matplotlib.

Cheers,

Loris

--
This signature is currently under construction.

Have you tried to use pandas to read the data?
Then you may try to add a column with the date and then join the datasets.
--
https://mail.python.org/mailman/listinfo/python-list