How to make a data frame from different CSV files in Python

AI Maverick
3 min readMay 13, 2022

--

Imagine you trained separate machine learning models or one model over different datasets and have independent outputs for various datasets. Now, you need to have a final or intermediate report from the files before going to the next steps. Therefore, you have to analyze all the results.

Is it possible to check them one by one? No, of course not. It is a time consuming process to check different files and take a note. So, what would be the alternative? The pivot table could be another way to check the files but how do store different files in one data frame?

In this scenario, we assume that we have different files with the same number of rows and a different number of columns. The goal would be the final Data frame that contains all the values.

First, you need to read the files you want to analyze from the directory. For this matter, you may use the method from the os library.

import os
for file in os.listdir('.'):
print(file)

This will print all the stored files once for you. But what if we are looking for specific files among all the outputs? I have the answer here, the fnmatch library of Python.

import fnmatchfor file in os.listdir('.'):
if fnmatch.fnmatch(file, '*_yourFileName.csv'):
print(file)

In the above code, the `yourFileName` is a string including the lookup value. And it will print the file name you’re looking for it. It is time to save the file as a data frame. For this purpose, we use the pandas library and its to_csv() method.

for file in os.listdir('.'):
if fnmatch.fnmatch(file, '*_yourFileName.csv'):
df = pd.read_csv(file)

But this only stores the last found file. So I suggest building an empty Data frame with the proper size we are looking for first and then completing it for each file.

To build the empty data frame, I consider the following code. First, I wrote a dictionary to save the file names and their size. The data frame is built with a size of the maximum number of columns in the stored files and rows of the files. The index name is the name of the stored files. The index name is the name of the stored files.

index_ = {'name': [], 'shape': []}for file in os.listdir('.'):
if fnmatch.fnmatch(file, '*_yourFileName.csv'):
index_['shape'].append((pd.read_csv(file).shape[1]))
index_['name'].append(file)
max_cl = max(index_['shape'])
file = index_['name'][index_['shape'].index(max_cl)]
dic = {'key': pd.read_csv(file).columns}
dic = list(pd.read_csv(file).columns)
row_names = [index_['name'][i][2:-9] for i in range(len(index_['name']))]
df = pd.DataFrame(np.zeros((len(index_['name']), max_cl)), columns=dic, index=row_names)

The final step would be storing the values in the proper axis of the data frame. So, let’s enumerate over the file names list of our dictionary and fit the data in a data frame. We have different data frames with various columns, therefore, we have to find the proper column id. To do so, I used the index of the list item. The list includes the data frame column names.

for _, file in enumerate(index_['name']): 
ave = (pd.read_csv(file)).mean()
for i, j in enumerate(ave.index):
df.iloc[_, dic.index(j)] = ave.iloc[i]

In the end, you will have the data frame we were looking for.

You can find similar codding tips on my kaggle and GitHub.

--

--