Data Ingestion with Pandas

Data Ingestion with Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. Pandas make it easy to load, manipulate and clean messy data. Source)

In this tutorial, We would talk about data ingestion using pandas , a fundamental step in any Data Science project, we'll go through ways you can load different data format using pandas. At the end of this tutorial, you will know how to:

  • Load and modify Flat Files

  • Load and modify Excel Files

Working with Flat Files

Data can be stored in many ways, flat files is common of them. Flat files are easy to produce format with data being stored as plain text. Every line in the files are represented as rows and the columns are separated by characters called Delimiter, the delimiter are usually comma. These files are called comma-separated-values (csv) files. Pandas provides a way to load csv files using pands.read_csv() function.

import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/umoja/Train.csv')

pic2.png

From the image above, we imported the pandas library using import pandas as pd, this line of code will allow us to use pandas in our data science project. Then we declared a variable called train which will hold the data we want to load with the pd.read_csv() function. train.head() allow us to print the first five rows of the data.

Lets get the shape of the data with data.shape

pic3.png

We can see from the image above that our data set consist of 12079 rows and 14 columns. There are times when we do not want all the columns or rows in the data, we just need some specific columns or rows. We can limit the number columns or rows.

Limiting Columns and rows

Choose the columns to load with the usecols keyword argument, The usecols can take a list of either all column name or all column number to import, it also allows you to use function to select the columns to load.

col_names = ["Policy Start Date",'Gender','Age','Car_Category',
             'Subject_Car_Colour','target']

df = pd.read_csv('/content/drive/MyDrive/umoja/Train.csv',
                 usecols=col_names)
df.head(4)

pic4.png

We started by specifying the name of the columns we need in the data and put them in a list called col_names then we load the data using pandas read_csv() specifying the data path. we now use the usecols keyword argument to tell pandas to load only the columns we specify in the col_names list. We can also use the column number to reference the columns we want, it will output the same result.

col_num = [1,3,4,7,8,13]
df = pd.read_csv("/content/drive/MyDrive/umoja/Train.csv",
                 usecols=col_num)
df.head(4)

pic5.png

Another option is to reduce the number of rows to be imported using nrows. When reading the files, we pass the in the integer of the maximum number of rows we want, checking the output we can see the result came back True

pic6.png

nrows can be powerful when combined with skiprows, skiprows allows us to skip particular rows, it accepts a list of rows to skip, a number of rows or a function to filter rows.

Note that pandas automatically make the first row imported the header. So if we want to skip the row with column names, we MUST specify that header=None.

data = pd.read_csv('/content/drive/MyDrive/umoja/Train.csv', usecols=col_num, 
                   nrows=3000,
                   skiprows=500,
                   header=None)

pic7.png

Notice there are no columns names in the table. We can fix that using the names keyword argument to specify the column names, As you would guess the names keyword argument takes a list of columns name we want to use.

data = pd.read_csv('/content/drive/MyDrive/umoja/Train.csv', usecols=col_num, 
                   nrows=3000,
                   skiprows=500,
                   header=None,
                   names=col_names)
data.head()

pic8.png

when using the names keyword argument, the list must include the name of every column in the data. If you want to rename just few columns, you will have to do that after import.

Working with Excel Files

Excel files store data in worksheets that contain cells arranged as a grid of rows and columns. Excel files may have multiple worksheets containing data as well as charts. we use pandas function read_excel() to load excel files. When there are multiple worksheets in the file, read_excel() pull data from the first sheet by default. This can be change with the sheet_name keyword argument. sheet_name takes the name of the sheet we want to pull from the workbook or the number specifying the position of the sheet name in the workbook. Lets get an excel file with more than one worksheet and load it.

df1 = pd.read_excel('/content/Data Refresh Sample Data.xlsx')
df1.head()

By default pandas load the data from the first worksheet. Lets get the name of the worksheets.

pic11.png From the image, we know we have 3 worksheet in the excel file. We can pull data from the second sheet by referencing the sheet name or its indexing position in the workbook.

Indexing starts from Zero

df1 = pd.read_excel('/content/Data Refresh Sample Data.xlsx', sheet_name=2)
df1.head()

pic12.png

Conclusion

We started by defining what pandas is, we look at how we can import the two common type of data formats; flat files and excel files. We also look at how we can modifying the imported data. Pandas provides us with ways to load and manipulate data. Check pandas Pandas Documentation for more.