Charting using python, pandas, matplotlib and a public dataset

Charting using python, pandas, matplotlib and a public dataset
SHARE

When studying the charts above, please keep in mind the following constraints that have been implemented in the code:

  • There are no healthcare, retirement, vacation, and entertaining categories since the dataset does not have related information.
  • The charts assume a family of two earning parents and one child that goes to the international school since a bigger family would have meant no savings in some of the countries considered.
  • Rome, Italy option assumes no car ownership and the percentage of income for housing is set to 25% for Italy and 28% for other countries. This was done to ensure > 0 savings for Rome.

Why I wrote this post

Python has powerful data analysis and charting feature via the pandas and matplotlib libraries as any data science professional will attest. I wanted to target this post primarily to developers who are not yet into machine learning and want to strengthen their data analysis skills in python. I have used a public domain dataset: Global Cost of Living from kaggle. to generate the charts.

The python concepts covered by the code in this post:

Setting up the Jupyter notebook development environment in your machine

Jupyter notebook is a great interactive environment to learn/develop python code and also to export content for publishing to websites and blogs. Yes, we can develop code in traditional Python IDE as well. But getting familiar with Jupyter in your local machine first will allow for an easy transition to data science cloud platforms like kaggle.

To setup Jupyter notebook on your local machine see:

Once you open the Jupyter notebook in your machine you can copy the code that follows below into the cells and run them. In the code below:

  • '%run dataAnalysis.ipynb' is referring to a 2nd notebook. You need to copy content from dataAnalytics.ipynb into a file named dataAnalysis.ipynb and save it in the appropriate location in your local environment.
  • You will need to uncomment the pip install statements in the first cell of dataAnalysis.ipynb and run it in the notebook to ensure that all dependencies are installed. You can comment them after you install the dependencies.
  • You will need to copy Global Cost of Living from kaggle onto your local environment.

Hope this post was useful to you. You can leave your comments in the Enter comments here.. box at end of this page and hit Submit Reply.

from iteround import saferound
import matplotlib.pyplot as plt
import seaborn as sns

%run dataAnalysis.ipynb

dataAnalysis = DataAnalysis('cost-of-living_v2.csv')
dataAnalysis.ShowG7CountriesComparison()

dataAnalysis.ipynb

import pandas as pd
import numpy as np

#!pip install matplotlib
#!pip install seaborn
#!pip install mortgage
#!pip install iteround

#!pip install ipympl

# If using JupyterLab 2
#!pip install nodejs
#!jupyter labextension install @jupyter-widgets/jupyterlab-manager
#!jupyter labextension install jupyter-matplotlib
from mortgage import Loan
from iteround import saferound

## For capital city in each of the World's biggest 6 economies by GDP
### Distribute Avg monthly net salary for a family of 2 adults, 1 school going kid (assume private Intl. primary school ):
#### 1. Apartment mortgage outside of center
#### 2. Groceries
#### 3. Utilities
#### 4. Transportation
#### 5. Healthcare
#### 6. Schooling
#### 7. Misc. Goods & Services
#### 8. Auto insurance
#### 9. Home insurance
#### 10. Savings
#### Depict the above in a stacked bar chart
#### Depict comparison of savings potential across the G7 capitals in a bar chart.
class DataAnalysis:
    
    # dictionary with G7 countries and their capitals
    capitals_G7 = [('United States','Washington'),('Japan','Tokyo'), ('Germany','Berlin'), 
               ('France','Paris'), ('United Kingdom','London'),
               ('Italy','Rome'), ('Canada','Ottawa')]
    
    # map the column names
    MealInexpensive = 'x1'
    Meal2People = 'x2'
    McMeal = 'x3'
    BeerDomesticRestaurant = 'x4'
    BeerImportedRestaurant = 'x5'
    Cappucino = 'x6'
    Coke = 'x7'
    Water = 'x8'
    Milk = 'x9'
    Bread = 'x10'
    Rice = 'x11'
    Eggs = 'x12'
    Cheese = 'x13'
    Chicken = 'x14'
    Beef = 'x15'
    Apples = 'x16'
    Banana = 'x17'
    Oranges = 'x18'
    Tomato = 'x19'
    Potato = 'x20'
    Onion = 'x21'
    Lettuce = 'x22'
    Water = 'x23'
    Wine = 'x24'
    BeerDomesticMarket = 'x25'
    BeerImportedMarket = 'x26'
    Cigarettes = 'x27'
    LocalTransport1Way = 'x28'
    MonthlyPass = 'x29'
    TaxiStart = 'x30'
    Taxi1km = 'x31'
    Taxi1hrWaiting = 'x32'
    Gasoline = 'x33'
    CarVolkswagenGolf = 'x34'
    CarToyotaCorolla = 'x35'
    Utility = 'x36'
    PrepaidMobile = 'x37'
    InternetAndMobile = 'x38'
    FitnessClub = 'x39'
    TennisCourtRent = 'x40'
    Cinema = 'x41'
    Preschool = 'x42'
    PrimarySchool = 'x43'
    Jeans = 'x44'
    SummerDress = 'x45'
    NikeRunningShoes = 'x46'
    LeatherBusinessShoes = 'x47'
    Apt1BedroomCityCtr = 'x48'
    Apt1BedroomNoCityCtr = 'x49'
    Apt3BedroomCityCtr = 'x50'
    Apt3BedroomNoCityCtr = 'x51'
    PricePerSqMtCityCtr = 'x52'
    PricePerSqMtNoCityCtr = 'x53'
    AvgMonthlyNetSalary = 'x54'
    MortgageInterest = 'x55'
    df = None
    dfForStackedbar = None
    
    # constructor
    def __init__(self, csvFile):
        self.df = pd.read_csv(csvFile)
        self.df.set_index(['country','city'], inplace=True)
    
    # computer monthly grocery. The logic is based on Google searches
    # of healthy serving sizes. The divisors below like 1000 and 500 are based
    # on the data disctionary documentation. 
    # All computations are based on metric system (serving sizes are in grams, milliliters)
    def MonthlyGrocery(self, country, city, familySize):
        cupSizeInLiters = 0.24
        daysInMonth = 30.42  # 365 distributed equally into 12 months
        milkServingSize = 3 * cupSizeInLiters
        breadServingSize = 0.86 # 6 slices per week
        riceServingSize = 0.66
        cheeseServingSize = 40
        chickenServingSize = 105
        beefServingSize = 85
        applesServingSize = 54
        orangesServingSize = 152
        bananasServingSize = 250
        tomatoesServingSize = 75
        potatoesServingSize = 173
        onionsServingSize = 150
        lettuceServingSize = 75
        wineServingSize = 20/daysInMonth
        beerServingSize = 1.4/daysInMonth
        
        # (item, divisor): (servingSize, # of persons)
        dictItemsServings = {(self.Milk, 1.0): (milkServingSize,familySize), (self.Bread, 500): (breadServingSize,familySize), 
                     (self.Rice, 1000): (riceServingSize,familySize),(self.Cheese,1000): (cheeseServingSize, familySize), (self.Chicken,1000): (chickenServingSize,familySize),
                     (self.Beef,1000) : (beefServingSize,familySize),(self.Apples, 1000): (applesServingSize,familySize), (self.Oranges,1000): (orangesServingSize, familySize),
                     (self.Banana,1000): (bananasServingSize, familySize),(self.Tomato,1000): (tomatoesServingSize, familySize),(self.Potato,1000): (potatoesServingSize, familySize),
                     (self.Onion,1000): (onionsServingSize, familySize),(self.Lettuce,680): (lettuceServingSize, familySize),# 1 head lettuce = ~ 680gms
                     (self.Water,1): (1, familySize),(self.Wine,25): (wineServingSize, 2),(self.BeerDomesticMarket, 0.5):(beerServingSize, 2)}
    
        # list comprehenion with compute for each grocery item
        groceryItemCostList = [self.df.loc[(country, city),k[0]]/k[1] * v[0] * v[1] for k,v in dictItemsServings.items()]
    
        return np.sum(groceryItemCostList) * daysInMonth
        
    # Utilities - Electricity, Heating, Cooling, Water, Garbage
    def Utilities(self, country, city, homeSizeInSqMt):
        basic = self.df.loc[(country, city), self.Utility]/85 * homeSizeInSqMt
        return basic
    
    def InternetAccess(self, country, city):
        return self.df.loc[(country, city), self.InternetAndMobile]
    
    def Mobile(self, country, city, numAdults):
        daysInMonth = 30.42  # 365 distributed equally into 12 months
        return self.df.loc[(country, city), self.PrepaidMobile] * numAdults * 25 * daysInMonth
    
    # Assume fitness club for adults. Tennis 2 hrs/week only for kids. Assume Fitness club includes tennis/games for adults
    def Fitness(self, country, city, adults, kids):
        fitnessClub = self.df.loc[(country,city), self.FitnessClub] * adults
        tennis = self.df.loc[(country,city), self.TennisCourtRent] * 2 * kids

        return (fitnessClub + tennis)

    # only consider cinema every weekas entertainment since that is the only data available
    def Entertainment(self, country, city, familySize):
        movieTickets = self.df.loc[(country, city), self.Cinema] * familySize * 4
        # McMeal is the stand-in for coke popcorn etc
        snacks = self.df.loc[(country, city), self.McMeal] * familySize * 4
        return movieTickets + snacks
    
    def EatingOut(self, country, city, adults, kids):
        # assume eating out once/month + McMeal once/week
        restaurant = self.df.loc[(country, city), self.Meal2People] * (adults + kids) / 2
        MacMeal = self.df.loc[(country, city), self.McMeal] * (adults + kids)
        beverage =  self.df.loc[(country, city), self.Cappucino] * adults
        drinksAdults = (self.df.loc[(country, city), self.BeerDomesticRestaurant] + 
                        self.df.loc[(country, city), self.BeerImportedRestaurant])/2 * adults                  
        drinksKids = self.df.loc[(country, city), self.Coke] * kids

        return (restaurant + drinksAdults + drinksKids) + (beverage + MacMeal) * 4
    
    # assume kids go to International school
    def School (self, country, city, kids):
        return self.df.loc[(country, city), self.PrimarySchool] * kids / 12

    # will compute both owning and rent to ensure no more than 28% of income
    def Housing(self,country,city,earningAdults,kids,durationYears=20):
        size3BedroomSqMt = 140
        size2BedroomSqMt = 105
        size1BedroomSqMt = 70

        housingOptions = [(True,self.PricePerSqMtCityCtr, size3BedroomSqMt ), (True, self.PricePerSqMtNoCityCtr,size3BedroomSqMt),
                          (True,self.PricePerSqMtCityCtr, size2BedroomSqMt ), (True, self.PricePerSqMtNoCityCtr,size2BedroomSqMt),
                          (False,self.Apt3BedroomCityCtr, None), (False, self.Apt3BedroomNoCityCtr, None),     
                          (True,self.PricePerSqMtCityCtr, size1BedroomSqMt ), (True, self.PricePerSqMtNoCityCtr,size1BedroomSqMt),
                          (False, self.Apt1BedroomCityCtr, None),(False, self.Apt1BedroomNoCityCtr, None)]

        percentageOfIncomeForHousing = 0.25 if (country == 'Italy') else 0.28

        # since the dataset only has the expensive international school option, take school expense out of 
        # income before computing housing
        maxPortionOfIncomeForHousing = (self.df.loc[(country,city),self.AvgMonthlyNetSalary] * earningAdults - \
                                        self.School(country, city, kids)) * percentageOfIncomeForHousing

        #print('Max for housing:{}'.format(maxPortionOfIncomeForHousing))
        rent = True
        bedrooms = 1

        for own,opt,size in housingOptions:
            if own:
                price = self.df.loc[(country,city),opt]    
                sqMt = size    
                loanAmount = sqMt*price
                rate = self.df.loc[(country,city),self.MortgageInterest]/100
                loan = Loan(principal=loanAmount, interest = rate, term=durationYears*12)  
                payment = float(loan.schedule(1).payment)            
                bedrooms = 3 if size == size3BedroomSqMt else 2 if size == size2BedroomSqMt else 1
                rent = False
                #print('Mortgage payment:{} for {} bedrooms'.format(payment, bedrooms))
            else:
                payment = float(self.df.loc[(country,city),opt])
                bedrooms = 3 if opt == self.Apt3BedroomCityCtr else 1
                rent = True
                #print('Rent:{} for {} bedrooms'.format(payment, bedrooms))

            if (payment < maxPortionOfIncomeForHousing):
                    break;

        return payment, rent, bedrooms

    def CarLoan(self, country, city):
        loanAmount = self.df.loc[(country,city),self.CarVolkswagenGolf] if self.df.loc[(country,city),self.CarVolkswagenGolf] < \
        self.df.loc[(country,city),self.CarToyotaCorolla] else self.df.loc[(country,city),self.CarToyotaCorolla]

        # assume auto loan rate = mortgage rate + 1%
        rate = (self.df.loc[(country,city),self.MortgageInterest] + 1)/100
        loan = Loan(principal=loanAmount, interest = rate, term=5*12)
        payment = loan.schedule(1).payment
        #print((loanAmount, payment))
        return float(payment)

    # https://www.odyssee-mure.eu/publications/efficiency-by-sector/transport/distance-travelled-by-car.html
    def Transportation(self,country,city, familySize):
        kmsPerLiter = 11
        carsKms = {'United States': (2,19600), 'Canada': (2, 17000), 'United Kingdom': (1, 11800),
                  'France': (1, 12200), 'Germany':(1,13600), 'Italy':(0,8500) , 'Japan': (1,6730)}
        carLoan = 0.0
        gasolineExpense = 0.0
        monthlyPass = self.df.loc[(country,city), self.MonthlyPass]
        #print(monthlyPass)

        if (carsKms.get(country) != None):
            carLoan = carsKms[country][0] * self.CarLoan(country,city)
            #print(carLoan)
            pricePerLiter = self.df.loc[(country,city), self.Gasoline]
            #print(pricePerLiter)
            pricePerKm = pricePerLiter/kmsPerLiter
            #print(pricePerKm)
            gasolineExpense = pricePerKm * (carsKms[country][0] * carsKms[country][1])/12
            #print(gasolineExpense)
        else:
            monthlyPass *= familySize # in non car owning countries assume each family member will need pass

        return (monthlyPass + carLoan + gasolineExpense)
    
    # this includes new clothes in summer and winter, new shoes once/year, sports- once/year
    def ClothingAndShoes(self,country, city, adults, kids):
        familySize = adults+kids
        items = [self.LeatherBusinessShoes,self.NikeRunningShoes,self.Jeans,self.SummerDress]
    
        # list comprehenion 
        itemsList = [self.df.loc[(country, city), item] for item in items]
    
        return np.sum(itemsList) / 12
    
    def Compute(self):
        # construct the template for spending categories as a dictionary 
        dictForStackedbar = {'country': [item[0] for item in self.capitals_G7], 'city': [item[1] for item in self.capitals_G7],\
                   'Income': [0] * 7,'Housing': [0] * 7,'Rented':[True] * 7, 'Bedrooms': [1] * 7, 'Grocery': [0] * 7,'Utilities': [0] * 7,\
                   'School': [0] * 7,'Transportation': [0] * 7,'Fitness':[0] * 7,'InternetAndMobile':[0] * 7, \
                   'Discretionary': [0] * 7, 'Savings': [0] * 7}

        homeSizes = {1: 70, 2: 105, 3: 140} # bedrooms:home size in SqMt

        # create dataframe from dictionary
        self.dfForStackedbar = pd.DataFrame(dictForStackedbar)
        self.dfForStackedbar.set_index(['country', 'city'], inplace=True)

        # populate each of the columns with values
        for country, city in self.dfForStackedbar.index:    
            totalIncome = self.df.loc[(country, city), self.AvgMonthlyNetSalary] * 2
            self.dfForStackedbar.loc[(country, city),'Income'] = totalIncome # both parents earning

            payment, rented, bedrooms = self.Housing(country,city,2,1)
            self.dfForStackedbar.loc[(country, city),'Housing'] = payment

            self.dfForStackedbar.loc[(country, city),'Rented'] = rented 
            self.dfForStackedbar.loc[(country, city),'Bedrooms'] = bedrooms 

            monthlyGrocery = self.MonthlyGrocery(country, city, 3)
            self.dfForStackedbar.loc[(country, city),'Grocery'] = monthlyGrocery

            utilities = self.Utilities(country, city, homeSizes[bedrooms])
            self.dfForStackedbar.loc[(country, city),'Utilities'] =utilities

            school =  self.School(country, city, 1)
            self.dfForStackedbar.loc[(country, city), 'School'] = school

            internetAccess = self.InternetAccess(country, city)
            mobile =  self.Mobile(country, city,2)
            self.dfForStackedbar.loc[(country, city), 'InternetAndMobile'] = internetAccess + mobile

            transportation = self.Transportation(country, city,3)
            self.dfForStackedbar.loc[(country, city), 'Transportation'] = transportation

            fitness =  self.Fitness(country, city,2,1)
            self.dfForStackedbar.loc[(country, city), 'Fitness'] = fitness

            discretionary = self.Entertainment(country, city,3) + self.EatingOut(country, city,2,1) + self.ClothingAndShoes(country, city,2,1)
            self.dfForStackedbar.loc[(country, city), 'Discretionary'] = discretionary

            self.dfForStackedbar.loc[(country, city), 'Savings'] = 0
        
        # create a filter for excluding unnecessary columns for savings computation
        filterMask = (self.dfForStackedbar.columns != 'Income') & (self.dfForStackedbar.columns != 'Rented') & (self.dfForStackedbar.columns != 'Bedrooms')
        
        # from the income substract all items except 'Income', 'Rented' and 'Bedrooms'
        self.dfForStackedbar['Savings'] = [self.dfForStackedbar.loc[(country, city),'Income'] - \
                                        np.sum(self.dfForStackedbar.loc[(country, city),filterMask]) \
                                        for country, city in self.dfForStackedbar.index]
        
        #print(self.dfForStackedbar)
    
    def ShowG7CountriesComparison(self):
        
        # prepare the data for presentation
        self.Compute()
        
        # sort by income Descending
        self.dfForStackedbar.sort_values(by='Income', ascending=False, inplace=True)
        # create dataset from a dictionary with Months as the only column. This is 
        # display the 2nd chart
        dictForSavingsAccum = {'Months': [0] * 7}
        dfSavingsAccum = pd.DataFrame(dictForSavingsAccum)

        # simple division to figure out the months needed to accumulate savings=1 month's income
        months =  np.round(self.dfForStackedbar['Income'] / self.dfForStackedbar['Savings'])
        
        #list comprehension to create the values for the months column
        months = [int(m) for m in months]
        dfSavingsAccum['Months'] = months
        dfSavingsAccum.sort_values(by='Months', inplace=True)
        
        # create 2 subplots: 2,1 means 2 rows and 1 column. constrained_layout=True allows for spacing
        # between the 2 subplots
        fig, axs = plt.subplots(2, 1,
                       constrained_layout = True)
        
        # remove columns not related to spending
        dfPlot = self.dfForStackedbar.drop(['Income','Rented','Bedrooms'], axis=1)
        #define Seaborn color palette to use. 
        colors = sns.color_palette('colorblind')[0:len(dfPlot.columns)]

        # plot the stacked barchart
        ax = dfPlot.plot(kind='bar', ax=axs[0],figsize=(12,16), stacked=True,color=colors,xlabel='Capitals')

        # output the % in each of individual segments of each bar
        for c in ax.containers:
            totalIncome = np.sum([v.get_height() for v in c])
            pct = saferound([v.get_height()/totalIncome*100 for v in c], places=0)                
            ax.bar_label(c, labels=['%{}'.format(int(p)) for p in pct],label_type='center')    

        ax.set_title('Income distribution comparison across G7 capitals',fontweight='bold')
        ax.set_ylabel('Family Income distributed across categories',fontweight='bold');
        ax.set_xlabel('Capitals',fontweight='bold');
        
        # make the x axis label display at 45 degree angle
        xticklabels = ax.set_xticklabels([city for country,city in dfPlot.index],rotation=45)

        # now plot the histogram about savings

        ax = dfSavingsAccum.plot(kind='bar',ax=axs[1],figsize=(12,16))

        ax.set_title("Months to accumulate savings worth one month's income",fontweight='bold')
        ax.set_xlabel('Capitals',fontweight='bold')
        ax.set_xticks(range(0,len(dfPlot.index)))
        xticklabels = ax.set_xticklabels([city for country,city in dfPlot.index], rotation=45)

        # turn of the ticks to have a minimal chart since values on top of each bar will suffice
        ax.tick_params(axis='x', which='both', bottom=False, top=False)
        ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)

        # remove the legend since it provides no additional information
        ax.get_legend().remove()
        
        #iterate over bars and output the height of the bar at appropriate location on top
        for container in ax.containers:
            for i, child in enumerate(container.get_children()):
                monthsToSave = dfSavingsAccum['Months'].tolist()
                plt.text((child.get_x() + child.get_width()/2), child.get_height() -0.5,str(monthsToSave[i]), color='white')
                
        # save to jpg
        plt.savefig("g7Comparison.png")