Demographic Analysis with Python

programming
Published

April 26, 2018

map of cemetery blocks
Before Query Table Join

The goal of this project is to analyze the demographics of donors to a non-profit in West Michigan with Python. Demographic data was acquired from the U.S. Census Bureau’s American Fact Finder, donor data through my geocoding project, and the shapefiles were found through the Census’s TIGER shapefiles. I looked for the following data:

First, I imported the relevant modules and set the current working directory:

import pandas as pd
import geopandas as gpd
import numpy as np
import os, subprocess, shutil, pygal
from shapely.geometry import *
from functools import reduce
from pygal.style import Style
os.chdir('C://my//direc//docs')
view raw demog-import.py hosted with ❤ by GitHub

Next, the spreadsheets are turned into a pandas dataframe and the column headers are dropped.

# CSVs of demographic data
ageGend = pd.read_csv('dataInput//michigan_age_gender.csv', header=1)
edu = pd.read_csv('dataInput//michigan_education.csv', header=1)
income = pd.read_csv('dataInput//michigan_median_income.csv', header=1)
view raw demog-csv.py hosted with ❤ by GitHub

There are only certain columns that I am interested in for each spreadsheet. All other columns are dropped through this function (columns were determined manually).

def colDrop(df, col):
ncol = [name for count, name in enumerate(df.columns) if count not in col]
df.drop(ncol, axis=1, inplace=True)
colDrop(ageGend, [1,3,5,7,9,15,21,27,33,39,45,51,57,
63,69,75,81,87,93,99,105,111,177,183])
colDrop(edu, [1,15,27,39,51,75,87,99,111,123,135,147,
195,207,231,243,267,279,303,315])
colDrop(income, [1,11, 19, 27, 35, 43, 51, 59, 67, 75, 83,91])
view raw demog-drop.py hosted with ❤ by GitHub

After the demographic data is formatted correctly, it can be joined with the shapefile of census tracts.

# convert census tract shapefile to a GeoDataFrame
gdf = gpd.read_file('dataInput//censusTract//tl_2017_26_tract.shp')
gdf.rename(index=str, columns={"GEOID": "Id2"}, inplace=True)
gdf['Id2'] = gdf['Id2'].astype('int64') # convert column to an integer before merge
sheets = [ageGend, edu, income, gdf]
sheetsMerge = reduce(lambda left,right: pd.merge(left,right,on='Id2'), sheets)
view raw demog-shp.py hosted with ❤ by GitHub

In order to perform demographic analysis on the non-profit donors, the spreadsheet must be turned into a geopandas GeoDataFrame. This is done by reading in the spreadsheet, dropping any rows that were unsuccessfully geocoded, and then converting those columns into a Shapely Point object. Now that the dataframe has a correctly formatted geometry column, it can be converted into a GeoDataFrame.

df = pd.read_csv('dataInput//donors.csv')
df = df.dropna(subset=['lat']).copy() # avoid the SettingWithCopyWarning
lng_list = [i for i in df.lng]
lat_list = [i for i in df.lat]
geometry = [Point(i) for i in zip(lng_list, lat_list)]
df.drop(['serial_id','lng','lat'], axis=1, inplace=True)
donors = gpd.GeoDataFrame(df, geometry=geometry)
view raw demog-gdf.py hosted with ❤ by GitHub

The donors and demographics can then be spatially joined into one GeoDataFrame. The end result will be donors in Michigan (this is where a large majority of the donors are).

miDonors = gpd.sjoin(donors, demographics, how='inner', op='intersects')
view raw demog-sjoin.py hosted with ❤ by GitHub

ANALYSIS

I used the pygal library to make the following graphs. These graphs give a general impression of the non-profit donors and should be weighted according to where the donors are located. For example, if a large share of donors live in Ottawa County, their information will be weighted more heavily than the one donor in Oscoda County.

count = 9131 # number of data points
ageGroups = miDonors.iloc[:,21:35].astype('float64').sum() / count
# the variable custom_style is defined elsewhere in the full script
bar_chart = pygal.HorizontalBar(style=custom_style,
value_formatter=lambda x: '{}%'.format(x),
tooltip_fancy_mode=False)
bar_chart.title = 'Age Groups'
for index, value in ageGroups.iteritems():
value = round(value,2)
bar_chart.add(index, value)
bar_chart.render_to_file('age.svg')
view raw demog-age.py hosted with ❤ by GitHub

Income Groups

ageGroups = miDonors.iloc[:,56:65].astype('float64').sum() / count
bar_chart = pygal.HorizontalBar(style=custom_style,
value_formatter=lambda x: '{}%'.format(x),
tooltip_fancy_mode=False)
bar_chart.title = 'Income Groups'
for index, value in ageGroups.iteritems():
value = round(value,2)
bar_chart.add(index, value)
bar_chart.render_to_file('income.svg')
view raw demog-income.py hosted with ❤ by GitHub

Educational Attainment in Ages 25 and Over

education25 = miDonors.iloc[:,41:47].astype('float64').sum() / count
denom = education25.sum()
for index,value in education25.iteritems():
education[index] = round((value/denom)*100,2)
dark_lighten_style = LightenStyle('#00213e')
bar_chart = pygal.Bar(style=custom_style,
value_formatter=lambda x: '{}%'.format(x),
tooltip_fancy_mode=False)
bar_chart.title = 'Educational Attainment - Ages 25 and over'
for index, value in education25.iteritems():
value = round(value,2)
bar_chart.add(index, value)
bar_chart.render_to_file('education25.svg')

Finally, a heatmap is made with the Python library folium. This library allows Leaflet webmaps to be made from Python.

heat_map = folium.Map(location=[43.455737, -84.760162], zoom_start=7,
max_zoom=14, tiles='cartodbpositron')
with open('dataInput//donors.csv','r', newline='') as f:
reader = csv.reader(f)
next(reader) # skip column headers
data = [[float(row[13]), float(row[12])] for row in reader if row[12] != '']
heat = plugins.HeatMap(data, blur=25)
heat_map.add_child(heat)
heat_map.save('heatmap.html')
view raw demog-folium.py hosted with ❤ by GitHub