
The goal of this project is to analyze the demographics of donors to a non-profit in West Michigan with Python. Demographic data came from the U.S. Census Bureau's American FactFinder, donor data from my geocoding project, and the census-tract boundaries from the Census Bureau's TIGER shapefiles. I looked for the following data:
- Median Household Income in the Past 12 Months (a widely accepted measure of income)
- Age and Gender
- Educational Attainment
First, I imported the relevant modules and set the current working directory:
```python
import pandas as pd
import geopandas as gpd
import numpy as np
import os, subprocess, shutil, pygal
from shapely.geometry import Point
from functools import reduce
from pygal.style import Style

os.chdir('C://my//direc//docs')
```
Next, each spreadsheet is read into a pandas DataFrame. Passing `header=1` skips the first, machine-readable header row of the American FactFinder exports and uses the second, human-readable row as the column names.
```python
# CSVs of demographic data
ageGend = pd.read_csv('dataInput//michigan_age_gender.csv', header=1)
edu = pd.read_csv('dataInput//michigan_education.csv', header=1)
income = pd.read_csv('dataInput//michigan_median_income.csv', header=1)
```
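A minimal sketch of what `header=1` does, using an in-memory toy CSV (the column names and values here are invented to mimic an American FactFinder export):

```python
import io
import pandas as pd

# two header rows, as in an American FactFinder export (toy data)
raw = io.StringIO(
    "GEO.id,HD01_VD01\n"    # machine-readable header row (row 0)
    "Id2,Total\n"           # human-readable header row (row 1)
    "26001950100,1200\n"
)
df = pd.read_csv(raw, header=1)  # row 1 becomes the column names
print(list(df.columns))          # the human-readable names survive
```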
Only certain columns in each spreadsheet are of interest. All others are dropped with the function below (the column positions to keep were determined manually).
```python
def colDrop(df, col):
    # keep only the columns whose positional index appears in col
    ncol = [name for count, name in enumerate(df.columns) if count not in col]
    df.drop(ncol, axis=1, inplace=True)

colDrop(ageGend, [1,3,5,7,9,15,21,27,33,39,45,51,57,
                  63,69,75,81,87,93,99,105,111,177,183])
colDrop(edu, [1,15,27,39,51,75,87,99,111,123,135,147,
              195,207,231,243,267,279,303,315])
colDrop(income, [1,11,19,27,35,43,51,59,67,75,83,91])
```
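To sanity-check `colDrop`, here it is applied to a small throwaway DataFrame (the data is invented for illustration):

```python
import pandas as pd

def colDrop(df, col):
    # keep only the columns whose positional index appears in col
    ncol = [name for count, name in enumerate(df.columns) if count not in col]
    df.drop(ncol, axis=1, inplace=True)

toy = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})
colDrop(toy, [1, 3])       # keep positions 1 and 3
print(list(toy.columns))   # only 'b' and 'd' remain
```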
After the demographic data is formatted correctly, it can be joined with the shapefile of census tracts.
```python
# convert census tract shapefile to a GeoDataFrame
gdf = gpd.read_file('dataInput//censusTract//tl_2017_26_tract.shp')
gdf.rename(index=str, columns={"GEOID": "Id2"}, inplace=True)
gdf['Id2'] = gdf['Id2'].astype('int64')  # convert column to an integer before merge

sheets = [ageGend, edu, income, gdf]
# pd.merge returns a plain DataFrame, so wrap the result back into a GeoDataFrame
demographics = gpd.GeoDataFrame(
    reduce(lambda left, right: pd.merge(left, right, on='Id2'), sheets))
```
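The `reduce` call folds `pd.merge` over the list, joining every table on `Id2` in one pass. The same pattern on toy tables (all values invented):

```python
from functools import reduce
import pandas as pd

t1 = pd.DataFrame({'Id2': [1, 2], 'age': [30, 40]})
t2 = pd.DataFrame({'Id2': [1, 2], 'edu': ['BA', 'HS']})
t3 = pd.DataFrame({'Id2': [1, 2], 'income': [50000, 42000]})

# merge t1 with t2, then that result with t3, always joining on Id2
merged = reduce(lambda left, right: pd.merge(left, right, on='Id2'), [t1, t2, t3])
print(list(merged.columns))  # one wide table keyed on Id2
```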
In order to perform demographic analysis on the non-profit donors, the donor spreadsheet must also become a geopandas GeoDataFrame. This is done by reading in the spreadsheet, dropping any rows that were unsuccessfully geocoded, and converting the longitude and latitude columns into Shapely Point objects. Once the dataframe has a correctly formatted geometry column, it can be converted into a GeoDataFrame.
```python
df = pd.read_csv('dataInput//donors.csv')
df = df.dropna(subset=['lat']).copy()  # avoid the SettingWithCopyWarning
geometry = [Point(xy) for xy in zip(df.lng, df.lat)]
df.drop(['serial_id', 'lng', 'lat'], axis=1, inplace=True)
donors = gpd.GeoDataFrame(df, geometry=geometry)
```
The donors and demographics can then be spatially joined into one GeoDataFrame. The inner join keeps only donors whose points fall inside a Michigan census tract, which is where the large majority of donors live.
```python
miDonors = gpd.sjoin(donors, demographics, how='inner', op='intersects')
```
Analysis
I used the pygal library to make the following graphs. These graphs give a general impression of the non-profit donors, and each census tract is effectively weighted by how many donors live there. For example, if a large share of donors live in Ottawa County, their tracts' figures count more heavily in the totals than those of the one donor in Oscoda County.
```python
count = 9131  # number of donor rows in miDonors

ageGroups = miDonors.iloc[:, 21:35].astype('float64').sum() / count
# the variable custom_style is defined elsewhere in the full script
bar_chart = pygal.HorizontalBar(style=custom_style,
                                value_formatter=lambda x: '{}%'.format(x),
                                tooltip_fancy_mode=False)
bar_chart.title = 'Age Groups'
for index, value in ageGroups.items():
    bar_chart.add(index, round(value, 2))
bar_chart.render_to_file('age.svg')
```
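The sum-then-divide step above can be sketched on a toy frame: each age column is summed over all donor rows and divided by the row count, giving an average share per donor (column names and values here are invented):

```python
import pandas as pd

# two donor rows, two toy age-bucket columns
toy = pd.DataFrame({'Under 5': [10.0, 20.0], '5 to 9': [30.0, 40.0]})
count = len(toy)
shares = toy.astype('float64').sum() / count  # column-wise mean per donor
print(shares['Under 5'], shares['5 to 9'])
```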
Income Groups
```python
incomeGroups = miDonors.iloc[:, 56:65].astype('float64').sum() / count
bar_chart = pygal.HorizontalBar(style=custom_style,
                                value_formatter=lambda x: '{}%'.format(x),
                                tooltip_fancy_mode=False)
bar_chart.title = 'Income Groups'
for index, value in incomeGroups.items():
    bar_chart.add(index, round(value, 2))
bar_chart.render_to_file('income.svg')
```
Educational Attainment in Ages 25 and Over
```python
education25 = miDonors.iloc[:, 41:47].astype('float64').sum() / count
denom = education25.sum()
# normalize the buckets so they total 100%
for index, value in education25.items():
    education25[index] = round((value / denom) * 100, 2)

bar_chart = pygal.Bar(style=custom_style,
                      value_formatter=lambda x: '{}%'.format(x),
                      tooltip_fancy_mode=False)
bar_chart.title = 'Educational Attainment - Ages 25 and over'
for index, value in education25.items():
    bar_chart.add(index, round(value, 2))
bar_chart.render_to_file('education25.svg')
```
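The normalization loop divides each education bucket by the sum of all buckets so the bars total 100%. On invented numbers:

```python
import pandas as pd

education = pd.Series({'HS': 20.0, 'BA': 30.0})
denom = education.sum()  # 50.0
for index, value in education.items():
    education[index] = round((value / denom) * 100, 2)
print(education['HS'], education['BA'])  # buckets now sum to 100
```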
Finally, a heat map is made with the Python library folium, which allows Leaflet web maps to be built from Python.
```python
import csv
import folium
from folium import plugins

heat_map = folium.Map(location=[43.455737, -84.760162], zoom_start=7,
                      max_zoom=14, tiles='cartodbpositron')
with open('dataInput//donors.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip column headers
    # column 13 holds lat, column 12 holds lng; skip rows that failed to geocode
    data = [[float(row[13]), float(row[12])] for row in reader if row[12] != '']
heat = plugins.HeatMap(data, blur=25)
heat_map.add_child(heat)
heat_map.save('heatmap.html')
```
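The list comprehension that feeds `HeatMap` can be checked on an in-memory CSV. Rows with an empty coordinate (failed geocodes) are skipped, and each kept row becomes a `[lat, lng]` pair. The column layout and values below are invented; in the real file the coordinates sit at positions 12 and 13:

```python
import csv
import io

# toy export: name, lng, lat; the blank coordinates mark a failed geocode
raw = io.StringIO("name,lng,lat\nA,-85.6,42.9\nB,,\nC,-86.1,43.0\n")
reader = csv.reader(raw)
next(reader)  # skip the header row
data = [[float(row[2]), float(row[1])] for row in reader if row[1] != '']
print(data)  # [lat, lng] pairs, failed geocode skipped
```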