This tutorial accompanies the Small Town Big Data blog post How Steamboat stays connected: Mapping Facebook’s Social Connectedness Index.
# Set your working directory to the /data folder included in the repository using setwd().
# Install the following packages for this tutorial:
# #install.packages(c("mapdeck", "rgdal", "sp", "rgeos", "classInt", "viridis", "gtools"))
library(rgdal)
library(sp)
library(mapdeck)
library(rgeos)
library(classInt)
library(viridis)
library(gtools)
# Load in the data. It's .tsv, which mean each value is separated by a tab. The function read.table has an option to set the separator to "/t", which means tab.
connect<-read.table(file = 'data/county_county_aug2020.tsv', sep = '\t', header = TRUE)
# Look at the data. It's quite a large dataset so it may take a moment to print into your console. There are only three columns here: user location, friend location and Social Connectedness Index (SCI). Each row is a relationship between two counties. So each county may show up multiple times in each the user and friend columns.
nrow(connect)
## [1] 10426441
summary(connect)
## user_loc fr_loc scaled_sci
## Min. : 1001 Min. : 1001 Min. :1.000e+00
## 1st Qu.:19039 1st Qu.:19039 1st Qu.:1.081e+03
## Median :30035 Median :30035 Median :1.942e+03
## Mean :31514 Mean :31514 Mean :4.608e+04
## 3rd Qu.:46123 3rd Qu.:46123 3rd Qu.:4.036e+03
## Max. :78030 Max. :78030 Max. :1.000e+09
You know that map you see all the time? It’s the continental United States by county. Each county is a different color. Most of the time you see it, those counties are either red or blue. You’re actually probably sick of seeing that map, and there are lots of reasons why that is ‘bad maps’ but we’re going to make one anyways. It’s called a choropleth - where defined areas (counties) are colored by an attribute or value associated with that area.
I’ve included in the repository a widely use map of U.S. counties. Let’s take a look.
#Load in the county shapefile. This loads as a SpatialPolygonsDataframe, a standard data type we use often in R spatial.
counties.shp<-readOGR("data/cb_2018_us_county_500k/cb_2018_us_county_500k.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\ncinglis\Desktop\stbd\connectedness\data\cb_2018_us_county_500k\cb_2018_us_county_500k.shp", layer: "cb_2018_us_county_500k"
## with 3233 features
## It has 9 fields
## Integer64 fields read as strings: ALAND AWATER
plot(counties.shp)
##Oh my goodness. Why does the U.S. look so tiny?
#Oh wait, that's because of the maps we're so used to looking at.It's too easy to forget some very far away parts of the U.S. And I'm not talking Alaska and Hawaii, I'm talking Guam, American Samoa.
#It's important to recognize these territories are part of our country, though they are not
#democratically or culturally included. We have to zoom out pretty far to rememeber that.
#Take a good look before we zoom back out.
##We're going to focus on the contiguous United States, known as the Lower 48, or CONUS.
#Some people just think this is the U.S. See above rant.
#Using the subset() function, we're exluding (!= means "DOES NOT EQUAL") the state FIPS codes that represent Alaska, Hawaii and the U.S. territories.
#What's a FIPS? Every state andcounty has a unique ID number that's going to come in handy for selecting and merging these data. Routt's is 8107: 8 is Colorado's state FIPS and 107 is Routt County.
conus<-subset(counties.shp, STATEFP != "02" & STATEFP != "15" & STATEFP!= "60" &
STATEFP != "66" & STATEFP!= "69" & STATEFP != "72" & STATEFP != "78")
plot(conus)
##First, let's pull just the connections that start with Routt County.
#We'll end up with a dataframe with all the county-county connections that start with Routt.
routt<-connect[connect$user_loc==8107,]
#Most social connections are within Routt County, so keeping that row in there really skews the plot. Let's take it out for the purpose of visualizing relationships to other counties.
routt<-routt[routt$fr_loc != 8107,]
#Let's merge that into the CONUS spatial object. We want each county to have the SCI value associated with that county's connection to routt. So we'll merge by matching the end county (fr_loc) with the county FIPS value in the CONUS object ('GEOID').
routt.sci<-merge(conus, routt, by.x='GEOID', by.y='fr_loc')
#Uh. Oh. What's going on here? Well, let's look at the two columns we're merging by:
class(routt$fr_loc)
## [1] "integer"
class(conus$GEOID)
## [1] "character"
#OK, that's the issue. One is a numeric value and the CONUS field is in an integer format. We just have to make it numeric to match:
conus$GEOID<-as.numeric(conus$GEOID)
routt.sci<-merge(conus, routt, by.x='GEOID', by.y='fr_loc')
#Ready, set, plot.
options(scipen=10)
plot<-spplot(routt.sci, "scaled_sci", col.regions = inferno(16))
plot
#Yikes. The scaling of the SCI values is challenging and it's hard to see the finer
#differences among counties.Thanks to a little packages called 'classInt' we can rescale
#the color breaks and make the story more clear.
#While we're at it, let's get rid of those dark county borders like all the cool kids do.
breaks.jenks <- classIntervals(routt.sci$scaled_sci, n = 20, style = "jenks", intervalClosure="right")
coolerplot<- spplot(routt.sci, "scaled_sci", col.regions = inferno(20), widths = 5, heights = 4, col="transparent", at = breaks.jenks$brks, par.settings = list(axis.line = list(col ='transparent')))
coolerplot
##And splitting the breaks a different way, by "natural" or jenks breaks, we get an even more colorful map. However, notice how skewed the color scale is. The data are very skewed with a large range. The data are highly skewed with a large range. Here's what Facebook says about what those huge numbers actually mean. See the blog post or FB's metadata for more insight into what those numbers actually mean.
breaks.qt <- classIntervals(routt.sci$scaled_sci, n = 20, style = "quantile", intervalClosure = "right")
othercoolerplot<- spplot(routt.sci, "scaled_sci", col.regions = inferno(20), widths = 5, heights = 4, col="transparent", at = breaks.qt$brks, par.settings = list(axis.line = list(col ='transparent')))
othercoolerplot
This is one way to visualize county-to-county connection day in relation to one county. But it doesn’t really give us that friendship connection vibe.
Also, as previously discussed, we are sick of U.S. county choropleth maps.
Mapdeck is an R package that lets you access all the extra goodness of the Javascript libraries Mapbox GL and Deck GL. You don’t need to know what that means. Think of it like any old Google map that you can scroll, zoom and click on. Except open source and you can fully customize it.
A huge shout out to David Cooley, Mapdeck’s author, for the top-notch documentation and tutorials. Well done.
Unlike the county choropleth, we need exact coordinates of the center of each county to show where each connection begins and ends. So we need to do a little data prep before we make our map.
#First thing's first, set your mapbox token with the following function:
#Sys.setenv('MAPBOX_TOKEN' = 'YOUR TOKEN HERE')
#Directions for getting one are at the top of this tutorial.
#Prep time
#Get the centroids of each county
centroids<-SpatialPointsDataFrame(gCentroid(conus, byid=T), conus@data, match.ID=FALSE)
#Make the lat and long their own columns (they're stored separately in a SPDF).
centroids$long<-centroids@coords[,1]
centroids$lat<-centroids@coords[,2]
##Select only the columns we need
centroids.coords<-centroids[,c(1,5:6)]
#Back to selected connections that begin in Routt County from the main dataset. Feel free to select your own county and let me know what you find!
croutt<-connect[connect$user_loc==8107,]
merge.routt.centroid<-merge(croutt, centroids.coords, by.x='user_loc', by.y='GEOID')
#Rename state fips, county name, long and lat columns to make sure we designate them as the start positions
names(merge.routt.centroid)[4:7]<-c("start.state", "start.county", "start.x", "start.y")
#Merge the centroids again, but this time for the end point
merge.end<-merge(merge.routt.centroid, centroids.coords, by.x='fr_loc', by.y='GEOID')
#And rename
names(merge.end)[8:11]<-c("end.state", "end.county", "end.x", "end.y")
#Cool, final dataframe we'll use. Notice how this dataframe isn't a spatial object. But that doesn't mean it's not spatial!
d<-merge.end
#We'll need to cut the SCI values into quantiles again so the differences are apparent in visualization.
q<- quantcut(d$scaled_sci, q=10, na.rm=TRUE)
d$q<-q
#Make the color palette match the one we used in the map
m<-colorRamp(inferno(10, direction=-1, alpha=.6))( (1:256)/256 )
#And here's our mapdeck object:
mc<- mapdeck(style = mapdeck_style("dark"), pitch = 45 ) %>%
add_arc(
data = d
, layer_id = "arc"
, origin = c("start.x", "start.y")
, destination = c("end.x", "end.y")
, stroke_from = "q"
, stroke_to = "q",
stroke_to_opacity = 100,
stroke_from_opacity = 100,
stroke=6,
legend=F,
height=.7,
palette=m,
auto_highlight=T,
tooltip="end.county"
)
mc
## Registered S3 method overwritten by 'jsonlite':
## method from
## print.json jsonify
Maybe let’s narrow it down a bit. Let’s select the top 10% of Routt’s Social Connectedness Index buddies.
##Make a vector of the SCI values in the top 10%
highest10<-sort(d$scaled_sci, decreasing=TRUE)[1:(length(d$scaled_sci)/10)]
#Select only those rows in the dataset
dhigh<-d[d$scaled_sci %in% highest10,]
#Recut into quantiles for color scale
q10<- quantcut(dhigh$scaled_sci, q=10, na.rm=TRUE)
dhigh$q<-q10
mhigh<- mapdeck(style = mapdeck_style("dark"), pitch = 45, zoom=5 ) %>%
add_arc(
data = dhigh
, layer_id = "arc"
, origin = c("start.x", "start.y")
, destination = c("end.x", "end.y")
, stroke_from = "q"
, stroke_to = "q",
stroke_to_opacity = 100,
stroke_from_opacity = 100,
stroke=6,
legend=F,
height=.7,
palette=m,
auto_highlight=T,
tooltip="end.county"
)
mhigh
See all the potential here? Mapdeck can handle large datasets, and the elements are highly customizeable. I encourage you to try adding other counties’ SCI data in different colors. Or, try the county-country dataset to see where our international friends are!
Disclaimer: ‘Because we can’ is a terrible reason to do anything in data viz. Just wanted to show you it’s an option in this package.
ma<- mapdeck(style = mapdeck_style("dark"), pitch = 45, zoom =5 ) %>%
add_animated_arc(
data = d
, layer_id = "arc_layer"
, origin = c("start.x", "start.y")
, destination = c("end.x", "end.y")
, stroke_from = "q"
, stroke_to = "q",
stroke_width = 3,
animation_speed = 2,
trail_length = 5,
legend=F,
height=.7,
palette=m,
auto_highlight=T,
tooltip="end.county"
)
## animated_arc is an experimental layer and the function may change without warning
ma
Learning to work with and plot vector data in R can commonly make you want to chuck your laptop out a window. I’ve been there, so if you have any questions at all, feel free to email me!.
Next post drops April 6.