This tutorial accompanies the Small Town Big Data blog post How Steamboat stays connected: Mapping Facebook’s Social Connectedness Index.


Mapdeck. My newest love affair. I can’t wait to share the love with you all.

This tutorial offers an introduction to making interactive maps with Mapbox via the Mapdeck package, using Facebook’s Social Connectedness Index dataset. We’ll cover:

Software: If you came here for a GIS tutorial and are wondering why it’s in R….that’s because it’s my go-to GIS: I dare you to change my mind.

To start, I recommend downloading RStudio. Open up a new .R file and start cutting and pasting! We’ll be using a series of spatial libraries for data management and visualization. Remember that it takes a lot of time (years) and patience (saint-like) to get comfortable using these libraries. Cut and paste your little heart out from this tutorial, the tutorials I learned from, and your best friend StackOverflow.


Data: Facebook has open data. I know - Facebook and open data in the same sentence? Sounds like an oxymoron. But they share a series of social connectedness datasets here. Read more about the dataset and its various uses here. Basically, it describes the magnitude of friendship connections between two locations. It’s not an exact number of friendships, but an index of the frequency and density of friend connections that obscures the exact number for private-company-can-do-whatever-they-want reasons.

Click here and download the county_county_aug2020.tsv file and add it to a new /data folder in this repository

Click here and download the cb_2018_us_county_500k.zip file and unzip it in the /data folder in this repository

Feel free to download and work with one of their other SCI datasets: county to country or country to country. I’ve chosen to work with county to county data.

Time to start cleaning.

Loading .tsv data

# Set your working directory to the /data folder included in the repository using setwd(). 
# Install the following packages for this tutorial: 
# #install.packages(c("mapdeck", "rgdal", "sp", "rgeos", "classInt", "viridis", "gtools"))
library(rgdal)
library(sp)
library(mapdeck)
library(rgeos)
library(classInt)
library(viridis)
library(gtools)
# Load in the data. It's .tsv, which mean each value is separated by a tab. The function read.table has an option to set the separator to "/t", which means tab. 
connect<-read.table(file = 'data/county_county_aug2020.tsv', sep = '\t', header = TRUE)

# Look at the data. It's quite a large dataset so it may take a moment to print into your console. There are only three columns here: user location, friend location and Social Connectedness Index (SCI). Each row is a relationship between two counties. So each county may show up multiple times in each the user and friend columns.
nrow(connect)
## [1] 10426441
summary(connect)
##     user_loc         fr_loc        scaled_sci       
##  Min.   : 1001   Min.   : 1001   Min.   :1.000e+00  
##  1st Qu.:19039   1st Qu.:19039   1st Qu.:1.081e+03  
##  Median :30035   Median :30035   Median :1.942e+03  
##  Mean   :31514   Mean   :31514   Mean   :4.608e+04  
##  3rd Qu.:46123   3rd Qu.:46123   3rd Qu.:4.036e+03  
##  Max.   :78030   Max.   :78030   Max.   :1.000e+09

Classic choropleth: Plotting Routt County’s nationwide SCI

You know that map you see all the time? It’s the continental United States by county. Each county is a different color. Most of the time you see it, those counties are either red or blue. You’re actually probably sick of seeing that map, and there are lots of reasons why that is ‘bad maps’ but we’re going to make one anyways. It’s called a choropleth - where defined areas (counties) are colored by an attribute or value associated with that area.

I’ve included in the repository a widely use map of U.S. counties. Let’s take a look.

#Load in the county shapefile. This loads as a SpatialPolygonsDataframe, a standard data type we use often in R spatial. 
counties.shp<-readOGR("data/cb_2018_us_county_500k/cb_2018_us_county_500k.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\ncinglis\Desktop\stbd\connectedness\data\cb_2018_us_county_500k\cb_2018_us_county_500k.shp", layer: "cb_2018_us_county_500k"
## with 3233 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
plot(counties.shp)

##Oh my goodness. Why does the U.S. look so tiny?
#Oh wait, that's because of the maps we're so used to looking at.It's too easy to forget some very far away parts of the U.S. And I'm not talking Alaska and Hawaii, I'm talking Guam, American Samoa.
#It's important to recognize these territories are part of our country, though they are not 
#democratically or culturally included. We have to zoom out pretty far to rememeber that. 
#Take a good look before we zoom back out. 

##We're going to focus on the contiguous United States, known as the Lower 48, or CONUS. 
#Some people just think this is the U.S. See above rant. 
#Using the subset() function, we're exluding (!= means "DOES NOT EQUAL") the state FIPS codes that represent Alaska, Hawaii and the U.S. territories.
#What's a FIPS? Every state andcounty has a unique ID number that's going to come in handy for selecting and merging these data. Routt's is 8107: 8 is Colorado's state FIPS and 107 is Routt County.  

conus<-subset(counties.shp, STATEFP != "02" & STATEFP != "15" & STATEFP!= "60" &
                STATEFP != "66" & STATEFP!= "69" & STATEFP != "72" & STATEFP != "78")

plot(conus)


##First, let's pull just the connections that start with Routt County. 
#We'll end up with a dataframe with all the county-county connections that start with Routt. 
routt<-connect[connect$user_loc==8107,]

#Most social connections are within Routt County, so keeping that row in there really skews the plot. Let's take it out for the purpose of visualizing relationships to other counties. 
routt<-routt[routt$fr_loc != 8107,]

#Let's merge that into the CONUS spatial object. We want each county to have the SCI value associated with that county's connection to routt. So we'll merge by matching the end county (fr_loc) with the county FIPS value in the CONUS object ('GEOID').

routt.sci<-merge(conus, routt, by.x='GEOID', by.y='fr_loc')

#Uh. Oh. What's going on here? Well, let's look at the two columns we're merging by:
class(routt$fr_loc)
## [1] "integer"
class(conus$GEOID)
## [1] "character"
#OK, that's the issue. One is a numeric value and the CONUS field is in an integer format. We just have to make it numeric to match:
conus$GEOID<-as.numeric(conus$GEOID)
routt.sci<-merge(conus, routt, by.x='GEOID', by.y='fr_loc')


#Ready, set, plot.

options(scipen=10)

plot<-spplot(routt.sci, "scaled_sci", col.regions = inferno(16))
plot

#Yikes. The scaling of the SCI values is challenging  and it's hard to see the finer 
#differences among counties.Thanks to a little packages called 'classInt'  we can rescale 
#the color breaks and make the story more clear. 
#While we're at it, let's get rid of those dark county borders like all the cool kids do. 

breaks.jenks <- classIntervals(routt.sci$scaled_sci, n = 20, style = "jenks", intervalClosure="right")


coolerplot<- spplot(routt.sci, "scaled_sci", col.regions = inferno(20), widths = 5, heights = 4, col="transparent", at = breaks.jenks$brks, par.settings = list(axis.line = list(col ='transparent')))

coolerplot

##And splitting the breaks a different way, by "natural" or jenks breaks, we get an even more colorful map. However, notice how skewed the color scale is. The data are very skewed with a large range. The data are highly skewed with a large range. Here's what Facebook says about what those huge numbers actually mean. See the blog post or FB's metadata for more insight into what those numbers actually mean. 

breaks.qt <- classIntervals(routt.sci$scaled_sci, n = 20, style = "quantile", intervalClosure = "right")

othercoolerplot<- spplot(routt.sci, "scaled_sci", col.regions = inferno(20), widths = 5, heights = 4, col="transparent", at = breaks.qt$brks, par.settings = list(axis.line = list(col ='transparent')))

othercoolerplot


This is one way to visualize county-to-county connection day in relation to one county. But it doesn’t really give us that friendship connection vibe.

Also, as previously discussed, we are sick of U.S. county choropleth maps.

Mapdeck time: Interactive maps for visualizing network data

Mapdeck is an R package that lets you access all the extra goodness of the Javascript libraries Mapbox GL and Deck GL. You don’t need to know what that means. Think of it like any old Google map that you can scroll, zoom and click on. Except open source and you can fully customize it.

A huge shout out to David Cooley, Mapdeck’s author, for the top-notch documentation and tutorials. Well done.

Unlike the county choropleth, we need exact coordinates of the center of each county to show where each connection begins and ends. So we need to do a little data prep before we make our map.

#First thing's first, set your mapbox token with the following function: 
#Sys.setenv('MAPBOX_TOKEN' = 'YOUR TOKEN HERE')
#Directions for getting one are at the top of this tutorial. 

#Prep time
#Get the centroids of each county
centroids<-SpatialPointsDataFrame(gCentroid(conus, byid=T), conus@data, match.ID=FALSE)

#Make the lat and long their own columns (they're stored separately in a SPDF).
centroids$long<-centroids@coords[,1]
centroids$lat<-centroids@coords[,2]

##Select only the columns we need 
centroids.coords<-centroids[,c(1,5:6)]


#Back to selected connections that begin in Routt County from the main dataset. Feel free to select your own county and let me know what you find!
croutt<-connect[connect$user_loc==8107,]
merge.routt.centroid<-merge(croutt, centroids.coords, by.x='user_loc', by.y='GEOID')

#Rename state fips, county name, long and lat columns to make sure we designate them as the start positions
names(merge.routt.centroid)[4:7]<-c("start.state", "start.county", "start.x", "start.y")

#Merge the centroids again, but this time for the end point
merge.end<-merge(merge.routt.centroid, centroids.coords, by.x='fr_loc', by.y='GEOID')

#And rename 
names(merge.end)[8:11]<-c("end.state", "end.county", "end.x", "end.y")

#Cool, final dataframe we'll use. Notice how this dataframe isn't a spatial object. But that doesn't mean it's not spatial! 
d<-merge.end

#We'll need to cut the SCI values into quantiles again so the differences are apparent in visualization. 
q<- quantcut(d$scaled_sci, q=10, na.rm=TRUE)
d$q<-q

#Make the color palette match the one we used in the map
m<-colorRamp(inferno(10, direction=-1, alpha=.6))( (1:256)/256 )

#And here's our mapdeck object: 
mc<- mapdeck(style = mapdeck_style("dark"), pitch = 45 ) %>%
  add_arc(
    data = d
    , layer_id = "arc"
    , origin = c("start.x", "start.y")
    , destination = c("end.x", "end.y")
    , stroke_from = "q"
    , stroke_to = "q",
    stroke_to_opacity = 100,
    stroke_from_opacity = 100,
    stroke=6,
    legend=F,
    height=.7,
    palette=m,
    auto_highlight=T,
    tooltip="end.county"
  )

mc
## Registered S3 method overwritten by 'jsonlite':
##   method     from   
##   print.json jsonify

WOW.That’s a spaghetti map if I’ve ever seen one!

Maybe let’s narrow it down a bit. Let’s select the top 10% of Routt’s Social Connectedness Index buddies.

##Make a vector of the SCI values in the top 10%
highest10<-sort(d$scaled_sci, decreasing=TRUE)[1:(length(d$scaled_sci)/10)]

#Select only those rows in the dataset
dhigh<-d[d$scaled_sci %in% highest10,]

#Recut into quantiles for color scale
q10<- quantcut(dhigh$scaled_sci, q=10, na.rm=TRUE)
dhigh$q<-q10


mhigh<- mapdeck(style = mapdeck_style("dark"), pitch = 45, zoom=5 ) %>%
  add_arc(
    data = dhigh
    , layer_id = "arc"
    , origin = c("start.x", "start.y")
    , destination = c("end.x", "end.y")
    , stroke_from = "q"
    , stroke_to = "q",
    stroke_to_opacity = 100,
    stroke_from_opacity = 100,
    stroke=6,
    legend=F,
    height=.7,
    palette=m,
    auto_highlight=T,
    tooltip="end.county"
  )

mhigh

See all the potential here? Mapdeck can handle large datasets, and the elements are highly customizeable. I encourage you to try adding other counties’ SCI data in different colors. Or, try the county-country dataset to see where our international friends are!

There is literally no reason to do this but….let’s animate that !@#$ because we can

Disclaimer: ‘Because we can’ is a terrible reason to do anything in data viz. Just wanted to show you it’s an option in this package.

ma<- mapdeck(style = mapdeck_style("dark"), pitch = 45, zoom =5 ) %>%
  add_animated_arc(
    data = d
    , layer_id = "arc_layer"
    , origin = c("start.x", "start.y")
    , destination = c("end.x", "end.y")
    , stroke_from = "q"
    , stroke_to = "q",
    stroke_width = 3,
    animation_speed = 2,
    trail_length = 5,
    legend=F,
    height=.7,
    palette=m,
    auto_highlight=T,
    tooltip="end.county"
    
  )
## animated_arc is an experimental layer and the function may change without warning
ma

Learning to work with and plot vector data in R can commonly make you want to chuck your laptop out a window. I’ve been there, so if you have any questions at all, feel free to email me!.

Next post drops April 6.