Commit 8753ac2

committed
Blog Post.
1 parent 2c371d4 commit 8753ac2

2 files changed

+36
-0
lines changed

@@ -0,0 +1,36 @@
---
title: Tidyr example - use gather to make long data
author: Erik Palmore
date: '2017-12-22'
slug: tidyr-example-use-gather-to-make-long-data
categories:
- R
- Examples
tags:
- tidyr
- examples
- tutorials
- education
- ipeds
- regx
- gather
- long
---

I had the chance to make a very clean example of making long data from wide data. After a conversation about enrollment trends and demographics, I wanted to look more closely at the composition of students at four-year institutions in St. Louis. The data file available from [IPEDS](https://nces.ed.gov/ipeds/) provided a variable for each race/ethnicity class, further broken down by gender and total, for each year. The wide form, with 160+ variables, is simply too hard to explore, but the tidy, long version is much easier to work with (including some nice quick ggplots).

![](images/ipeds.png)

After loading the CSV file, the first thing I do is rename the columns by finding the portions I don't need and replacing them with an empty string. The column names are just too long. Plus, I want to extract the year as its own variable, so it is easiest to remove everything after the year.
```{r eval=FALSE}
# Strip the long suffix that follows the year, so each column name ends in the year
names(df) <- gsub("A_RV..Full.time.students..Undergraduate.total.|A..Full.time.students..Undergraduate.total.", "", names(df))
```

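To see what the renaming step does, here is a minimal sketch on a few made-up column names. The names below are hypothetical, shaped the way `read.csv` might render long IPEDS headers; they are not taken from the actual file. It also shows a stricter variant of the pattern with the dots escaped, since an unescaped `.` matches any character.

```r
# Hypothetical column names (assumed for illustration, not from the real data file)
cols <- c(
  "UnitID",
  "Institution.Name",
  "Grand.total.men..EF2015A_RV..Full.time.students..Undergraduate.total.",
  "Grand.total.women..EF2016A..Full.time.students..Undergraduate.total."
)

# The pattern used in the post: "." is a wildcard here, which still works
loose <- "A_RV..Full.time.students..Undergraduate.total.|A..Full.time.students..Undergraduate.total."
cleaned <- gsub(loose, "", cols)

# A stricter variant: literal dots, optional "_RV", anchored at the end of the name
strict <- "A(_RV)?\\.\\.Full\\.time\\.students\\.\\.Undergraduate\\.total\\.$"
cleaned_strict <- gsub(strict, "", cols)

print(cleaned)
```

On these sample names both patterns leave the ID columns untouched and trim the other two so that they end in the year.
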
Now that I have this, I can very easily use tidyr to "gather" the data into a long form and extract the year. First, the gather function is given the name "Group" for the new column that will hold the variable names from those 160+ columns. "Enrollment" is the name of the new column that all of their values will be collected into. The "-UnitID" and "-Institution.Name" arguments tell gather NOT to gather these two columns into the new, long columns.
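```{r eval=FALSE}
library(tidyr)  # provides gather() and extract(), and re-exports %>%
# Gather every column except the two ID columns, then split the year off the name
df <- df %>% gather(Group, Enrollment, -UnitID, -Institution.Name) %>%
  extract(Group, c("Group", "Year"), "(.*)(201.$)")
```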
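
As a small illustration of the gather step on its own, here is a sketch using a tiny made-up data frame. The institutions and enrollment numbers are invented for the example, and only two value columns stand in for the 160+ in the real file.

```r
library(tidyr)

# Made-up wide data: two ID columns plus one column per group/year
wide <- data.frame(
  UnitID = c(1, 2),
  Institution.Name = c("College A", "College B"),
  Grand.total.men..EF2015 = c(100, 200),
  Grand.total.men..EF2016 = c(110, 210),
  stringsAsFactors = FALSE
)

# Old column names go into "Group", their values into "Enrollment";
# the two ID columns are excluded with "-" and repeated on every row
long <- gather(wide, Group, Enrollment, -UnitID, -Institution.Name)

print(long)
```

Each institution now has one row per original value column. In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`, though `gather()` still works.
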
The extract function is really great, and I'd been looking for something like this. It is like an advanced text-to-columns, if you're familiar with that Excel feature. The first argument, "Group", is the column of data to split; the vector c("Group", "Year") gives the names of the columns that will emerge; and the final string is a regular expression with the same number of capturing groups (in parentheses) as there are new columns. ".*" is basically everything before the year, and "201.$" is the year, which goes into its own new column.
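
Here is a minimal sketch of the extract step in isolation, again on hypothetical values rather than the real IPEDS names. It calls `tidyr::extract` explicitly because other packages (magrittr, for example) also export a function named `extract`.

```r
library(tidyr)

# Hypothetical "Group" values, each ending in a four-digit year
groups <- data.frame(
  Group = c("Grand.total.men..EF2015", "Grand.total.women..EF2016"),
  stringsAsFactors = FALSE
)

# Two capturing groups, two new columns: the greedy (.*) takes everything
# up to the year, and (201.$) takes the last four characters
parts <- tidyr::extract(groups, Group, c("Group", "Year"), "(.*)(201.$)")

print(parts)
```

The new `Year` column is character by default; passing `convert = TRUE` to `extract()` should turn it into an integer if that is more convenient for plotting.
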

The resulting data is nicely organized into UnitID, Institution.Name, Group, Enrollment, and Year!

content/post/images/ipeds.png

113 KB