---
title: "Iterate Ingest"
author: "John Little"
date: "`r Sys.Date()`"
output: html_notebook
---
In this lesson we'll learn how to ingest multiple CSV files into a single data frame, also known as a *tibble*. To do this, I'll use the file system package `fs`, which helps manipulate files. I'll also introduce the technique of mapping with the `purrr` package, but the full purrr lesson comes later.
## Load library packages
```{r}
library(tidyverse)
library(fs)
```
## Iterate over files
1. Get a list of CSV files in a particular directory.
2. Use the `read_csv` function to ingest all those files into a single tibble.

Since the `read_csv` function is vectorized, this task only requires putting the filenames into a character vector.
```{r}
my_files <- fs::dir_ls("data", glob = "*.csv")
my_files
```
```{r}
my_big_df <- read_csv(my_files, id = "filename")
```
Since all the files have the same formatting and structure, this works with no extra effort. Throughout this series I'll introduce techniques for wrangling untidy, inconsistent, or messy files into a clean and tidy tibble.
```{r}
my_big_df

# my_big_df |>
#   count(rideable_type)
```
Specifying `col_types` lets us force a consistent data type when `read_csv` would otherwise guess differently across files:
```{r}
readr::read_csv(my_files, id = "filename",
                col_types = cols(start_station_id = col_character()))
```
## purrr - first look
As noted, not all files are easily ingested into a single tibble. One common problem is that the data variables are consistent across files, but inconsistencies in the data lead `read_csv` to make different guesses about the data types of the same variables, which causes errors. We can solve this problem by iterating over each file more systematically while forcing `read_csv` to assign the same data types to the same variables across files.
R is a functional language, and `read_csv` is one function among the many that exist within the thousands of packages developed by the R community. We can also create custom functions of our own. Either way, we may need to apply a function to a list of data. This is where the `purrr` package comes in, giving us the `map_` family of functions. These are similar to base R's *apply* family of functions, but they are more consistently documented and, by extension, often easier to use.
But before we get too deep into functions, let me demonstrate how a function can be mapped across a list of data.
One way to use `map()` is to iterate a function over each element of a list or vector (or each column of a data frame). Many R and tidyverse functions already do this well, so this initial example may seem a bit verbose. Stay tuned and you'll see how it all comes together.
Example:\
`map(my_df, my_func)`
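Here is a minimal illustration with a made-up list and a small custom function (neither has anything to do with the bike data; they exist only to show the pattern):
```{r}
# a tiny custom function
double_it <- function(x) x * 2

# map() applies the function to each element of the list
# and returns a list of results
map(list(1, 2, 3), double_it)

# map_dbl() does the same but returns a numeric vector
map_dbl(list(1, 2, 3), double_it)
```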
In this case, we want to force the column data type using a `read_csv()` argument: `col_types`. We could do this without `purrr::map()`, but the code chunk below introduces the pattern of passing a function, along with its arguments, to the `map_` family. The purrr functions are powerful, and we will return to this concept.
### map
```{r}
foo_df <- map_df(my_files, read_csv,
                 col_types = cols(start_station_id = col_character(),
                                  end_station_id = col_character()),
                 id = "filename")

foo_df
```
### Excel
You can do a similar iteration with Excel workbooks and worksheets. A well-composed [article explains these Excel workflows](https://readxl.tidyverse.org/articles/readxl-workflows.html), combining the power of the `purrr` and `readxl` packages. Lessons five and six will show more examples of working with multiple Excel files.
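As a preview, here is a minimal sketch of reading every worksheet of a single workbook into one tibble. The workbook path is hypothetical and the sketch assumes all worksheets share the same column structure, so the chunk is not evaluated here:
```{r eval=FALSE}
library(readxl)  # installed with the tidyverse but not attached by library(tidyverse)

# hypothetical workbook path; replace with a real file
workbook_path <- "data/my_workbook.xlsx"

# excel_sheets() lists the worksheet names; set_names() keeps those names
# so that .id = "worksheet" records which sheet each row came from
all_sheets_df <- workbook_path |>
  excel_sheets() |>
  set_names() |>
  map_dfr(\(sheet) read_excel(workbook_path, sheet = sheet),
          .id = "worksheet")
```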
## Examples and tips for working with files found on the internet
For the remainder of this lesson, we'll apply this idea by downloading data from the web and iterating over the files during ingest. We'll make a nice little visualization, which is more satisfying than looking at a raw table of data. I use the `fs` package to manipulate the file system, and since the files are relatively large, I delete them at the end.
### Compressed files can remain compressed
Get IMDB data from <https://www.imdb.com/interfaces/>
Even though these TSV (Tab Separated Values) files are compressed (`.gz`), they're still large. It's nice to know that the `readr` functions will uncompress the files as they read them -- so you don't have to.
```{r}
fs::dir_create("data")

if (!fs::file_exists("data/title.ratings.tsv.gz")) {
  download.file("https://datasets.imdbws.com/title.ratings.tsv.gz",
                "data/title.ratings.tsv.gz")
}

if (!fs::file_exists("data/name.basics.tsv.gz")) {
  download.file("https://datasets.imdbws.com/name.basics.tsv.gz",
                "data/name.basics.tsv.gz")
}
```
## Clean names
The `janitor` package has a handy function, `clean_names()`, which helps keep your variable names tidy.
```{r}
imdb_df <- read_tsv("data/title.ratings.tsv.gz", n_max = 1000)
imdb_df

imdb_df |>
  janitor::clean_names()

imdb_names_df <- read_tsv("data/name.basics.tsv.gz", n_max = 1000) |>
  janitor::clean_names()
imdb_names_df
```
## Multivalue fields
This section introduces `separate_rows()` and `separate()`, plus a very early look at using a regular expression, or regex, to find patterns in text -- in this case, counting the commas in a multivalue field.
```{r}
imdb_names_df |>
  separate_rows(primary_profession) |>
  count(primary_profession, sort = TRUE)

imdb_names_df |>
  mutate(prolificacy = str_count(known_for_titles, ",") + 1) |>
  separate(known_for_titles,
           into = c("foo_1", "foo_2", "foo_3", "foo_4"),
           sep = ",")
```
## Tall/Long data
Using `separate_rows()` has the effect of making the tibble **tall** rather than wide. Making data tall or wide is typically accomplished with the `pivot_longer()` and `pivot_wider()` functions. A good reason to pivot tall is to leverage the convenience of iteration. In the `separate_rows()` example below, a tibble of 1,000 observations (rows) becomes a tibble of nearly 3,000 rows. In this way, `separate_rows()` acts as a specialized compound function, combining `separate()`, `pivot_longer()`, and `drop_na()`. After this transformation it becomes easier to iterate in ggplot2, i.e. to calculate the frequency of categories for a bar plot.
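As a quick aside, here is a minimal `pivot_longer()` sketch on a made-up toy tibble (not part of the IMDB data), just to show the wide-to-tall reshaping that `separate_rows()` performs in a more specialized way:
```{r}
# a tiny, made-up wide tibble
toy_wide <- tibble(name = c("a", "b"), x = 1:2, y = 3:4)
toy_wide

# pivot to tall: one row per name/variable combination
toy_wide |>
  pivot_longer(cols = c(x, y), names_to = "variable", values_to = "value")
```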
```{r}
imdb_names_df
imdb_names_df |>
separate_rows(primary_profession, sep = ",") |>
ggplot(aes(y = fct_rev(fct_infreq(primary_profession)))) +
geom_bar()
```
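Finally, as mentioned at the start of this section, the downloaded IMDB files are relatively large, so I delete them when I'm done. A minimal cleanup sketch with `fs` might look like this (not evaluated here, in case you still want the files):
```{r eval=FALSE}
# list the downloaded .tsv.gz files and delete them
fs::dir_ls("data", glob = "*.tsv.gz") |>
  fs::file_delete()
```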