How to Import Tabular, Plain-Text Data Files

rstats workflow tutorial

Learn how to import (and export) various tabular data file formats such as CSV, TXT, and XLSX using a range of packages such as {utils}, {vroom}, {readr}, {data.table}, {readxl}, and {rio}. The article also includes best practices for specifying paths and a quick discussion on handling R-specific formats such as RDS and RData.

Cedric Scherer (https://cedricscherer.com), IZW Berlin (https://ecodynizw.github.io/)
2023-12-07

There is one essential step to make use of the power of the R programming language to wrangle, analyze, visualize, and communicate our data: getting the data into R. In this blog post, we will show you multiple approaches for various tabular, plain-text file formats. Note that this blog post focuses on tabular data — if you need to import spatial data files, have a look here.

File Paths

In this blog post, we will make use of files stored remotely, but all the workflows would be the same if the file were placed in a local directory. There are multiple ways to specify the path to a local file:

Absolute Paths: Start from the Root Directory

Relative Paths: Start from the Working Directory
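
As a quick illustration (both paths below are placeholders, not files that exist in this project):

## absolute path: starts from the root directory and is machine-specific
path_to_data <- "/home/user/projects/deforestation/data/raw/forest.csv"

## relative path: starts from the current working directory
path_to_data <- "data/raw/forest.csv"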

R Projects

To simplify your life, make use of R projects. The .Rproj file can be used as a shortcut for opening the project directly from the file system, with the working directory set to the project path. This way, you do not need to set the working directory manually (which can cause issues across different machines, operating systems, etc.).

{here} Package

When you are working with nested directories, and especially with notebook formats such as R Markdown or Quarto, the {here} package helps make sure that the same working directory is used, no matter whether you run the code locally or render a notebook that is placed in a subfolder. Read more about the {here} package here.
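
A minimal illustration, assuming you are working inside an R project:

## returns the project root, no matter where the script or notebook lives
here::here()

## builds a file path relative to that root
here::here("data", "raw")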

Data Import

To showcase the different I/O packages, we use data on deforestation provided by OurWorldInData, which is hosted in the TidyTuesday repository:

path_to_data <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/forest.csv"

Usually, you can specify the symbols used as column separators, as quotation marks, as decimal points, and more.

If the file were stored locally, a relative path could look like this:

path_to_data <- "./data/raw/forest.csv"
path_to_data <- "data/raw/forest.csv"

## with the {here} package
path_to_data <- here::here("data", "raw", "forest.csv")

{utils} aka “base R”

The read.*() functions from the {utils} package return data frames.

read.table() is a generic function that can be used for all kinds of tabular data formats; read.csv() is a shorthand wrapper that calls read.table() with sensible defaults for CSV files.

dat <- read.table(path_to_data, sep = ",", quote = "\"'", dec = ".")
dat <- read.csv(path_to_data)

Often (as in this case) the first row of the file contains column names. We can tell R to use them as variable names by setting header = TRUE (the default for read.csv(), but not for read.table()):

dat <- read.csv(path_to_data, header = TRUE)

If you have “German” CSV files, which use a semicolon as column separator (so strictly speaking not CSV files at all) and a comma as decimal separator, you can use read.csv2(), which uses sep = ";" and dec = "," as the defaults. There are also shorthand wrappers for tab-delimited files, named read.delim() and read.delim2().
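
As a sketch (the file name is a placeholder), the following two calls behave the same:

## "German" CSV: semicolon as column separator, comma as decimal separator
dat <- read.csv2("data/datei.csv")
dat <- read.csv("data/datei.csv", sep = ";", dec = ",")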

You can directly specify the column types by passing a vector that contains the classes for all columns:

dat <- 
  read.csv(
    path_to_data, header = TRUE, 
    colClasses = c(code = "factor", year = "factor")
)

If you want to inspect your data, you can use the nrows argument to specify the maximum number of rows to import. If the file contains any rows that should be skipped before reading the data (often the case if the file also contains metadata, as in governmental data or NetLogo outputs), you can specify the number of rows to skip via skip.

dat <- read.csv(path_to_data, nrows = 5, skip = 3)

Another argument you might commonly see is stringsAsFactors. As of version 4.0.0, the default behaviour has changed: R now treats strings in data frames as strings rather than converting them to factors. You can restore the old behaviour by setting stringsAsFactors = TRUE, but this is usually not advised. It is likely better to specify factor columns via the colClasses argument or to convert them at a later point.
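
For comparison (using the same remote file as above):

## old behaviour, forced explicitly: *all* strings become factors
dat <- read.csv(path_to_data, stringsAsFactors = TRUE)

## usually better: turn only selected columns into factors
dat <- read.csv(path_to_data, colClasses = c(code = "factor"))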

{vroom} and {readr}

The vroom() function from the {vroom} package and the read_*() functions from the {readr} package return tibbles (the “modern data frames” used within the tidyverse). They work mostly the same as the read.*() functions but are meant to be faster and come with several additional options to control imports. The {readr} functions call vroom() under the hood, so the two behave essentially the same (and might be merged into a single package at some point).

dat <- vroom::vroom(path_to_data)
dat <- readr::read_delim(path_to_data, delim = ",")
dat <- readr::read_csv(path_to_data)

There are shorthand wrappers for different file formats, including read_csv() for comma-separated, read_csv2() for semicolon-separated, and read_tsv() for tab-separated files.
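
For example (the semicolon- and tab-separated file names are placeholders):

dat <- readr::read_csv(path_to_data)       ## comma-separated
dat <- readr::read_csv2("data/datei.csv")  ## semicolon-separated, comma as decimal
dat <- readr::read_tsv("data/file.tsv")    ## tab-separated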

Again, one can control the column types in the same step. The vroom() and read_*() functions allow you to specify the types via the col_*() functions, which accept additional arguments, or as a very compact string of shortcuts. The shortcuts represent columns of type character (c), double (d), factor (f), logical (l), integer (i), and date (D).

library(vroom)

## specify types for certain columns
dat <- 
  vroom(path_to_data, col_types = cols(
    code = col_factor(), 
    year = col_factor(levels = c("1990", "2000", "2010", "2015"))
  ))

## specify types as shortcuts for all columns
dat <- vroom(path_to_data, col_types = "cffi")

{data.table}

The {data.table} package offers a “fast read” function that is especially recommended when working with large files with many rows.

dat <- data.table::fread(path_to_data)

Different file types can be imported by specifying the separator between columns via sep. Nested lists are also supported; in that case, you have to specify sep2 to define the separator within columns.
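
A minimal sketch with a hypothetical tab-separated file (fread() usually detects the separator automatically, but you can set it explicitly):

dat <- data.table::fread("data/file.tsv", sep = "\t")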

As with the read.*() functions, one can specify the column types right away. Multiple ways are possible:

## specified as in read.*()
dat <- data.table::fread(path_to_data, colClasses = c(code = "factor", year = "factor")) 

## columns specified by type with names
dat <- data.table::fread(path_to_data, colClasses = list(factor = c("code", "year")))

## columns specified by type with numbers
dat <- data.table::fread(path_to_data, colClasses = list(factor = 2:3))

{readxl}

Excel files have a very specific format and can contain multiple sheets. In general, it is advised to avoid Excel files whenever possible. If you have to work with XLS or XLSX files, you likely should not open them in Excel first (doing so can introduce issues; for example, Excel may silently change column formats). It is better to import the data into R to inspect and wrangle it. You can use the {readxl} package to handle those files:

dat <- readxl::read_excel("data/data.xlsx")

dat <- readxl::read_xls("data/data.xls")
dat <- readxl::read_xlsx("data/data.xlsx")

The functions from the {readxl} package only work with paths to local files, not URLs. The examples above use placeholder names; the files do not exist. If you know the type of your file, you may want to use read_xls() or read_xlsx() directly, as read_excel() has to guess the right format.
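
If the Excel file lives on the web, one workaround is to download it to a temporary file first (the URL below is a placeholder):

## readxl cannot read from URLs, so download to a temporary file first
tmp <- tempfile(fileext = ".xlsx")
download.file("https://example.com/data.xlsx", destfile = tmp, mode = "wb")
dat <- readxl::read_xlsx(tmp)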

By default, the functions import the first sheet. You can specify the sheet to import by passing either a number or the name of the sheet:

## open sheet by number
dat <- readxl::read_xlsx("data/data.xlsx", sheet = 2)

## open sheet by name
dat <- readxl::read_xlsx("data/data.xlsx", sheet = "Table 2") 
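
If you are unsure which sheets a file contains, you can list their names first:

## list all sheet names contained in the file
readxl::excel_sheets("data/data.xlsx")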

{rio}

The {rio} package describes itself as “the Swiss army knife for data I/O”. The idea is that a single function can be used for any file format, which is detected on the fly. The package uses several of the previously mentioned packages to achieve full flexibility with regard to file support.

data <- rio::import(path_to_data)

When passing a CSV file as in this example, the import() function uses fread() from the {data.table} package by default. The import() function supports all kinds of tabular data as well as many other file formats, including SAS, SPSS, Stata, R objects, JSON, GeoJSON, Apache Arrow, Feather, and more.
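
For example (all file names are placeholders):

dat <- rio::import("data/file.sav")   ## SPSS
dat <- rio::import("data/file.dta")   ## Stata
dat <- rio::import("data/file.json")  ## JSON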

Working with R-specific Formats

R provides two file formats for storing data: RDS and RData (the latter sometimes saved with the shorter extension .rda). RDS files can store only a single R object, while RData files can store multiple R objects. Another important difference is that the object read from an RDS file needs to be assigned to a name, while RData files are loaded into the environment under the object names picked by the person who saved the file (for better or worse, as this may overwrite existing objects with the same name in your environment).

You can open an RDS file with readRDS() or readr::read_rds():

dat <- readRDS("data/file.Rds")
dat <- readr::read_rds("data/file.Rds")

Opening RData files is even easier: simply run load() on the file:

load("data/file.RData")
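
A small detail worth knowing: load() invisibly returns the names of the restored objects, which lets you check what just entered your environment:

## capture the names of the objects restored from the RData file
loaded <- load("data/file.RData")
loaded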

Data Export

All the packages mentioned (except for {readxl}) also offer functions to write your objects to disk.

utils::write.csv(dat, file = "data/file.csv")
vroom::vroom_write(dat, file = "data/file.csv")
readr::write_csv(dat, file = "data/file.csv")
data.table::fwrite(dat, file = "data/file.csv")
rio::export(dat, file = "data/file.csv")
save(dat, file = "data/file.RData")
saveRDS(dat, file = "data/file.Rds")

Citation

For attribution, please cite this work as

Scherer (2023, Dec. 7). Ecological Dynamics: How to Import Tabular, Plain-Text Data Files. Retrieved from https://ecodynizw.github.io/posts/data-import/

BibTeX citation

@misc{scherer2023how,
  author = {Scherer, Cedric},
  title = {Ecological Dynamics: How to Import Tabular, Plain-Text Data Files},
  url = {https://ecodynizw.github.io/posts/data-import/},
  year = {2023}
}