Nicolas

Last week, I came across a blog post by Yihui Xie called save() vs saveRDS(). It was written after a response from Jenny Bryan on Twitter.

When more experienced people like them advise using a tool, it is for a reason, so I decided to give it a try. Not to compare save() and saveRDS(), but saveRDS() and write_csv(). This is not a complete review, as I lack the knowledge, but a short intro to something I found very useful.

As you might have read earlier on this blog, I’m working with classmates on a project about the social division of residential space in Greater Mexico City. It involves a lot of data:

  • for 2015: 64 csv files, 22 billion lines, around 180 variables about housing and people, for a total of around 7 GB of disk space.
  • for 2010: 32 xls files, for 4.2 GB of disk space.

The data covers the whole of Mexico, so we filtered it to keep only the rows for the Greater Mexico City area, then selected the variables we thought would help us describe the residential space, around 40 of them.

Then we saved the result to a much smaller csv file. This 312 MB file contains 1,645,437 observations and 46 variables. As we work with git on this project, hosted on a free GitLab instance called Framagit, csv files are ignored to save space. So we compressed this csv file to make it small enough for git to track.

I knew it was not optimal, but we didn’t know of any alternatives.

Let’s see why saveRDS is more helpful.

saving an object

Let’s work with the famous mtcars dataset.

# 1. Loading 
data("mtcars")
# 2. Print
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Datatypes

In this dataset, all columns are <dbl>, so for this example I convert hp to a factor. Let’s say it is a code for something else.

class(mtcars$hp)
## [1] "numeric"
mtcars$hp <- as.factor(mtcars$hp)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
class(mtcars$hp)
## [1] "factor"

saving into a csv

Let’s see what happens if I save to a csv and load it back.

write.csv(mtcars, file = "mtcars.csv")
mtcars_csv <- read.csv("mtcars.csv")
head(mtcars_csv)
##                   X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
class(mtcars_csv$hp)
## [1] "integer"

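As you can see, hp is now an integer, like cyl or vs. A csv file stores only text, so the datatypes are lost and guessed again on import. You can specify datatypes with read_csv() from the tidyverse (or with the colClasses argument of read.csv()), but you have to do it each time you load the file.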
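To make the "each time" cost concrete, here is a minimal sketch using base R's read.csv() and its colClasses argument rather than read_csv(). The object names and the temporary file path are only illustrative assumptions.

```r
# Sketch: restoring a factor column by hand when reading a csv back.
# cars_df and csv_path are illustrative names, not part of the project.
cars_df <- mtcars
cars_df$hp <- as.factor(cars_df$hp)

csv_path <- tempfile(fileext = ".csv")
write.csv(cars_df, file = csv_path)

# Without colClasses, the type of hp is guessed from the text in the file.
guessed <- read.csv(csv_path)
class(guessed$hp)

# With colClasses, hp is read as a factor, but this argument has to be
# repeated in every script that loads the file.
typed <- read.csv(csv_path, colClasses = c(hp = "factor"))
class(typed$hp)
```

With 46 variables, most of them factors, that colClasses vector quickly becomes long.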

using saveRDS()

Now with the saveRDS function:

saveRDS(mtcars, file = "mtcars.rds")
mtcars_rds <- readRDS("mtcars.rds")
head(mtcars_rds)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
class(mtcars_rds$hp)
## [1] "factor"

As you can see, saveRDS() preserves datatypes. It also preserves the encoding of character strings. For the Mexican project, our data is in Spanish and uses the ISO 8859-1 encoding, so that is one less thing to struggle with.
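A quick way to check both claims is a round trip compared with identical(). This is only a sketch: the object names, the sample string and the temporary file are assumptions made for illustration.

```r
# Sketch: round trip through saveRDS()/readRDS().
cars_df <- mtcars
cars_df$hp <- as.factor(cars_df$hp)

rds_path <- tempfile(fileext = ".rds")
saveRDS(cars_df, file = rds_path)
restored <- readRDS(rds_path)

# TRUE when values, classes and attributes all match.
identical(cars_df, restored)

# A string marked as latin1 (ISO 8859-1); "\xe9" is the byte for "é".
city <- "M\xe9xico"
Encoding(city) <- "latin1"

saveRDS(city, file = rds_path)
city_restored <- readRDS(rds_path)

# The encoding mark is stored with the string.
Encoding(city_restored)
```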

compression

One nice feature is that saveRDS() compresses the object by default (the compress argument is set to TRUE). So an rds file is much smaller than the equivalent csv file.

wc -c mtcars.csv
wc -c mtcars.rds
## 1847 mtcars.csv
## 1302 mtcars.rds

For a dataset this small the gain is modest (around 30%), but my 324 MB file shrinks to 10 MB.
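The same comparison can be done from R with file.size(). The sketch below also writes an uncompressed rds file to show what the compress argument changes. File paths are temporary and the exact sizes will depend on the data and the R version.

```r
# Sketch: compare on-disk sizes of csv, compressed rds and uncompressed rds.
cars_df <- mtcars

csv_path <- tempfile(fileext = ".csv")
rds_gzip <- tempfile(fileext = ".rds")
rds_raw  <- tempfile(fileext = ".rds")

write.csv(cars_df, file = csv_path)
saveRDS(cars_df, file = rds_gzip)                   # compress = TRUE by default
saveRDS(cars_df, file = rds_raw, compress = FALSE)  # no compression

# Sizes in bytes, named for readability.
sizes <- file.size(c(csv_path, rds_gzip, rds_raw))
names(sizes) <- c("csv", "rds_gzip", "rds_uncompressed")
sizes
```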

It is a big gain. I made git ignore the csv and xls files to save time and storage on Framagit: it is a free service provided by a non-profit, so I prefer not to abuse it. But a 10 MB file is OK, I guess, and git can track it, so I can transfer data to my colleagues this way.

code tightness

As a young data scientist, short code is not my priority, as longer code is sometimes easier to understand.

But in this case, saveRDS() and readRDS() are quite simple: they don’t need a long list of parameters, unlike loading my big csv file with 46 variables (most of them factors). So it is a nice plus.

saveRDS() or save()

As more experienced people recommend saveRDS() over save() for safety reasons, I will use it. An object written with save() is restored by load() under its original name, so if you already have an object with that name, it is overwritten. With saveRDS(), readRDS() returns the object and you assign it to whatever name you like.
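Here is a small sketch of that difference. The scores vector and the file paths are made up for the illustration.

```r
# Sketch: load() overwrites by name, readRDS() lets you choose the name.
rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

scores <- c(1, 2, 3)
save(scores, file = rdata_path)
saveRDS(scores, file = rds_path)

# Later in the session, the same name holds something else.
scores <- c(10, 20, 30)

# readRDS() returns the object, so it can be bound to any name.
old_scores <- readRDS(rds_path)

# load() restores the object under its saved name and silently
# replaces the current value of scores.
load(rdata_path)
scores
```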

For the moment, I don’t have to save multiple objects, so saving them one by one is fine for me. If you want to save multiple objects, consider putting them into a list with list() and saving the list. You can then reload the list and get each object from it.
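As a sketch of the list approach, using two built-in datasets as stand-ins for real objects (the element names cars and flowers are arbitrary):

```r
# Sketch: bundle several objects in a named list and save it as one rds file.
bundle <- list(cars = mtcars, flowers = iris)

rds_path <- tempfile(fileext = ".rds")
saveRDS(bundle, file = rds_path)

# Reload the list and pull out the objects by name.
restored <- readRDS(rds_path)
names(restored)
head(restored$flowers)
```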

Conclusion

I wish I had known this earlier. It would have saved me a lot of time juggling big files, datatypes and encodings. So from now on I’ll use saveRDS() to save objects.

Until I find something better of course ;)