The fst package for R provides a fast, easy and flexible way to serialize data frames. It has very amazing features, such as fast read and write of R data frames, super file compression and parse data frames without reading it. Considering all these features, now tidyfst could provide a new workflow to manipulate data more efficiently. The core idea is: We never need the whole data all at once, we only need the things we want and aggregate them to get the summary to provide target information.
tidyfst have provided the following functions to facilitate the workflow:
fst::read_fstbut always return a data.table
fst::write_fstbut always use largest compress factor (which yields smallest file)
In such a workflow, you never need to read the whole data.frame into your RAM, you just select the target data, process them instantly and get the results all at once. You do not have to read the data to know the structure of data.frame, because we have
parse_fst(a wrapper for
fst in fst package). Now let’s give it a try.
library(tidyfst) # Generate some random data frame with 10 million rows and various column types <- 1e7 nr_of_rows <- data.frame( df Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE), Integer = sample(1L:100L, nr_of_rows, replace = TRUE), Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE), Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE)) ) # write the fst file, make sure you do not have the file with same name in the directory export_fst(df,"fst_test.fst") # remove all variables in the environment rm(list = ls())
Now, we want to know the information in the data frame.
parse_fst("fst_test.fst") -> ft ft
If we want to get the information in the
Factor column, use:
%>% ft select_fst(Factor) %>% count_dt(Factor) -> factor_info factor_info
If we want to calculate the mean of
Integer by the group of
%>% ft select_fst(Integer,Factor) %>% summarise_dt(avg = mean(Integer),by = Factor) -> avg_info avg_info
In this workflow, we only select/filter/slice the data we need, and get the results directly from the pipeline. Therefore, we read the minimum needed data into RAM and release it and save only the results we want. This workflow could save memory for many exploratory big data analysis. Last, let’s delete the output file:
# delete the output file unlink("fst_test.fst")
After (>=) version 0.9.3, tidyfst has also added a function
as_fst(), which can turn any data.frame into a fst table and saved the data in the temporary file. This means that we might never have to save the object in the RAM ever (as long as it is a data.frame)! A small example:
%>% as_fst() -> iris_fst iris %>% as_fst() -> mtcars_fst mtcars iris_fstmtcars_fst
So when you have generated a pretty large data.frame and do not want it to consume the cache in your computer, just save it and read it when needed using