The R language is often perceived as a language for statisticians and data scientists. Quite a long time ago, that was mostly true. Over the years, however, the flexibility R provides via packages has turned it into a more general-purpose language. R was open sourced in 1995, and its package repositories have been growing ever since. Still, compared to languages like Python, R remains strongly centered around data.
Speaking of data, tabular data deserves particular attention, as it is one of the most commonly used data types. It corresponds to the table structure known from databases, where each column can be of a different type, and the processing performance of this data type is a crucial factor for many applications.
In this article, we are going to show how to transform tabular data efficiently in R. Many people who already use R for machine learning are not aware that data munging can be done faster in R, and that they do not need another tool for it.
High-performance Solution in R
Base R introduced the data.frame class in 1997, based on the earlier S-PLUS design. Unlike commonly used databases, which store data row by row, an R data.frame stores the data in memory as a column-oriented structure, making it more cache-efficient for the column operations that are common in analytics. Additionally, even though R is a functional programming language, it does not enforce that paradigm on the developer. Both opportunities have been well addressed by the data.table R package, which is available in the CRAN repository. It performs grouping operations remarkably fast, and it is particularly memory-efficient by being careful about materializing intermediate data subsets, for example materializing only the columns necessary for a given task. It also avoids unnecessary copies through its reference semantics when adding or updating columns. The first version of the package was published in April 2006, significantly improving data.frame performance at that time. The initial package description was:
This package does very little. The only reason for its existence is that the white book specifies that data.frame must have rownames. This package defines a new class data.table which operates just like a data.frame, but uses up to 10 times less memory, and can be up to 10 times faster to create (and copy). It also takes the opportunity to allow subset() and with() like expressions inside the []. Most of the code is copied from base functions with the code manipulating row.names removed.
Since then, both the data.frame and data.table implementations have been improved, but data.table remains dramatically faster than base R. In fact, data.table isn't just faster than base R; it appears to be one of the fastest open-source data wrangling tools available, competing with tools such as Python pandas and columnar storage databases or big data applications like Spark. Its performance over distributed, shared infrastructure hasn't been benchmarked yet, but being able to handle up to two billion rows on a single instance gives promising prospects. Outstanding performance goes hand in hand with its functionality. Additionally, with recent efforts at parallelizing time-consuming parts for incremental performance gains, the direction for pushing the performance limit further seems quite clear.
Data Transformation Examples
Learning R is a little easier thanks to the fact that it works interactively, so we can follow the examples step by step and look at the result of each step at any time. Before we start, let's install the data.table package from the CRAN repository.
install.packages("data.table")
Useful hint: We can open the manual of any function just by typing its name with a leading question mark, e.g. ?install.packages.
Loading Data into R
There are tons of packages for extracting data from a wide range of formats and databases, often including native drivers. We will load data from a CSV file, the most common format for raw tabular data. The file used in the following examples can be found here. We don't have to worry about CSV reading performance, as the fread function is highly optimized for it.
In order to use any function from a package, we need to load it with a library call.
library(data.table)
DT <- fread("flights14.csv")
print(DT)
## year month day dep_delay arr_delay carrier origin dest air_time
## 1: 2014 1 1 14 13 AA JFK LAX 359
## 2: 2014 1 1 -3 13 AA JFK LAX 363
## 3: 2014 1 1 2 9 AA JFK LAX 351
## 4: 2014 1 1 -8 -26 AA LGA PBI 157
## 5: 2014 1 1 2 1 AA JFK LAX 350
## ---
## 253312: 2014 10 31 1 -30 UA LGA IAH 201
## 253313: 2014 10 31 -5 -14 UA EWR IAH 189
## 253314: 2014 10 31 -8 16 MQ LGA RDU 83
## 253315: 2014 10 31 -4 15 MQ LGA DTW 75
## 253316: 2014 10 31 -5 1 MQ LGA SDF 110
## distance hour
## 1: 2475 9
## 2: 2475 11
## 3: 2475 19
## 4: 1035 7
## 5: 2475 13
## ---
## 253312: 1416 14
## 253313: 1400 8
## 253314: 431 11
## 253315: 502 11
## 253316: 659 8
If our data is not well modeled for further processing and needs to be reshaped from long to wide or from wide to long format (also known as pivot and unpivot), we may look at the ?dcast and ?melt functions, known from the reshape2 package. However, data.table implements faster and more memory-efficient methods for the data.table/data.frame classes.
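As a minimal sketch of these two operations (the tiny wide table below is illustrative, not part of the flights data):
wide <- data.table(id = 1:2, dep_delay = c(14, -3), arr_delay = c(13, 13))
long <- melt(wide, id.vars = "id",
             measure.vars = c("dep_delay", "arr_delay"),
             variable.name = "delay_type", value.name = "minutes")  # wide to long
dcast(long, id ~ delay_type, value.var = "minutes")                 # back to wide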
Querying with data.table Syntax
If You’re Familiar with data.frame
Querying a data.table is very similar to querying a data.frame. When filtering in the i argument, we can use column names directly, without needing to access them with the $ sign as in df[df$col > 1, ]. In the next argument, j, we provide an expression to be evaluated within the scope of our data.table. To pass a non-expression j argument, use with=FALSE. The third argument, which is not present in the data.frame method, defines the groups, so that the expression in j is evaluated by group.
# data.frame
DF[DF$col1 > 1L, c("col2", "col3")]
# data.table
DT[col1 > 1L, .(col2, col3)]        # by group: DT[col1 > 1L, .(col2, col3), by = col4]
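The with=FALSE form mentioned above is handy when the column names live in a character vector rather than being typed literally; a small sketch continuing the generic DT and column names from the comparison above:
cols <- c("col2", "col3")
DT[col1 > 1L, cols, with = FALSE]   # j is treated as column names, not an expression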
If You’re Familiar with Databases
Querying a data.table corresponds in many respects to SQL queries, which more people might be familiar with. DT below represents the data.table object and corresponds to SQL's FROM clause.
DT[ i = where,
j = select | update,
by = group by]
[ having, ... ]
[ order by, ... ]
[ ... ] ... [ ... ]
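As a rough sketch of this correspondence on the flights data loaded earlier (the aggregate and column choices are only illustrative):
# SQL:  SELECT origin, AVG(dep_delay) AS mean_dep_delay
#       FROM DT WHERE month = 6 GROUP BY origin
DT[month == 6, .(mean_dep_delay = mean(dep_delay)), by = origin]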
Sorting Rows and Re-Ordering Columns
Sorting data is a crucial transformation for time series, and it is also important for data extracts and presentation. Sorting can be achieved by providing an integer vector of row order to the i argument, the same way as with data.frame. The first argument in the query, order(carrier, -dep_delay), will select data in ascending order on the carrier field and descending order on the dep_delay measure. The second argument, j, as described in the previous section, defines the columns (or expressions) to be returned and their order.
ans <- DT[order(carrier, -dep_delay),
.(carrier, origin, dest, dep_delay)]
head(ans)
## carrier origin dest dep_delay
## 1: AA EWR DFW 1498
## 2: AA JFK BOS 1241
## 3: AA EWR DFW 1071
## 4: AA EWR DFW 1056
## 5: AA EWR DFW 1022
## 6: AA EWR DFW 989
To re-order data by reference, instead of querying the data in a specific order, we use the set* functions.
setorder(DT, carrier, -dep_delay)
leading.cols <- c("carrier","dep_delay")
setcolorder(DT, c(leading.cols, setdiff(names(DT), leading.cols)))
print(DT)
## carrier dep_delay year month day arr_delay origin dest air_time
## 1: AA 1498 2014 10 4 1494 EWR DFW 200
## 2: AA 1241 2014 4 15 1223 JFK BOS 39
## 3: AA 1071 2014 6 13 1064 EWR DFW 175
## 4: AA 1056 2014 9 12 1115 EWR DFW 198
## 5: AA 1022 2014 6 16 1073 EWR DFW 178
## ---
## 253312: WN -12 2014 3 9 -21 LGA BNA 115
## 253313: WN -13 2014 3 10 -18 EWR MDW 112
## 253314: WN -13 2014 5 17 -30 LGA HOU 202
## 253315: WN -13 2014 6 15 10 LGA MKE 101
## 253316: WN -13 2014 8 19 -30 LGA CAK 63
## distance hour
## 1: 1372 7
## 2: 187 13
## 3: 1372 10
## 4: 1372 6
## 5: 1372 7
## ---
## 253312: 764 16
## 253313: 711 20
## 253314: 1428 17
## 253315: 738 20
## 253316: 397 16
Most often, we don't need both the original dataset and the ordered/sorted one. By default, the R language, like other functional programming languages, returns the sorted data as a new object, and thus requires twice as much memory as sorting by reference.
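A minimal sketch of the difference, using data.table's copy() and address() helpers on a throwaway copy so the DT from above is left untouched:
tmp <- copy(DT)               # deep copy, just for this illustration
address(tmp)                  # memory address before re-ordering
setorder(tmp, dep_delay)      # re-orders tmp by reference, no copy made
address(tmp)                  # same address as before
ans <- tmp[order(dep_delay)]  # returns a new, sorted object instead
address(ans)                  # a different address: a second copy in memory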
Subset Queries
Let’s create a subset for flight origin “JFK” and months 6 through 9. In the second argument, we subset the results to the listed columns, adding one calculated variable, sum_delay.
ans <- DT[origin == "JFK" & month %in% 6:9,
.(origin, month, arr_delay, dep_delay, sum_delay = arr_delay + dep_delay)]
head(ans)
## origin month arr_delay dep_delay sum_delay
## 1: JFK 7 925 926 1851
## 2: JFK 8 727 772 1499
## 3: JFK 6 466 451 917
## 4: JFK 7 414 450 864
## 5: JFK 6 411 442 853
## 6: JFK 6 333 343 676
By default, when subsetting a dataset on a single column, data.table will automatically create an index for that column. This results in real-time answers on any further filtering calls on that column.
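A small sketch of this behavior, assuming auto-indexing is enabled (the datatable.auto.index option, on by default): after the first single-column subset, indices() reports the index that was built.
ans <- DT[dest == "LAX"]  # first == subset on dest builds a secondary index
indices(DT)               # "dest" is now listed; later filters on dest reuse it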
Update Dataset
Adding a new column by reference is performed using the := operator; it assigns a variable into the dataset in place. This avoids an in-memory copy of the dataset, so we don't need to assign the result to a new variable.
DT[, sum_delay := arr_delay + dep_delay]
head(DT)
## carrier dep_delay year month day arr_delay origin dest air_time
## 1: AA 1498 2014 10 4 1494 EWR DFW 200
## 2: AA 1241 2014 4 15 1223 JFK BOS 39
## 3: AA 1071 2014 6 13 1064 EWR DFW 175
## 4: AA 1056 2014 9 12 1115 EWR DFW 198
## 5: AA 1022 2014 6 16 1073 EWR DFW 178
## 6: AA 989 2014 6 11 991 EWR DFW 194
## distance hour sum_delay
## 1: 1372 7 2992
## 2: 187 13 2464
## 3: 1372 10 2135
## 4: 1372 6 2171
## 5: 1372 7 2095
## 6: 1372 11 1980
To add more variables at once, we can use the DT[, `:=`(sum_delay = arr_delay + dep_delay)] syntax, similar to .(sum_delay = arr_delay + dep_delay) when querying from the dataset.
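For instance, a sketch adding two derived columns in one call (speed and gain are illustrative calculations on existing columns, not part of the original dataset):
DT[, `:=`(speed = distance / (air_time / 60),  # miles per hour
          gain  = dep_delay - arr_delay)]      # minutes recovered in the air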
It is possible to sub-assign by reference, updating only particular rows in place, just by combining the := operator with the i argument.
DT[origin=="JFK",
distance := NA]
head(DT)
## carrier dep_delay year month day arr_delay origin dest air_time
## 1: AA 1498 2014 10 4 1494 EWR DFW 200
## 2: AA 1241 2014 4 15 1223 JFK BOS 39
## 3: AA 1071 2014 6 13 1064 EWR DFW 175
## 4: AA 1056 2014 9 12 1115 EWR DFW 198
## 5: AA 1022 2014 6 16 1073 EWR DFW 178
## 6: AA 989 2014 6 11 991 EWR DFW 194
## distance hour sum_delay
## 1: 1372 7 2992
## 2: NA 13 2464
## 3: 1372 10 2135
## 4: 1372 6 2171
## 5: 1372 7 2095
## 6: 1372 11 1980