**If there is one thing you learn soon as a data scientist, it is that problem solving gets an extra dimension as the data volume grows. One typical example is building recommendation engines.**

**A very basic form of a recommendation engine can be built using just a simple matrix algebra. But the situation quickly changes when analyzing data about many thousands customers, who are buying or rating several hundreds of products, which generates large and so called sparse data sets (where a lot of customer-item combinations do not have any value assigned). There are two main problems to overcome – how to store these large sparse matrices and how to run quick calculations over them.**

**In the following post, I will describe how to approach this problem in R using the package Matrix. The package allows to store large matrices in R’s virtual memory, supports standard matrix operations (transpose, matrix multiplication, element-wise multiplication etc.) and also provides a nice toolkit to develop new custom functions needed for recommendation engines as well as for other applications (here or here) where sparse matrices are used.**

## Sparse matrices and their application

As mentioned above, there are two main reasons for storing sparse matrix data in a suitable format. First, significant savings in storage space can be achieved. The chart below compares size of matrices stored in the standard way and as a sparse matrix. 10000*10000 matrix with 10% of non-zero values takes about 6.7 times less memory when stored specifically as a sparse matrix.

Secondly, significant savings in computing time can be achieved. The picture below illustrates a simple example of multiplication of a sparse matrix X with its transposition. In case when X is treated as a dense matrix 5*5*4=100 multiplications operations and 4*4*3=48 summations are needed to calculate the matrix product. When treated as a sparse matrix only 8 multiplications and 2 summations are required.

The chart bellow compares computing efficiency of using sparse matrices against the standard dense format in case of multiplication of two randomly generated matrices (sparsity = 0.1) for different sizes of the matrices. Note that the y scale is logarithmic.

The code below was used to generate comparison plots above.

```{r} library(Matrix) library(ggplot2) library(reshape2) ``` ```{r} set.seed(123) # memory_df <- data.frame(columns = 10^(1:6), # rows = 10^(1:6), # non_zero_entries = 0) memory_df <- data.frame(columns = 10^(2:4), rows = 10^(2:4)) # non_zero_entries = 0) memory_df['non_zero_entries'] <- (memory_df$columns * memory_df$rows) * 0.1 memory_df$sparse <- NA memory_df$standard <- NA for (i in 1:nrow(memory_df)){ sparse_matrix <- rsparsematrix(memory_df$rows[i], memory_df$columns[i], nnz = memory_df$non_zero_entries[i]) memory_df$sparse[i] <- object.size(sparse_matrix) rm(sparse_matrix) try(standard_matrix <- matrix(0, memory_df$rows[i], memory_df$columns[i])) memory_df$standard[i] <- ifelse(exists('standard_matrix'), object.size(standard_matrix), 0) rm(standard_matrix) } memory_plot <- melt(memory_df[, c(1, 4, 5)], id.vars=c('columns')) gg_memory_plot <- ggplot(memory_plot, aes(x=as.factor(columns), y=value, fill=variable)) + geom_bar(stat="identity", position=position_dodge()) + # scale_y_log10() + ggtitle('Comparison of memory size for standard and \n sparse (sparsity = 0.1) square matrices') + # ylab('Size [Bytes]; log-scale!') + ylab('Size [Bytes]') + xlab('Number of columns (and rows) in a square matrix') ggsave("gg_memory_plot.png", gg_memory_plot) gg_memory_plot ``` # Speed of matrix multiplication ```{r} set.seed(123) speed_df <- data.frame(columns = 10^(2:4), rows = 10^(2:4)) # non_zero_entries = 10^(2:4)) speed_df['non_zero_entries'] <- (speed_df$columns * speed_df$rows) * 0.1 speed_df$sparse_user_time <- NA speed_df$sparse_system_sime <- NA speed_df$sparse_elapsed_time <- NA speed_df$standard_user_time <- NA speed_df$standard_system_sime <- NA speed_df$standard_elapsed_time <- NA for (i in 1:nrow(speed_df)){ sparse_matrix1 <- rsparsematrix(speed_df$rows[i], speed_df$columns[i], nnz = speed_df$non_zero_entries[i]) sparse_matrix2 <- rsparsematrix(speed_df$rows[i], speed_df$columns[i], nnz = speed_df$non_zero_entries[i]) system_time <- system.time(sparse_matrix1 %*% sparse_matrix2) speed_df$sparse_user_time[i] <- system_time[1] speed_df$sparse_system_sime[i] <- system_time[2] speed_df$sparse_elapsed_time[i] <- system_time[3] rm(sparse_matrix1, sparse_matrix2) standard_matrix1 <- matrix(0, speed_df$rows[i], speed_df$columns[i]) standard_matrix2 <- matrix(0, speed_df$rows[i], speed_df$columns[i]) system_time <- system.time(standard_matrix1 %*% standard_matrix2) speed_df$standard_user_time[i] <- system_time[1] speed_df$standard_system_sime[i] <- system_time[2] speed_df$standard_elapsed_time[i] <- system_time[3] rm(standard_matrix1, standard_matrix2) } speed_df speed_plot <- melt(speed_df[, c(1, 6, 9)], id.vars=c('columns')) gg_speed_plot <- ggplot(speed_plot, aes(x=as.factor(columns), y=value, fill=variable)) + geom_bar(stat="identity", position=position_dodge()) + scale_y_log10(breaks = breaks) + ggtitle('Comparison of calculation speed for standard and \n sparse (sparsity = 0.1) square matrices multiplication') + ylab('Time [s]; log-scale!') + xlab('Number of columns (and rows) in a square matrix') ggsave("gg_speed_plot.png", gg_speed_plot) gg_speed_plot ```

## Basic manipulation with sparse matrices in the Matrix package

In this part we show some basic matrix manipulation utilizing Matrix package, which is one of the available packages designed to work with sparse matrices. In the code below we define a simple matrix, show how it can be converted into a sparse matrix and then do some basic manipulation with it. Note that the classes from the Matrix matrix package support basic matrix algebra operations and therefore you can work with them as with normal matrices.

# generate a matrix M0 <- matrix(0, nrow = 4, ncol = 5) M0[1,1] <- 5 M0[3,1] <- 7 M0[4,2] <- 4 M0[2,3] <- 1 M0[3,5] <- 1 M0[4,5] <- 1 # load the Matrix package library(Matrix) # convert it to sparse matrix M <- as(M0, "sparseMatrix") M #>4 x 5 sparse Matrix of class "dgCMatrix" #> #>[1,] 5 . . . . #>[2,] . . 1 . . #>[3,] 7 . . . 1 #>[4,] . 4 . . 1 M %*% t(M) #>4 x 4 sparse Matrix of class "dgCMatrix" #> #>[1,] 25 . 35 . #>[2,] . 1 . . #>[3,] 35 . 50 1 #>[4,] . . 1 17 M + 10 #>4 x 5 Matrix of class "dgeMatrix" #> #> [,1] [,2] [,3] [,4] [,5] #>[1,] 15 10 10 10 10 #>[2,] 10 10 11 10 10 #>[3,] 17 10 10 10 11 #>[4,] 10 14 10 10 11

## How sparse matrices are stored using the Matrix package

When you print the matrix by simply typing `M`

in your console, beside the matrix itself, you obtain also the information that it is a *4 x 5 sparse Matrix of class “dgCMatrix”*. Class details can be of course obtained also by using `class(M)`

. You will encounter variety of other classes when working with the Matrix package. If you wonder about the reason for such a number of classes, try to keep in mind that the main purpose of working with matrices in the sparse format is the calculation efficiency and therefore different methods might be suitable for different kinds of sparse matrices.

**A small detour to the S4 object-oriented model**

There are several object-oriented (OO) systems in R (see this great tutorial on the topic from Hadley Wickham). Most of the packages on CRAN use **S3** OO system. The Matrix package however uses the **S4** OO system, which has several practical consequences, which I will mention later. For the moment, it is important to note that:

- In S3
`$`

is used for accessing*values*of an object. - In S4
`@`

is used for accessing*slots*of an object.

**Obtaining help**

The help page might be another a bit confusing thing for those not familiar with the S4 OO system. You can access the help page by `?"dgCMatrix-class"`

. Under `Slots`

you can find what is the meaning of `.@x`

, but for all other slots you have to follow the link to “CsparseMatrix” as all other slots are inherited from this superclass (i.e. these other slots are common to the `dgCMatrix`

class as well as other classes). The following list summarizes slots available for an object of the `dgCMatrix`

class. Note that for both `.@i`

and `.@p`

zero-based numbering is used.

– ** M@x**: values of non-zero elements of the matrix.

–

**: positions within each column where you can find the non-zero elements.**

`M@i`

–

**: for each column initial position of that column within both vectors**

`M@p`

`M@x`

and `M@p`

.–

**: dimensions of the matrix.**

`M@dim`

M #>4 x 5 sparse Matrix of class "dgCMatrix" #> #>[1,] 5 . . . . #>[2,] . . 1 . . #>[3,] 7 . . . 1 #>[4,] . 4 . . 1 M@x #>[1] 5 7 4 1 1 1 M@i #>[1] 0 2 3 1 2 3 M@p #>[1] 0 2 3 4 4 6 M@Dim #>[1] 4 5 diff(M@p) #>[1] 2 1 1 0 2

You can find the actual number of non-zero items in columns using `diff(M@p)`

. Note that this follows from the fact that the same numbers will appear in the `M@p`

vector for two consecutive columns when there is no non-zero item in the first of the two columns.

## Sparse matrix representation in the Matrix package

In this part, I will show, how the elements of a sparse matrix can be accessed using three different representations of the matrix that the Matrix package allows.

### Standard dense matrix representation

Here the standard slicing is used to access a column.

# pick a column column <- 1 # extract the column M[,column] #>[1] 5 0 7 0 # extract non-zero elements and their indices M[,column][M[,column] != 0] #>[1] 5 7 which(M[,column] != 0) #>[1] 1 3

### Long format representation (also Coordinate list (COO))

In the long format, elements of a matrix are stored as *row number – column number – value* triplets. This representation can be accessed using the `summary`

method.

# print the matrix in the long format representation summary(M) #>4 x 5 sparse Matrix of class "dgCMatrix", with 6 entries #> i j x #>1 1 1 5 #>2 3 1 7 #>3 4 2 4 #>4 2 3 1 #>5 3 5 1 #>6 4 5 1 # extract non-zero elements and their indices summary(M)$x[summary(M)$j == column] #>[1] 5 7 summary(M)$i[summary(M)$j == column] #>[1] 1 3

### Compressed column-oriented representation (similar to Compressed sparse column (CSC or CCS))

The last representation is actually how the matrix is stored in the `dgCMatrix`

sparse matrix class and uses slots described in the first part of this post. Although the code for slicing might look a bit more complicated compared to previous two representations, this way of storing data is the one suitable for sparse matrix operations from the computing point of view.

# get number of non-zero items in the column items_in_column <- diff(M@p)[column] items_in_column #>[1] 2 # get indices of items within the column (add 1 due to zero-based numbering) if(items_in_column != 0){ M@i[ (M@p[column]+1):(M@p[column] + items_in_column) ] + 1 }else{ NA } #>[1] 1 3 # get values of items within the column if(items_in_column != 0){ M@x[ (M@p[column]+1):(M@p[column] + items_in_column) ] }else{ NA } #>[1] 5 7