Coming from a Matlab background, I found sparse matrices to be easy to use and well integrated into the language. However, when transitioning to python’s scientific computing ecosystem, I had a harder time using sparse matrices. This post is intended to help Matlab refugees and those interested in using sparse matricies in python (in particular, for clustering)

Requirements:

scikit-learn (2.10+)

**numpy (refer to scikit-learn version requirements)**

scipy (refer to scikit-learn version requirements)

**Sparse Matrix Types:**

There are six types of sparse matrices implemented under scipy:

- bsr_matrix -- block sparse row matrix
- coo_matrix -- sparse matrix in coordinate format
- csc_matrix -- compressed sparse column matrix
- csr_matrix -- compressed sparse row matrix
- dia_matrix -- sparse matrix with diagonal storage
- dok_matrix -- dictionary of keys based sparse matrix
- lil_matrix -- row-based linked list sparse matrix

For more info see:

**(http://docs.scipy.org/doc/scipy/reference/sparse.html):**

**When to use which matrix:**The following are scenarios when you would want to choose one sparse matrix type over the another:*Fast Arithmetic Operation*: csc_matrix, csr_matrix*Fast Column Slicing (e.g., A[:, 1:2]): *csc_matrix*Fast Row Slicing (e.g., A[1:2, :]) *csr_matrix*Fast Matrix vector products: *csr_matrix, bsr_matrix, csc_matrix*Fast Changing of sparsity (e.g., adding entries to matrix): *lil_matrix, dok_matrix*Fast conversion to other sparse formats: *coo_matrix*Constructing Large Sparse Matrices: *coo_matrix

**Clustering with scikit-learn:**

**With the release of scikit-learn 2.10, one of the useful new features is the support for sparse matrices with the k-means algorithm. The following is how you would use sparse matrices with k-means:**

**Full Matrix to Sparse Matrix**

**Example-1**

-----------------------------------------------------------------------------------------------

from numpy.random import random from scipy.sparse import *

from sklearn.cluster import KMeans

**# create a 30x1000 dense matrix random matrix. **

D = random((30,1000))

**# keep entries with value < 0.10 (10% of entries in matrix will be non-zero)**

**# X is a "full" matrix that is intrinsically sparse.**

X = D*(D<0.10)** # note: element wise mult**

**# convert D into a sparse matrix (type coo_matrix) **

**# note: we can initialize any type of sparse matrix. **

**# There is no particular motivation behind using **

**# coo_matrix for this example.**

S = coo_matrix(X)

labeler = KMeans(k=3)

**# convert coo to csr format **

**# note: Kmeans currently only works with CSR type sparse matrix**

labeler.fit(S.tocsr())

**# print cluster assignments for each row**

for (row, label) in enumerate(labeler.labels_):

print "row %d has label %d"%(row, label)

-----------------------------------------------------------------------------------------------

One of the issues with Example-1 is that we are constructing a sparse matrix from a full matrix. It will often be the case that we will not be able to fit a full (although intrinsically sparse) matrix in memory. For example, if the matrix X was a 100000x1000000000 full matrix, there could be some issues. One solution to this is to somehow extract out the non-zero entries of X and to use a smarter constructor for the sparse matrix.

**Sparse Matrix Construction**

In Example-2, we will assume that we have X's data stored on some file on disk. In particular, we will assume that X is stored in a csv file and that we are able to extract out the non-zero data efficiently.

**Example-2**

-------------------------------------------------

**import csv**

from scipy.sparse import *

from sklearn.cluster import KMeans

def extract_nonzero(fname):

"""

extracts nonzero entries from a csv file

input: fname (str) -- path to csv file

output: generator<(int, int, float)> -- generator

producing 3-tuple containing (row-index, column-index, data)

"""

for (rindex,row) in enumerate(csv.reader(open(fname))):

for (cindex, data) in enumerate(row):

if data!="0":

yield (rindex, cindex, float(data))

def get_dimensions(fname):

"""

determines the dimension of a csv file

input: fname (str) -- path to csv file

output: (nrows, ncols) -- tuple containing row x col data

"""

rowgen = (row for row in csv.reader(open(fname)))

# compute col size

colsize = len(rowgen.next())

# compute row size

rowsize = 1 + sum(1 for row in rowgen)

return (rowsize, colsize)

**# obtain dimensions of data**

(rdim, cdim) = get_dimensions("X.csv")

# allocate a lil_matrix of size (rdim by cdim)

# note: lil_matrix is used since we be modifying

# the matrix a lot.

S = lil_matrix((rdim, cdim))

# add data to S

for (i,j,d) in extract_nonzero("X.csv"):

S[i,j] = d

# perform clustering

labeler = KMeans(k=3)

# convert lil to csr format

# note: Kmeans currently only works with CSR type sparse matrix

labeler.fit(S.tocsr())

# print cluster assignments for each row

for (row, label) in enumerate(labeler.labels_):

print "row %d has label %d"%(row, label)

--------------------------------------------------

**What to do when Sparse Matrices aren't supported:**

When sparse matrices aren't supported, one solution is to convert the matrix to a full matrix. To do this, simply invoke the todense() method.