I'm using the sparse_dot_topn library created by the data scientists at ING to search for near-duplicates in a large set of company names (nearly 1.5M records). A recent update of this library makes it possible to use multiple threads to compute the cross-product (i.e., the cosine similarity) between the two matrices. I ran a quick benchmark and the performance improvement is significant (depending on how many cores are available on one's machine or remote server):
+-----------+-----------------+
| # threads | time (MM:SS.cc) |
+-----------+-----------------+
|    32     |    03:43.12     |
+-----------+-----------------+
|    16     |    05:16.97     |
+-----------+-----------------+
|     8     |    08:11.69     |
+-----------+-----------------+
|     4     |    13:32.72     |
+-----------+-----------------+
|     2     |    24:02.28     |
+-----------+-----------------+
|     1     |    47:11.30     |
+-----------+-----------------+
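For reference, the matching step itself looks roughly like the sketch below. The TF-IDF matrix here is a random stand-in for my real character n-gram matrix and the parameter values are illustrative; awesome_cossim_topn with use_threads/n_jobs is the entry point in the library version I'm using, so check your installed version's API:

import scipy.sparse as sp
from sparse_dot_topn import awesome_cossim_topn

# stand-in for the real TF-IDF matrix: one row per company name,
# one column per character n-gram feature
tfidf_matrix = sp.random(10_000, 500, density=0.01, format="csr")

matches = awesome_cossim_topn(
    tfidf_matrix,                      # left-hand side (CSR)
    tfidf_matrix.transpose().tocsr(),  # compare the names against themselves
    ntop=10,                           # keep only the 10 best matches per name
    lower_bound=0.8,                   # drop pairs with similarity below 0.8
    use_threads=True,                  # enable the multi-threaded code path
    n_jobs=8,                          # number of threads (cf. benchmark above)
)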
In order to explore the outcome easily, I needed to unpack the resulting sparse matrix. Fortunately, I found the following helper function written by Chris van den Berg, which does exactly that (link to Chris's blog post here):
import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    # either unpack the first `top` matches or everything
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    # map each non-zero entry back to the pair of names it relates
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame(
        {"left_side": left_side, "right_side": right_side, "similairity": similairity}
    )
The above function has an optional argument to look only at the first n values, but I have to run it on the full data. My issue at the moment is that this takes a very long time to complete (roughly one hour).

Q: How could I increase the performance here (if possible)? Especially since I have quite a few cores that are not being used for this job.
I'm no expert when it comes to performance tuning. One option I explored is Numba: I decorated the function with @njit(parallel=True) and used Numba's prange instead of range to mark the loop as parallelizable, but this failed. My understanding is that Numba can't handle string values (i.e., the company names in my case).
Any help on possible approaches to increase the performance would be greatly appreciated.
I assume the sparse_matrix is a correlation matrix, so sparse_matrix is symmetric.

First, create a name_vector and a sparse_matrix to work with:
import string

import numpy as np
import pandas as pd

N = 10

# create an array of names
name_vector = np.array(list(string.ascii_lowercase)[:N])

# create a correlation matrix (which is obviously symmetric)
sparse_matrix = np.random.rand(N, N)
sparse_matrix = (sparse_matrix + sparse_matrix.T) / 2

# zero out roughly half of the entries; symmetrize the mask
# so the matrix stays symmetric after masking
zeros_mask = np.random.rand(N, N) < 0.5
zeros_mask = zeros_mask | zeros_mask.T
sparse_matrix[zeros_mask] = 0.
As you can see, the name_vector is an array

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='<U1')

which corresponds to the names of 10 companies. sparse_matrix is symmetric by construction, and some of its entries are set to 0 by sparse_matrix[zeros_mask] = 0.
With these two ingredients, here is my solution:

top = None

non_zeros = sparse_matrix.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]
# flat (1-D) indices of the non-zero entries, for np.take
sparse_idx = sparserows * sparse_matrix.shape[1] + sparsecols

if top:
    nr_matches = top
else:
    nr_matches = sparsecols.size

# advanced indexing replaces the Python-level for loop
left_side = name_vector[sparserows[:nr_matches]]
right_side = name_vector[sparsecols[:nr_matches]]
similairity = np.take(sparse_matrix, sparse_idx[:nr_matches])

pd.DataFrame({"left_side": left_side,
              "right_side": right_side,
              "similairity": similairity})
and the resulting DataFrame looks like this:
left_side right_side similairity
0 a c 0.760297
1 a d 0.441365
2 a g 0.669365
3 b a 0.221993
4 b c 0.840993
Since advanced indexing is used instead of a for loop, this will be much faster.
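If you want to see the gap for yourself, here is a quick, self-contained timing sketch comparing a Python-level loop against a single vectorized gather (the numbers will vary by machine):

import time

import numpy as np

n = 1_000_000
values = np.random.rand(n)
idx = np.random.randint(0, n, size=n)

# Python-level loop, element by element (as in the original helper)
start = time.perf_counter()
out_loop = np.zeros(n)
for i in range(n):
    out_loop[i] = values[idx[i]]
loop_time = time.perf_counter() - start

# advanced indexing: one vectorized gather in compiled code
start = time.perf_counter()
out_vec = values[idx]
vec_time = time.perf_counter() - start

assert np.array_equal(out_loop, out_vec)
print(f"loop: {loop_time:.2f}s  advanced indexing: {vec_time:.4f}s")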
Without some examples I can't be sure this is exactly what you're looking for, but I think it is. I'm confused about the top argument in your example, because it just takes the first n results rather than the n results with the largest values.
import random
import string

import pandas as pd
from scipy import sparse

# random sparse matrix in COO format (exposes row, col and data arrays)
arr = sparse.random(100, 100, density=0.02).tocoo()

# one random 6-character name per row/column
name_vec = pd.Series(''.join(random.choice(string.ascii_uppercase + string.digits)
                             for _ in range(6))
                     for _ in range(100))

pd.DataFrame({"left_side": name_vec[arr.row].tolist(),
              "right_side": name_vec[arr.col].tolist(),
              "similairity": arr.data})
In terms of runtime, you might be able to clean this up further by avoiding the Series -> list -> Series round trip.
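For instance, continuing from the snippet above, something along these lines pulls the names out as a NumPy array once, and also selects the n largest similarities rather than the first n, in case that is what top was meant to do:

import numpy as np

# one conversion instead of per-column Series -> list round trips
names = name_vec.to_numpy()

# indices of the n largest similarity values
top = 10
order = np.argsort(arr.data)[::-1][:top]

pd.DataFrame({"left_side": names[arr.row[order]],
              "right_side": names[arr.col[order]],
              "similairity": arr.data[order]})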