I'm using the sparse_dot_topn library created by the data scientists at ING to search for near-duplicates in a large set of company names (nearly 1.5M records). A recent update of this library makes it possible to use multiple threads to compute the cross-product (i.e., the cosine similarity) between the two matrices. I ran a quick benchmark and the performance improvement is significant (depending on how many cores are available on one's machine or remote server):
+-----------+-----------------+
| # threads | time (MM:SS.cc) |
+-----------+-----------------+
|    32     |    03:43.12     |
+-----------+-----------------+
|    16     |    05:16.97     |
+-----------+-----------------+
|     8     |    08:11.69     |
+-----------+-----------------+
|     4     |    13:32.72     |
+-----------+-----------------+
|     2     |    24:02.28     |
+-----------+-----------------+
|     1     |    47:11.30     |
+-----------+-----------------+
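For reference, the matching step itself looks roughly like the sketch below. The TF-IDF matrix here is a random stand-in for my real character n-gram matrix and the parameter values are illustrative; awesome_cossim_topn with use_threads/n_jobs is the entry point in the library version I'm using, so check your installed version's API:

import scipy.sparse as sp
from sparse_dot_topn import awesome_cossim_topn

# stand-in for the real TF-IDF matrix: one row per company name,
# one column per character n-gram feature
tfidf_matrix = sp.random(10_000, 500, density=0.01, format="csr")

matches = awesome_cossim_topn(
    tfidf_matrix,                      # left-hand side (CSR)
    tfidf_matrix.transpose().tocsr(),  # compare the names against themselves
    ntop=10,                           # keep only the 10 best matches per name
    lower_bound=0.8,                   # drop pairs with similarity below 0.8
    use_threads=True,                  # enable the multi-threaded code path
    n_jobs=8,                          # number of threads (cf. benchmark above)
)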
In order to explore the outcome easily, I needed to unpack the resulting sparse matrix. Fortunately, I found the following helper function written by Chris van den Berg, which does exactly that (link to Chris's blog post here):
import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    # either unpack the first `top` matches or everything
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    # map each non-zero entry back to the pair of names it relates
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame(
        {"left_side": left_side, "right_side": right_side, "similairity": similairity}
    )
The above function has an optional argument to look only at the first n values, but I have to run it on the full data. My issue at the moment is that this takes a very long time to complete (roughly one hour).

Q: How could I increase the performance here (if possible)? Especially since I have quite a few cores that are not being used for this job.
I'm no expert when it comes to performance tuning. One option I explored is Numba: I decorated the function with @njit(parallel=True) and used Numba's prange instead of range to mark the loop as parallelizable, but this failed. My understanding is that Numba can't handle string values (i.e., the company names in my case).
Any help on possible approaches to increase the performance would be greatly appreciated.
I assume the sparse_matrix is a correlation matrix, so sparse_matrix is symmetric.

First, create a name_vector and a sparse_matrix to work with:
import string

import numpy as np
import pandas as pd

N = 10

# create an array of names
name_vector = np.array(list(string.ascii_lowercase)[:N])

# create a correlation matrix (which is obviously symmetric)
sparse_matrix = np.random.rand(N, N)
sparse_matrix = (sparse_matrix + sparse_matrix.T) / 2

# zero out roughly half of the entries; symmetrize the mask
# so the matrix stays symmetric after masking
zeros_mask = np.random.rand(N, N) < 0.5
zeros_mask = zeros_mask | zeros_mask.T
sparse_matrix[zeros_mask] = 0.
As you can see, the name_vector is an array

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='<U1')

which corresponds to the names of 10 companies. sparse_matrix is symmetric by construction, and some of its entries are set to 0 by sparse_matrix[zeros_mask] = 0.
With these two ingredients, here is my solution:

top = None

non_zeros = sparse_matrix.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]
# flat (1-D) indices of the non-zero entries, for np.take
sparse_idx = sparserows * sparse_matrix.shape[1] + sparsecols

if top:
    nr_matches = top
else:
    nr_matches = sparsecols.size

# advanced indexing replaces the Python-level for loop
left_side = name_vector[sparserows[:nr_matches]]
right_side = name_vector[sparsecols[:nr_matches]]
similairity = np.take(sparse_matrix, sparse_idx[:nr_matches])

pd.DataFrame({"left_side": left_side,
              "right_side": right_side,
              "similairity": similairity})
and the resulting DataFrame looks like this:
left_side right_side similairity
0 a c 0.760297
1 a d 0.441365
2 a g 0.669365
3 b a 0.221993
4 b c 0.840993
Since advanced indexing is used instead of a for loop, this will be much faster.
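If you want to see the gap for yourself, here is a quick, self-contained timing sketch comparing a Python-level loop against a single vectorized gather (the numbers will vary by machine):

import time

import numpy as np

n = 1_000_000
values = np.random.rand(n)
idx = np.random.randint(0, n, size=n)

# Python-level loop, element by element (as in the original helper)
start = time.perf_counter()
out_loop = np.zeros(n)
for i in range(n):
    out_loop[i] = values[idx[i]]
loop_time = time.perf_counter() - start

# advanced indexing: one vectorized gather in compiled code
start = time.perf_counter()
out_vec = values[idx]
vec_time = time.perf_counter() - start

assert np.array_equal(out_loop, out_vec)
print(f"loop: {loop_time:.2f}s  advanced indexing: {vec_time:.4f}s")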
Without some examples I can't be sure this is exactly what you're looking for, but I think it is. I'm confused about the top argument in your example, because it just takes the first n results rather than the n results with the largest values.
import random
import string

import pandas as pd
from scipy import sparse

# random sparse matrix in COO format (exposes row, col and data arrays)
arr = sparse.random(100, 100, density=0.02).tocoo()

# one random 6-character name per row/column
name_vec = pd.Series(''.join(random.choice(string.ascii_uppercase + string.digits)
                             for _ in range(6))
                     for _ in range(100))

pd.DataFrame({"left_side": name_vec[arr.row].tolist(),
              "right_side": name_vec[arr.col].tolist(),
              "similairity": arr.data})
In terms of runtime, you might be able to clean this up further by avoiding the Series -> list -> Series round trip.
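For instance, continuing from the snippet above, something along these lines pulls the names out as a NumPy array once, and also selects the n largest similarities rather than the first n, in case that is what top was meant to do:

import numpy as np

# one conversion instead of per-column Series -> list round trips
names = name_vec.to_numpy()

# indices of the n largest similarity values
top = 10
order = np.argsort(arr.data)[::-1][:top]

pd.DataFrame({"left_side": names[arr.row[order]],
              "right_side": names[arr.col[order]],
              "similairity": arr.data[order]})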