I want to calculate the cosine similarity between two lists, say list 1, which is dataSetI, and list 2, which is dataSetII.

Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The lists are always of equal length. I want to report the cosine similarity as a number between 0 and 1.

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
def cosine_similarity(list1, list2):
  # How to?
print(cosine_similarity(dataSetI, dataSetII))
                FYI this solution is significantly faster on my system than using scipy.spatial.distance.cosine.
– Ozzah
                Apr 17, 2019 at 23:39
                On my system, SciPy is roughly the same speed as this, for tiny/dummy use-cases. SciPy is optimised for efficiently comparing large lists of vectors. For large lists of vectors, SciPy is orders of magnitude faster
– jameslol
                Nov 18, 2022 at 4:18

You should try SciPy. It has a bunch of useful scientific routines, for example "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the fast, optimized NumPy for its number crunching. See the SciPy installation instructions to set it up.

Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.

from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
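
To make the distance/similarity relationship concrete, here is a minimal pure-Python sketch that does not depend on SciPy. The helper names `cosine_similarity` and `cosine_distance` are illustrative and are not SciPy's API; the distance is simply defined as one minus the similarity, following the convention described above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Illustrative helper: distance defined as 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
similarity = cosine_similarity(dataSetI, dataSetII)
distance = cosine_distance(dataSetI, dataSetII)
print(round(similarity, 4), round(distance, 4))  # 0.9723 0.0277
```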

You can use the cosine_similarity function from sklearn.metrics.pairwise (see the docs):

In [23]: from sklearn.metrics.pairwise import cosine_similarity
In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])
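
The result of -0.5 shows that cosine similarity ranges over [-1, 1] in general; it stays within [0, 1] only when all components are non-negative, as in the question's data. If a value in [0, 1] is required for arbitrary inputs, one common option (an assumption here, not something this answer prescribes) is the linear rescaling (1 + cos) / 2. A minimal sketch with hypothetical helper names:

```python
import math

def cosine(u, v):
    """Plain cosine similarity; may be negative for mixed-sign inputs."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rescale_to_unit_interval(c):
    """Hypothetical helper: map a value in [-1, 1] linearly onto [0, 1]."""
    return (1.0 + c) / 2.0

raw = cosine([1, 0, -1], [-1, -1, 0])   # approximately -0.5
scaled = rescale_to_unit_interval(raw)  # approximately 0.25
print(raw, scaled)
```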

I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:

import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)
v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))
Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712

That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.
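
The single square root works because the product of the two norms can be folded under one radical. As a worked check with v1 and v2 from the snippet (the three sums are computed by hand from those vectors):

```latex
\cos\theta
  = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
  = \frac{\sum_i x_i y_i}{\sqrt{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)}}
  = \frac{2557}{\sqrt{2087 \cdot 3314}}
  \approx 0.97228
```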

ETA: Updated print call to be a function. (The original was Python 2.7, not 3.3. The current runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same, either way.

CPython 2.7.3 on 3.0GHz Core 2 Duo:

>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264

So, the unpythonic way is about 3.6 times faster in this case.
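
For anyone wanting to repeat this comparison on Python 3, a self-contained sketch follows. The names `cosine_loop`, `dot_zip` and `cosine_zip` are stand-ins chosen here; `cosine_zip` is only assumed to resemble the zip-based `cosine_measure` timed above. Timings vary by machine and interpreter, so none are quoted. Note that in Python 3 zip() is lazy and no longer builds an intermediate list, so the gap may differ from the figure reported for Python 2.7.

```python
import math
import timeit

def cosine_loop(v1, v2):
    """Index-based single pass, as in the answer above."""
    sumxx = sumxy = sumyy = 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

def dot_zip(v1, v2):
    """Assumed zip-based dot product (not taken from the original answer)."""
    return sum(x * y for x, y in zip(v1, v2))

def cosine_zip(v1, v2):
    return dot_zip(v1, v2) / (math.sqrt(dot_zip(v1, v1)) * math.sqrt(dot_zip(v2, v2)))

v1, v2 = [3, 45, 7, 2], [2, 54, 13, 15]

for fn in (cosine_loop, cosine_zip):
    seconds = timeit.timeit(lambda: fn(v1, v2), number=20_000)
    print(f"{fn.__name__}: {seconds:.3f} s per 20,000 calls")
```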

Without numpy.dot(), you have to write your own dot function, here using a generator expression:

def dot(A,B): 
    return (sum(a*b for a,b in zip(A,B)))

and then it's just a simple matter of applying the cosine similarity formula:

def cosine_similarity(a,b):
    return dot(a,b) / ( (dot(a,a) **.5) * (dot(b,b) ** .5) )

I benchmarked several of the answers to this question, and the following snippet appears to be the best choice:

import math
import operator
def dot_product2(v1, v2):
    return sum(map(operator.mul, v1, v2))
def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)

I was surprised to find that the SciPy-based implementation is not the fastest one. Profiling showed that cosine in SciPy spends a lot of time converting the vectors from Python lists to NumPy arrays.

                Please advise on the tool you are using to get a visualized representation of the time consumed per line of code.
– YoungSheldon
                Mar 17 at 7:26

import math
from scipy import spatial
def calculate_cosine_distance(a, b):
    cosine_distance = float(spatial.distance.cosine(a, b))
    return cosine_distance
def calculate_cosine_similarity(a, b):
    cosine_similarity = 1 - calculate_cosine_distance(a, b)
    return cosine_similarity
def calculate_angular_distance(a, b):
    cosine_similarity = calculate_cosine_similarity(a, b)
    angular_distance = math.acos(cosine_similarity) / math.pi
    return angular_distance
def calculate_angular_similarity(a, b):
    angular_similarity = 1 - calculate_angular_distance(a, b)
    return angular_similarity
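
One practical caveat with the angular variant: floating-point rounding can push a cosine similarity marginally outside [-1, 1], and math.acos raises ValueError for such inputs. A minimal pure-Python sketch that clamps the value first (the function names are illustrative, and the SciPy call is replaced by a hand-written cosine so the example is self-contained):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def angular_similarity(a, b):
    # Clamp to [-1, 1] so that rounding noise never reaches math.acos.
    c = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return 1.0 - math.acos(c) / math.pi

same = angular_similarity([3, 45, 7, 2], [3, 45, 7, 2])  # close to 1.0
orthogonal = angular_similarity([1, 0], [0, 1])          # 0.5 (angle of 90 degrees)
print(same, orthogonal)
```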

Similarity Search:

If you want to find the closest cosine similarity in an array of embeddings, you can use TensorFlow, as in the following code.

In my testing, the closest match to an embedding of shape 1x512 among 1M embeddings (1,000,000 x 512) was found in less than a second (using a GPU).

import time
import numpy as np  # np.__version__ == '1.23.5'
import tensorflow as tf  # tf.__version__ == '2.11.0'
EMBEDDINGS_LENGTH = 512
NUMBER_OF_EMBEDDINGS = 1000 * 1000
def calculate_cosine_similarities(x, embeddings):
    cosine_similarities = -1 * tf.keras.losses.cosine_similarity(x, embeddings)
    return cosine_similarities.numpy()
def find_closest_embeddings(x, embeddings, top_k=1):
    cosine_similarities = calculate_cosine_similarities(x, embeddings)
    values, indices = tf.math.top_k(cosine_similarities, k=top_k)
    return values.numpy(), indices.numpy()
def main():
    # x shape: (512)
    # Embeddings shape: (1000000, 512)
    x = np.random.rand(EMBEDDINGS_LENGTH).astype(np.float32)
    embeddings = np.random.rand(NUMBER_OF_EMBEDDINGS, EMBEDDINGS_LENGTH).astype(np.float32)
    print('Embeddings shape: ', embeddings.shape)
    n = 100
    sum_duration = 0
    for i in range(n):
        start = time.time()
        best_values, best_indices = find_closest_embeddings(x, embeddings, top_k=1)
        end = time.time()
        duration = end - start
        sum_duration += duration
        print('Duration (seconds): {}, Best value: {}, Best index: {}'.format(duration, best_values[0], best_indices[0]))
    # Average duration (seconds): 1.707 for Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
    # Average duration (seconds): 0.961 for NVIDIA 1080 ti
    print('Average duration (seconds): ', sum_duration / n)
if __name__ == '__main__':
    main()

For more advanced similarity search, you can use Milvus, Weaviate or Faiss.
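
For small collections that do not justify a GPU or a vector database, the same top-k lookup can be sketched with the standard library alone. The function name `find_closest` and the toy data are made up for illustration; this scans every embedding linearly, so it is not a substitute for the libraries above at scale.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def find_closest(query, embeddings, top_k=1):
    """Return up to top_k (similarity, index) pairs, most similar first."""
    scored = ((cosine(query, e), i) for i, e in enumerate(embeddings))
    return heapq.nlargest(top_k, scored)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy data
best = find_closest([2.0, 0.1], embeddings, top_k=2)
print(best)  # pairs for index 0, then index 2
```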

  • https://en.wikipedia.org/wiki/Cosine_similarity
  • https://gist.github.com/amir-saniyan/e102de09b01c4ed1632e3d1a1a1cbf64

    def cosine_measure(v1, v2):
        prod = dot_product(v1, v2)
        len1 = math.sqrt(dot_product(v1, v1))
        len2 = math.sqrt(dot_product(v2, v2))
        return prod / (len1 * len2)

    You can round it after computing:

    cosine = format(round(cosine_measure(v1, v2), 3))
    

    If you want it really short, you can use this one-liner:

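    from math import sqrt
    from functools import reduce
    def cosine_measure(v1, v2):
        # Python 3 version: reduce lives in functools, zip replaces izip, and lambdas cannot unpack tuple arguments.
        # The reduce accumulates (sum of x*y, sum of x*x, sum of y*y) in a single pass.
        return (lambda t: t[0] / sqrt(t[1] * t[2]))(reduce(lambda x, y: (x[0] + y[0] * y[1], x[1] + y[0]**2, x[2] + y[1]**2), zip(v1, v2), (0, 0, 0)))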
    

    You can use this simple function to calculate the cosine similarity:

    import math
    def cosine_similarity(a, b):
      return sum([i*j for i,j in zip(a, b)])/(math.sqrt(sum([i*i for i in a]))* math.sqrt(sum([i*i for i in b])))

    def get_cosine(text1, text2):
      # vec1 and vec2 must be dict-like (for example term -> count), since .keys() is used below
      vec1 = text1
      vec2 = text2
      intersection = set(vec1.keys()) & set(vec2.keys())
      numerator = sum([vec1[x] * vec2[x] for x in intersection])
      sum1 = sum([vec1[x]**2 for x in vec1.keys()])
      sum2 = sum([vec2[x]**2 for x in vec2.keys()])
      denominator = math.sqrt(sum1) * math.sqrt(sum2)
      if not denominator:
         return 0.0
      else:
         return round(float(numerator) / denominator, 3)
    dataSet1 = [3, 45, 7, 2]
    dataSet2 = [2, 54, 13, 15]
    get_cosine(dataSet1, dataSet2)
                    This is a text implementation of cosine. It will give the wrong output for numerical input.
    – alvas
                    Jan 12, 2016 at 10:17
                    Can you explain why you used set in the line "intersection = set(vec1.keys()) & set(vec2.keys())".
    – Ghos3t
                    Apr 12, 2019 at 0:17
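
As the first comment points out, this kind of function expects dict-like term vectors rather than plain lists of numbers. A hedged sketch of how such input is typically built, using `collections.Counter` over whitespace-split words (the tokenization and the name `text_cosine` are assumptions made here). It also shows why the key intersection is used: only terms present in both texts contribute a non-zero product to the numerator.

```python
import math
from collections import Counter

def text_cosine(text1, text2):
    """Cosine similarity of two strings via word-count vectors."""
    vec1, vec2 = Counter(text1.split()), Counter(text2.split())
    shared = set(vec1) & set(vec2)  # terms missing from either side add 0
    numerator = sum(vec1[t] * vec2[t] for t in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

score = text_cosine("red apple", "red pear")  # one shared word out of two each
print(score)  # approximately 0.5
```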
    

    Using NumPy, you can compare one list of numbers against multiple lists (a matrix):

    import numpy as np
    def cosine_similarity(vector,matrix):
       return ( np.sum(vector*matrix,axis=1) / ( np.sqrt(np.sum(matrix**2,axis=1)) * np.sqrt(np.sum(vector**2)) ) )[::-1]
    

    If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.

    Suppose you have two n-dimensional numpy.ndarrays, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:

    import torch
    import torch.nn as nn
    cos = nn.CosineSimilarity()
    cos(torch.tensor([v1]), torch.tensor([v2])).item()
    

    Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:

    cos(torch.tensor(w1), torch.tensor(w2)).tolist()
                    I suggest using the functional implementation of the cosine similarity directly (torch.nn.functional.cosine_similarity), instead of instantiating the module implementation and applying the instance of your tensor.
    – eavsteen
                    Mar 4, 2021 at 21:23
    

    Another version: if you have a list of vectors and a query vector, and you want the cosine similarity of the query vector with every vector in the list, you can compute them all in one go as follows:

    >>> import numpy as np
    >>> A      # list of vectors, shape -> m x n
    array([[ 3, 45,  7,  2],
           [ 1, 23,  3,  4]])
    >>> B      # query vector, shape -> 1 x n
    array([ 2, 54, 13, 15])
    >>> similarity_scores = A.dot(B)/ (np.linalg.norm(A, axis=1) * np.linalg.norm(B))
    >>> similarity_scores
    array([0.97228425, 0.99026919])

    from scipy import spatial
    dataSetI = [3, 45, 7, 2]
    dataSetII = [2, 54, 13, 15]
    print(1 - spatial.distance.cosine(dataSetI, dataSetII))
    

    Note that spatial.distance.cosine() gives you a dissimilarity (distance) value, and thus to get the similarity, you need to subtract that value from 1.

    Another option is to write the function yourself so that it also handles lists of different lengths, by padding the shorter one with zeros:

    import math
    def cosineSimilarity(v1, v2):
      scalarProduct = moduloV1 = moduloV2 = 0
      # Pad the shorter list with zeros (this modifies it in place)
      if len(v1) > len(v2):
        v2.extend(0 for _ in range(len(v1) - len(v2)))
      else:
        v1.extend(0 for _ in range(len(v2) - len(v1)))
      for i in range(len(v1)):
        scalarProduct += v1[i] * v2[i]
        moduloV1 += v1[i] * v1[i]
        moduloV2 += v2[i] * v2[i]
      return round(scalarProduct/(math.sqrt(moduloV1) * math.sqrt(moduloV2)), 3)
    dataSetI = [3, 45, 7, 2]
    dataSetII = [2, 54, 13, 15]
    print(cosineSimilarity(dataSetI, dataSetII))
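
Padding a list with extend changes the caller's data. A non-mutating sketch of the same idea uses `itertools.zip_longest` with `fillvalue=0`; the function name `cosine_padded` is hypothetical, and rounding is left to the caller.

```python
import math
from itertools import zip_longest

def cosine_padded(v1, v2):
    """Cosine similarity that treats missing trailing elements as 0."""
    dot = sq1 = sq2 = 0
    for x, y in zip_longest(v1, v2, fillvalue=0):
        dot += x * y
        sq1 += x * x
        sq2 += y * y
    return dot / (math.sqrt(sq1) * math.sqrt(sq2))

short = [1, 0]
longer = [1, 0, 0]
result = cosine_padded(short, longer)
print(result, short)  # 1.0 [1, 0]  (the input list is left untouched)
```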
    

    We can easily calculate cosine similarity with simple arithmetic: cosine similarity = (dot product of the vectors) / (product of the norms of the vectors). Subtracting this value from 1 gives the cosine distance instead. We can define one function for the dot product and one for the norm.

    def dprod(a,b):
        total=0
        for i in range(len(a)):
            total+=a[i]*b[i]
        return total
    def norm(a):
        total=0
        for i in range(len(a)):
            total+=a[i]**2
        return total**0.5
    # a and b are the two vectors to compare
    cosine_a_b = dprod(a,b)/(norm(a)*norm(b))
    

    Here is an implementation that would work for matrices as well. Its behaviour is exactly like sklearn cosine similarity:

    import numpy as np
    def cosine_similarity(a, b):
        return np.divide(
            np.dot(a, b.T),
            np.linalg.norm(
                a,
                axis=1,
                keepdims=True
            )
            @ # matrix multiplication
            np.linalg.norm(
                b,
                axis=1,
                keepdims=True
            ).T
        )
    

    The @ symbol stands for matrix multiplication. See What does the "at" (@) symbol do in Python?

    All the answers are great for situations where you cannot use NumPy. If you can, here is another approach:

    def cosine(x, y):
        dot_products = np.dot(x, y.T)
        norm_products = np.linalg.norm(x) * np.linalg.norm(y)
        return dot_products / (norm_products + EPSILON)
    

    Note that EPSILON must be defined beforehand, for example EPSILON = 1e-07; it guards against division by zero.
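
The effect of the epsilon term can be seen in a pure-Python sketch (no NumPy; the name `cosine_eps` is chosen here for illustration). A zero vector yields 0.0 instead of raising ZeroDivisionError (NumPy would produce nan with a warning), at the cost of a very slight downward bias for ordinary inputs.

```python
import math

EPSILON = 1e-07

def cosine_eps(x, y):
    """Cosine similarity with a small constant added to the denominator."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_product = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / (norm_product + EPSILON)

zero_case = cosine_eps([0, 0, 0], [1, 2, 3])  # 0.0, no ZeroDivisionError
same_case = cosine_eps([1, 2, 3], [1, 2, 3])  # marginally below 1.0
print(zero_case, same_case)
```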
