numpy 'isin' 性能改进

2024-01-15

我有一个包含 383milj 行的矩阵,我需要根据值列表过滤该矩阵(index_to_remove)。该函数在 1 次迭代期间执行多次。有没有更快的替代方案:

def remove_from_result(matrix, index_to_remove, inv=True):
    return matrix[np.isin(matrix, index_to_remove, invert=inv)]

更快的实施

这是 @Matt Messersmith 使用集合作为列表理解解决方案的编译版本。它基本上是较慢的 np.isin 方法的替代品。我遇到了一些问题index_to_remove是一个标量值,并为此情况实现了一个单独的版本。

Code

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def in1d_vec_nb(matrix, index_to_remove):
  #matrix and index_to_remove have to be numpy arrays
  #if index_to_remove is a list with different dtypes this 
  #function will fail

  out=np.empty(matrix.shape[0],dtype=nb.boolean)
  index_to_remove_set=set(index_to_remove)

  for i in nb.prange(matrix.shape[0]):
    if matrix[i] in index_to_remove_set:
      out[i]=False
    else:
      out[i]=True

  return out

@nb.njit(parallel=True)
def in1d_scal_nb(matrix, index_to_remove):
  #matrix and index_to_remove have to be numpy arrays
  #if index_to_remove is a list with different dtypes this 
  #function will fail

  out=np.empty(matrix.shape[0],dtype=nb.boolean)
  for i in nb.prange(matrix.shape[0]):
    if (matrix[i] == index_to_remove):
      out[i]=False
    else:
      out[i]=True

  return out


def isin_nb(matrix_in, index_to_remove):
  #both matrix_in and index_to_remove have to be a np.ndarray
  #even if index_to_remove is actually a single number
  shape=matrix_in.shape
  if index_to_remove.shape==():
    res=in1d_scal_nb(matrix_in.reshape(-1),index_to_remove.take(0))
  else:
    res=in1d_vec_nb(matrix_in.reshape(-1),index_to_remove)

  return res.reshape(shape)

Example

data = np.array([[80,1,12],[160,2,12],[240,3,12],[80,4,11]])
test_elts= np.array((80))

data[isin_nb(data[:,0],test_elts),:]

Tmings

test_elts = np.arange(12345)
data=np.arange(1000*1000)

#The first call has compilation overhead of about 300ms
#which is not included in the timings
#remove_from_result:     52ms
#isin_nb:                1.59ms
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

numpy 'isin' 性能改进 的相关文章

随机推荐