【数据清洗】图像数据清洗之---去除相似度高的图像

2023-05-16

目的：人工做数据清洗较为麻烦，而且费事费力没成绩，还拉拽整个项目的后腿。所以这里根据调研情况，分析尝试一下。

1.调研分析

(1) 百度EasyData

参考：百度大脑自己的csdn说明

有几点：

1）过滤无对应目标的图，如果有开源通用模型，人脸或者行人检测等，是可以解决的。如果是无，那前期也是需要自己训练个检测模型，然后做半自动的标注与筛选的。然后根据新加入的数据，继续迭代地优化模型。

2）去相似图片。这里的相似是特征相似度。如果有对应的分类提取特征的网络结构的话（有开源用开源，无开源就迭代训练），那么可以利用图像检索的方式获得图像特征向量，然后计算特征向量的cosine距离或者欧式距离等。设置较大的阈值，将较为冗余的图像筛选去掉。

3）去模糊图片。即去除图像质量不高的图片，猜测这个可以找网上的对于图像清晰度打分的这种模型或者是图像质量评估IQA的模型，进行筛选。初步判断不太需要自己去训练模型，但是呢，对于最后保留什么样质量的图像，这个是我们要自己写一些逻辑的。不过难度也不大。

(2) 华为ModelArts

参考：华为数据清洗服务

可以根据提供正样本与负样本，将数据筛选好。看了一下，里面也是有特征提取，相似度计算的。提到的主要有数据清洗聚类、异常检测、相似度计算、特征提取器等。教程写得很详细。

(3) 专利

还有一篇专利，关于根据图像特征相似度，做数据清洗的。https://patentimages.storage.googleapis.com/a0/89/06/563a1061fac7fe/CN107480203A.pdf

2.实践

import torch
import torch.nn.functional as F
import os
import shutil
import cv2

def compute_consine_similar(features, others):
    features = torch.from_numpy(features)
    others = torch.from_numpy(others)
    features = F.normalize(features, p=2, dim=1)
    others = F.normalize(others, p=2, dim=1)
    similar_m = torch.mm(features, others.t())
    return similar_m.cpu().numpy()


if __name__ == '__main__':
    test_path = "/train_prepare/"
    save_path = "/crops_remove/train_prepare/"
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    feature_pre, feature_now = None, None
    img_name_pre, img_name_now = None, None
    img_pre, img_now = None, None
    
    net = Net()
    state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)[
            'net_dict']
    net.load_state_dict(state_dict)

    for folder in os.listdir(test_path):
        for img_name in sorted(os.listdir(test_path + folder)):
            img_path = test_path + folder + "/" + img_name
            print("img_path : {}".format(img_path))
            img = cv2.imread(img_path)
            img = cv2.resize(img, (256, 128))
            
            feature = net.eval(img)
            feature_pre, feature_now = feature_now, feature
            img_name_pre, img_name_now = img_name_now, img_name
            img_pre, img_now = img_now, img

            if feature_now is not None and feature_pre is not None:
                print(feature_now.shape)
                similar_m = compute_consine_similar(feature_now, feature_pre)
                print(img_name_now + " and " + img_name_pre + " have {} similar".format(similar_m[0][0]))

                imgs = np.hstack([img_now, img_pre])
                # 展示多个
                cv2.putText(imgs, str(similar_m[0][0]), (20,20), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,255,0), 4)
                cv2.imshow("mutil_pic", imgs)
                # 等待关闭
                if similar_m[0][0] <= 0.999:
                    cv2.waitKey(1000)
                else:
                    if not os.path.exists(save_path+folder):
                        os.makedirs(save_path+folder)
                    remove_path = save_path + folder + "/" + img_name
                    print("We will move {} to {} !".format(img_name, save_path))
                    shutil.move(img_path, remove_path)
                    cv2.waitKey(100)

大致按照上面的流程来的。

1）完成根据数据特征相似度筛选功能；

2）可视化前后帧的一个opencv窗口；（参考：https://zoutao.blog.csdn.net/article/details/87009082）

3）将筛选出来的数据保存起来。

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)