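I ran some experiments with SOMs. I started with MiniSOM in Python but was not impressed, so I switched to the kohonen package in R, which offers more features than the former. Basically, I applied SOM to three use cases: (1) clustering in 2D with generated data, (2) clustering with higher-dimensional data, namely the built-in wine data set, and (3) outlier detection. I solved all three use cases, but I would like to raise a question about the outlier detection I applied. For this purpose I used the vector som$distances, which contains a distance for each row of the input data set. Rows with exceptionally large distances may be outliers. However, I do not know how this distance is computed. The package description (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf) says of this metric: "distance to the closest unit".

  1. Could you please tell me how this distance is computed?
  2. Could you comment on the outlier detection I used? How would you have done it? (In the generated data set it really did find the outliers. In the real wine data set, four of the 177 wines have relatively outstanding values. See the charts below. The idea of depicting this with bar charts occurred to me, and I really like it.)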


  • Generated data: 100 points in 2D in 5 distinct clusters, plus 2 outliers (category 6 marks the outliers). [figure]

  • Distances for all 102 data points; the last two are the outliers, which were correctly identified. I repeated the test with 500 and 1000 data points and added only 2 outliers; the outliers were found in those cases as well. [figure]

  • Distances for the real wine data set, with potential outliers. [figure]

Row IDs of the potential outliers:

        # print the row id of the outliers
        # the threshold 10 can be taken from the bar chart,
        # below which the vast majority of the values fall
        df_wine[df_wine$value > 10, ]

It produces the following output:

            index    value
        59     59 12.22916
        110   110 13.41211
        121   121 15.86576
        158   158 11.50079



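        library(kohonen)   # som(), somgrid() and the wines data set
        library(ggplot2)   # ggplot() for the bar chart

        data(wines)
        scaled_wines <- scale(wines)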

        # creating and training SOM
        som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))

        #looking for outliers, dist = distance to the closest unit

        len <- length(som.wines$distances)
        index_in_vector <- c(1:len)
        df_wine<-data.frame(cbind(index_in_vector, som.wines$distances))
        colnames(df_wine) <-c("index", "value")

        po <- ggplot(df_wine, aes(index, value)) + geom_bar(stat = "identity")
        po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) +
          ylab("Distances in som.wines$distances") + xlab("Number of Rows in the Data Set")
        print(po)

        # print the row id of the outliers
        # the threshold 10 can be taken from the bar chart,
        # below which the vast majority of the values fall
        df_wine[df_wine$value > 10, ]
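
The threshold of 10 above is read off the bar chart by eye. As a possible alternative, here is a minimal base-R sketch that derives the cut-off from the distances themselves using a Tukey-style upper fence (third quartile plus 1.5 times the interquartile range). The `distances` vector is made-up toy data standing in for `som.wines$distances`, and `upper_fence` is a hypothetical helper, not part of kohonen; whether this rule suits a given data set is an assumption to check against the chart.

```r
# Toy distances standing in for som.wines$distances (hypothetical values)
distances <- c(2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 12.2, 2.2, 3.0, 15.9)

# Tukey-style upper fence: Q3 + k * IQR (k = 1.5 is the conventional choice)
upper_fence <- function(d, k = 1.5) {
  q <- quantile(d, c(0.25, 0.75), names = FALSE)
  q[2] + k * (q[2] - q[1])
}

threshold <- upper_fence(distances)

# Row ids whose distance exceeds the fence
outlier_rows <- which(distances > threshold)
print(outlier_rows)   # rows 7 and 10 in this toy vector
```

With the real model, `distances` would simply be replaced by `som.wines$distances`; the factor `k` can be raised to flag fewer rows.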


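Following the discussion in the comments, I am also posting the requested code snippets. As far as I remember, the lines responsible for the clustering were built from examples I found in the description of the kohonen package (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf). I am not entirely sure, though; it was more than a year ago. The code is provided as is, without any warranty :-). Please keep in mind that a particular clustering method may perform with different accuracy on different data. I would also recommend comparing it with t-SNE on the wine data set (data(wines) is available in R). In addition, implement heatmaps to show how the data are located with respect to the individual variables. (In the example above with 2 variables this is not important, but it would be nice for the wine data set.)

Data generation and plotting with five clusters and 2 outliers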



            library(ggplot2)

            generate_data <- function(num_of_points, num_of_clusters, outliers=TRUE){
              num_of_points_per_cluster <- num_of_points/num_of_clusters
              cat(sprintf("#### num_of_points_per_cluster = %s, num_of_clusters = %s \n", num_of_points_per_cluster, num_of_clusters))
              standard_dev_y <- 6000
              standard_dev_x <- 2
              # for reproducibility, call set.seed() with a seed of your choice here
              # placeholder NA row so that rbind() has something to append to
              arr <- matrix(NA, nrow=1, ncol=3)
              for (i in 1:num_of_clusters){
                centroid_y <- runif(1, min=10000, max=200000)
                centroid_x <- runif(1, min=20, max=70)
                cat(sprintf("centroid_x = %s, centroid_y = %s \n", centroid_x, centroid_y))
                vector_y <- rnorm(num_of_points_per_cluster, mean=centroid_y, sd=standard_dev_y)
                vector_x <- rnorm(num_of_points_per_cluster, mean=centroid_x, sd=standard_dev_x)
                cluster <- array(c(vector_y, vector_x), dim=c(num_of_points_per_cluster, 2))
                cluster <- cbind(cluster, i)
                arr <- rbind(arr, cluster)
              }
              if (outliers){
                #adding two outliers (category 6)
                arr <- rbind(arr, c(10000, 30, 6))
                arr <- rbind(arr, c(150000, 70, 6))
              }
              colnames(arr) <-c("y", "x", "Cluster")
              # WA to remove the first NA row
              arr <- na.omit(arr)
              return(arr)
            }

            scatter_plot_data <- function(data_in, couloring_base_indx, main_label){
              df <- data.frame(data_in)
              colnames(df)[1:3] <- c("y", "x", "Cluster")
              # factor used for colouring the points
              df$group <- factor(df[, couloring_base_indx])

              pl <- ggplot(data=df, aes(x=x, y=y)) + geom_point(aes(color=group))
              pl <- pl + ggtitle(main_label) + theme(plot.title = element_text(hjust = 0.5))
              return(pl)
            }

            # generating data
            data <- generate_data(100, 5, TRUE)
            scatter_plot_data(data, couloring_base_indx=3, "Original Clusters with Outliers \n 102 Points")


I used hierarchical clustering on top of the Kohonen map (SOM).

            library(kohonen)
            library(ggplot2)

            normalising_data <- function(data){
              # normalizing data points not the cluster identifiers
              mtrx <- data.matrix(data)
              umtrx <- scale(mtrx[,1:2])
              umtrx <- cbind(umtrx, factor(mtrx[,3]))
              colnames(umtrx) <-c("y", "x", "Cluster")
              return(umtrx)
            }

            train_som <- function(umtrx){
              # unsupervised learning
              g <- somgrid(xdim=5, ydim=5, topo="hexagonal")
              #map<-som(umtrx[, 1:2], grid=g, alpha=c(0.005, 0.01), radius=1, rlen=1000)
              map<-som(umtrx[, 1:2], grid=g)
              return(map)
            }

            plot_som_data <- function(map){
              # to plot some characteristics of the SOM map
              plot(map, type='changes')
              plot(map, type='codes', main="Mapping Data")
              plot(map, type='count') # how many data points are held by each neuron
              plot(map, type='mapping') # which data points are mapped to which neuron
              plot(map, type='dist.neighbours') # the darker the colours are, the closer the points are; the lighter the colours are, the more distant the points are
              #to switch the plot config to the normal
              par(mfrow = c(1,1))
            }

            plot_disstances_to_the_closest_point <- function(map){
              #looking for outliers, dist = distance to the closest unit
              len <- length(map$distances)
              index_in_vector <- c(1:len)
              df<-data.frame(cbind(index_in_vector, map$distances))
              colnames(df) <-c("index", "value")
              po <-ggplot(df, aes(index, value)) + geom_bar(stat = "identity")
              po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som$distances") + xlab("Number of Rows in the Data Set")
              return(po)
            }

            # unsupervised learning
            umtrx <- normalising_data(data)
            map <- train_som(umtrx)
            plot_som_data(map)
            plot_disstances_to_the_closest_point(map)

            # creating the dendrogram and then the clusters for the neurons
            dendogram <- hclust(object.distances(map, "codes"), method = 'ward.D')

            clusters <- cutree(dendogram, 7)

            #visualising the clusters on the map
            par(mfrow = c(1,1))
            plot(map, type='dist.neighbours', main="Mapping Data")
            add.cluster.boundaries(map, clusters)


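You can also create nice heatmaps for selected variables, but I have not implemented them for clustering with 2 variables, where it does not really make sense. If you implement it for the wine data set, please add the code and the charts to this post.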

            #see the predicted clusters with the data set
            # 1. add the vector of the neuron ids to the data
            mapped_neurons <- map$unit.classif
            umtrx <- cbind(umtrx, mapped_neurons)

            # 2. taking the predicted clusters and adding them to the original matrix
            # very good description of the apply functions:
            # https://www.guru99.com/r-apply-sapply-tapply.html
            get_cluster_for_the_row <- function(x, cltrs){
              # reconstructed body: x is a neuron id, cltrs maps neuron ids to clusters
              return(unname(cltrs[x]))
            }

            predicted_clusters <- sapply(umtrx[,4], get_cluster_for_the_row, cltrs=clusters)

            mtrx <- cbind(data.matrix(data), predicted_clusters)
            scatter_plot_data(mtrx, couloring_base_indx=4, "Predicted Clusters with Outliers \n 102 points")


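  1. Although I am not entirely sure, in my experience the distance between two objects in a space of a given dimension is mostly measured with the Euclidean distance. For example, two points A and B in a 2D space located at A(x=3, y=4) and B(x=6, y=8) are 5 distance units apart. This is the result of computing sqrt((3-6)^2 + (4-8)^2). The same applies to data with more dimensions, by adding the squared difference of the two points' values in each further dimension. If A(x=3, y=4, z=5) and B(x=6, y=8, z=7), then the distance is sqrt((3-6)^2 + (4-8)^2 + (5-7)^2), and so on. In kohonen, I think that after the model has finished the training phase, the algorithm computes the distance from each data point to all nodes and then assigns the point to its nearest node (the node at the smallest distance). Finally, the values in the "distances" variable returned by the model are the distances from each data point to its nearest node. One thing to note in your script is that the algorithm does not measure distances on the original attribute values of the data, because they were scaled before being fed into the model. The distance measure is applied to the scaled version of the data. Scaling is a standard procedure for removing the dominance of one variable over another.
  2. I believe your approach is acceptable, because the values in the "distances" variable are the distances from each data point to its nearest node. So if the distance between a data point and its nearest node is high, its distance to every other node is necessarily at least as high.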
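
The two points above can be sketched in a few lines of base R: the Euclidean distance from the worked examples, followed by the "distance to the nearest node" for a single scaled observation. The `codebook` matrix and the helper names `euclidean` and `nearest_unit_distance` are hypothetical and chosen only for illustration; this is not kohonen's internal code. One assumption to verify against the package documentation: recent kohonen versions appear to default to a "sumofsquares" distance function, in which case the stored `distances` would be squared Euclidean distances rather than the square-rooted values computed here.

```r
# Euclidean distance between two points of equal dimension
euclidean <- function(a, b) sqrt(sum((a - b)^2))

d2 <- euclidean(c(3, 4), c(6, 8))        # 2D example from the text
d3 <- euclidean(c(3, 4, 5), c(6, 8, 7))  # 3D example, sqrt(9 + 16 + 4)

# Hypothetical codebook: one row per SOM node, one column per scaled variable
codebook <- rbind(c(0, 0), c(1, 1), c(-1, 2))

# Distance from one scaled observation to every node, keeping the nearest
nearest_unit_distance <- function(x, codes) {
  d <- apply(codes, 1, euclidean, b = x)
  c(unit = which.min(d), distance = min(d))
}

res <- nearest_unit_distance(c(0.9, 1.2), codebook)
print(d2)   # 5
print(res)  # nearest node is row 2 of the toy codebook
```

An observation far from every row of the codebook gets a large `distance`, which is the quantity the bar chart in the question visualises.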

