[Data Mining] (1) Programming with Jupyter

2023-11-04

To get familiar with Jupyter, I worked through a book for practice.
Reference: Learning Data Mining with Python (《Python数据挖掘入门与实践》)
Dataset:

https://github.com/packtpublishing/learning-data-mining-with-python

First lines of code

import numpy as np
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
print(X[:5])

[[0. 0. 1. 1. 1.]
[1. 1. 0. 1. 0.]
[1. 0. 1. 1. 0.]
[0. 0. 1. 1. 1.]
[0. 1. 0. 0. 1.]]


num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:
        num_apple_purchases += 1
print("{0} people bought apples".format(num_apple_purchases))

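As a small extension of the same counting loop, we can measure how often a rule such as "people who buy apples also buy bananas" holds. This is a minimal sketch; the column meanings used here (column 3 = apples, column 4 = bananas) are assumptions about the dataset layout, since the file itself has no header.

# Sketch: support and confidence of the rule "apples -> bananas"
# Assumption: column 3 records apples, column 4 records bananas
rule_valid = 0    # bought apples and bananas
rule_invalid = 0  # bought apples but not bananas
for sample in X:
    if sample[3] == 1:
        if sample[4] == 1:
            rule_valid += 1
        else:
            rule_invalid += 1
support = rule_valid
confidence = rule_valid / (rule_valid + rule_invalid)
print("Support: {0}, Confidence: {1:.3f}".format(support, confidence))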

Classification

The Iris dataset

from sklearn.datasets import load_iris
dataset = load_iris()
X = dataset.data
y = dataset.target
print(dataset.DESCR)

Converting continuous values into categorical ones is called discretization.
The simplest discretization method is to pick a threshold: feature values below the threshold become 0, and values above it become 1.
Here we set the threshold for each feature to the mean of all of that feature's values.
The mean of each feature is computed as follows:

attribute_means =X.mean(axis=0)

This gives an array of length 4, one entry per feature.
The first entry is the mean of the first feature, and so on.
Next, we use these thresholds to discretize the dataset, converting the continuous feature values into categorical ones.

X_d = np.array(X >= attribute_means, dtype='int')

All later training and testing uses the new X_d dataset (the discretized version of X) rather than the original dataset X.

n_samples, n_features = X.shape
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')
import sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set the random state to the same number to get the same results as in the book
random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape))
print("There are {} testing samples".format(y_test.shape))

There are (112,) training samples
There are (38,) testing samples


from collections import defaultdict
from operator import itemgetter


def train(X, y_true, feature):
    """Computes the predictors and error for a given feature using the OneR algorithm
    
    Parameters
    ----------
    X: array [n_samples, n_features]
        The two dimensional array that holds the dataset. Each row is a sample, each column
        is a feature.
    
    y_true: array [n_samples,]
        The one dimensional array that holds the class values. Corresponds to X, such that
        y_true[i] is the class value for sample X[i].
    
    feature: int
        An integer corresponding to the index of the variable we wish to test.
        0 <= variable < n_features
        
    Returns
    -------
    predictors: dictionary of tuples: (value, prediction)
        For each item in the array, if the variable has a given value, make the given prediction.
    
    error: float
        The ratio of training data that this rule incorrectly predicts.
    """
    # Check that variable is a valid number
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all of the unique values that this variable has
    values = set(X[:,feature])
    # Stores the predictors array that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Compute the total error of using this feature to classify on
    total_error = sum(errors)
    return predictors, total_error

# Compute what our predictors say each sample is based on its value
#y_predicted = np.array([predictors[sample[feature]] for sample in X])
    

def train_feature_value(X, y_true, feature, value):
    # Create a simple dictionary to count how frequently each class appears for this feature value
    class_counts = defaultdict(int)
    # Iterate through each sample and count the frequency of each class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Now get the best one by sorting (highest first) and choosing the first item
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples that do not classify as the most frequent class
    # *and* have the feature value.
    n_samples = X.shape[0]
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    return most_frequent_class, error
# Compute all of the predictors
all_predictors = {variable: train(X_train, y_train, variable) for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Now choose the best and save that as "model"
# Sort by error
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))

# Choose the best model
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
print(model)

The best model is based on variable 2 and has error 37.00
{'variable': 2, 'predictor': {0: 0, 1: 2}}
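To see which iris feature the chosen variable corresponds to, we can look it up in the feature names of the dataset object loaded earlier (a quick sanity check):

# Variable 2 is the third column of the iris data
print(dataset.feature_names[model['variable']])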

def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted

We often need to predict several samples at once, so this function iterates over every sample in the dataset and makes a prediction for each one.

y_predicted = predict(X_test, model)
print(y_predicted)

[0 0 0 2 2 2 0 2 0 2 2 0 2 2 0 2 0 2 2 2 0 0 0 2 0 2 0 2 2 0 0 0 2 0 2 0 2
2]

Comparing the predictions against the true classes gives us the accuracy.

# Compute the accuracy by taking the mean of the amounts that y_predicted is equal to y_test
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))

The test accuracy is 65.8%


from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))


Nearest Neighbors

Home directory location


Dataset:

http://archive.ics.uci.edu/ml/datasets/Ionosphere

import os
home_folder = os.path.expanduser("~")
# Change this to the location of your dataset
data_folder = os.path.join(home_folder, "Data", "Ionosphere")
data_filename = os.path.join(data_folder, "ionosphere.data")
print(data_filename)

C:\Users\83854\Data\Ionosphere\ionosphere.data

import csv
import numpy as np

# Size taken from the dataset and is known
X = np.zeros((351, 34), dtype='float')
y = np.zeros((351,), dtype='bool')

with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        # Get the data, converting each item to a float
        data = [float(datum) for datum in row[:-1]]
        # Set the appropriate row in our dataset
        X[i] = data
        # 1 if the class is 'g', 0 otherwise
        y[i] = row[-1] == 'g'
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)
print("There are {} samples in the training dataset".format(X_train.shape[0]))
print("There are {} samples in the testing dataset".format(X_test.shape[0]))
print("Each sample has {} features".format(X_train.shape[1]))

There are 263 samples in the training dataset
There are 88 samples in the testing dataset
Each sample has 34 features

from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier()
estimator.fit(X_train, y_train)


y_predicted = estimator.predict(X_test)
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))

The accuracy is 86.4%

from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator, X, y, scoring='accuracy')
average_accuracy = np.mean(scores) * 100
print("The average accuracy is {0:.1f}%".format(average_accuracy))

The average accuracy is 82.6%
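Note that the default number of folds used by cross_val_score has changed across scikit-learn versions (3 folds in older releases, 5 stratified folds now), so the average above may differ slightly from the book. Passing cv explicitly is an optional way to pin the behaviour down:

# Fix the number of folds so the result does not depend on the sklearn version
scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=5)
print("The average accuracy is {0:.1f}%".format(np.mean(scores) * 100))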


avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Including 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy')
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)
from matplotlib import pyplot as plt
plt.figure(figsize=(32,20))
plt.plot(parameter_values, avg_scores, '-o', linewidth=5, markersize=24)
#plt.axis([0, max(parameter_values), 0, 1.0])


for parameter, scores in zip(parameter_values, all_scores):
    n_scores = len(scores)
    plt.plot([parameter] * n_scores, scores, '-o')


plt.plot(parameter_values, all_scores, 'bx')

[<matplotlib.lines.Line2D at 0x17f0c1866b0>,
<matplotlib.lines.Line2D at 0x17f0c186770>,
<matplotlib.lines.Line2D at 0x17f0c186890>,
<matplotlib.lines.Line2D at 0x17f0c185600>,
<matplotlib.lines.Line2D at 0x17f0c1869b0>]


Movie Recommendation

Dataset:

https://grouplens.org/datasets/movielens/

import os
import pandas as pd
#data_folder =os.path.join(os.path.expanduser("~"),"shujvji","ml-100k")
#ratings_filename=os.path.join(data_folder,"u.data")
ratings_filename = r"D:\coder\randomnumbers\shujvji\ml-100k\u.data"

The dataset is well formatted, but a few details differ from the defaults of pandas.read_csv, so some parameters need adjusting.
First, the fields on each line are separated by tabs rather than commas.
Second, there is no header row, which means the first line of the file is already data, so we have to supply column names ourselves.

all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None, names=["UserID", "MovieID", "Rating", "Datetime"])

Run the following code to look at the first five records:

all_ratings[:5]

UserID MovieID Rating Datetime
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596


Complete code

The complete code is below:

import os
data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k")
ratings_filename = os.path.join(data_folder, "u.data")
import pandas as pd
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None, names = ["UserID", "MovieID", "Rating", "Datetime"])
all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'],unit='s')
all_ratings[:5]
UserID MovieID Rating Datetime
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16
all_ratings[all_ratings["UserID"] == 675].sort_values("MovieID")  
UserID MovieID Rating Datetime
81098 675 86 4 1998-03-10 00:26:14
90696 675 223 1 1998-03-10 00:35:51
92650 675 235 1 1998-03-10 00:35:51
95459 675 242 4 1998-03-10 00:08:42
82845 675 244 3 1998-03-10 00:29:35
53293 675 258 3 1998-03-10 00:11:19
97286 675 269 5 1998-03-10 00:08:07
93720 675 272 3 1998-03-10 00:07:11
73389 675 286 4 1998-03-10 00:07:11
77524 675 303 5 1998-03-10 00:08:42
47367 675 305 4 1998-03-10 00:09:08
44300 675 306 5 1998-03-10 00:08:07
53730 675 311 3 1998-03-10 00:10:47
54284 675 312 2 1998-03-10 00:10:24
63291 675 318 5 1998-03-10 00:21:13
87082 675 321 2 1998-03-10 00:11:48
56108 675 344 4 1998-03-10 00:12:34
53046 675 347 4 1998-03-10 00:07:11
94617 675 427 5 1998-03-10 00:28:11
69915 675 463 5 1998-03-10 00:16:43
46744 675 509 5 1998-03-10 00:24:25
46598 675 531 5 1998-03-10 00:18:28
52962 675 650 5 1998-03-10 00:32:51
94029 675 750 4 1998-03-10 00:08:07
53223 675 874 4 1998-03-10 00:11:19
62277 675 891 2 1998-03-10 00:12:59
77274 675 896 5 1998-03-10 00:09:35
66194 675 900 4 1998-03-10 00:10:24
54994 675 937 1 1998-03-10 00:35:51
61742 675 1007 4 1998-03-10 00:25:22
49225 675 1101 4 1998-03-10 00:33:49
50692 675 1255 1 1998-03-10 00:35:51
74202 675 1628 5 1998-03-10 00:30:37
47866 675 1653 5 1998-03-10 00:31:53
all_ratings["Favorable"] = all_ratings["Rating"] > 3
all_ratings[10:15]
UserID MovieID Rating Datetime Favorable
10 62 257 2 1997-11-12 22:07:14 False
11 286 1014 5 1997-11-17 15:38:45 True
12 200 222 5 1997-10-05 09:05:40 True
13 210 40 3 1998-03-27 21:59:54 False
14 224 29 3 1998-02-21 23:40:57 False
all_ratings[all_ratings["UserID"] == 1][:5]
UserID MovieID Rating Datetime Favorable
202 1 61 4 1997-11-03 07:33:40 True
305 1 189 3 1998-03-01 06:15:28 False
333 1 33 4 1997-11-03 07:38:19 True
334 1 160 4 1997-09-24 03:42:27 True
478 1 20 4 1998-02-14 04:51:23 True
ratings = all_ratings[all_ratings['UserID'].isin(range(200))]  # & ratings["UserID"].isin(range(100))]
favorable_ratings = ratings[ratings["Favorable"]]
favorable_ratings[:5]
UserID MovieID Rating Datetime Favorable
16 122 387 5 1997-11-11 17:47:39 True
20 119 392 4 1998-01-30 16:13:34 True
21 167 486 4 1998-04-16 14:54:12 True
26 38 95 5 1998-04-13 01:14:54 True
28 63 277 4 1997-10-01 23:10:01 True
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"])
len(favorable_reviews_by_users)
199
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5]
Favorable
MovieID
50 100
100 89
258 83
181 79
174 74
from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
import sys
frequent_itemsets = {}  # itemsets are sorted by length
min_support = 50

# k=1 candidates are the movies with more than min_support favourable reviews
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                                for movie_id, row in num_favorable_by_movie.iterrows()
                                if row["Favorable"] > min_support)

print("There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support))
sys.stdout.flush()
for k in range(2, 20):
    # Generate candidates of length k, using the frequent itemsets of length k-1
    # Only store the frequent itemsets
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1],
                                                   min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        #print(cur_frequent_itemsets)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
# We aren't interested in the itemsets of length 1, so remove those
del frequent_itemsets[1]
There are 16 movies with more than 50 favorable reviews
I found 93 frequent itemsets of length 2
I found 295 frequent itemsets of length 3
I found 593 frequent itemsets of length 4
I found 785 frequent itemsets of length 5
I found 677 frequent itemsets of length 6
I found 373 frequent itemsets of length 7
I found 126 frequent itemsets of length 8
I found 24 frequent itemsets of length 9
I found 2 frequent itemsets of length 10
Did not find any frequent itemsets of length 11
print("Found a total of {0} frequent itemsets".format(sum(len(itemsets) for itemsets in frequent_itemsets.values())))
Found a total of 2968 frequent itemsets
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
print("There are {} candidate rules".format(len(candidate_rules)))
There are 15285 candidate rules
print(candidate_rules[:5])
[(frozenset({7}), 1), (frozenset({1}), 7), (frozenset({50}), 1), (frozenset({1}), 50), (frozenset({1}), 56)]
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
rule_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
              for candidate_rule in candidate_rules}
min_confidence = 0.9
rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}
print(len(rule_confidence))
5152
from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends frozenset({98, 181}) they will also recommend 50
 - Confidence: 1.000

Rule #2
Rule: If a person recommends frozenset({172, 79}) they will also recommend 174
 - Confidence: 1.000

Rule #3
Rule: If a person recommends frozenset({258, 172}) they will also recommend 174
 - Confidence: 1.000

Rule #4
Rule: If a person recommends frozenset({1, 181, 7}) they will also recommend 50
 - Confidence: 1.000

Rule #5
Rule: If a person recommends frozenset({1, 172, 7}) they will also recommend 174
 - Confidence: 1.000

movie_name_filename = os.path.join(data_folder, "u.item")
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None, encoding = "mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>", "Action", "Adventure",
                           "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
                           "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title
get_movie_name(4)
'Get Shorty (1995)'
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends Silence of the Lambs, The (1991), Return of the Jedi (1983) they will also recommend Star Wars (1977)
 - Confidence: 1.000

Rule #2
Rule: If a person recommends Empire Strikes Back, The (1980), Fugitive, The (1993) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #3
Rule: If a person recommends Contact (1997), Empire Strikes Back, The (1980) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #4
Rule: If a person recommends Toy Story (1995), Return of the Jedi (1983), Twelve Monkeys (1995) they will also recommend Star Wars (1977)
 - Confidence: 1.000

Rule #5
Rule: If a person recommends Toy Story (1995), Empire Strikes Back, The (1980), Twelve Monkeys (1995) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000

# Evaluation using test data
test_dataset = all_ratings[~all_ratings['UserID'].isin(range(200))]
test_favorable = test_dataset[test_dataset["Favorable"]]
#test_not_favourable = test_dataset[~test_dataset["Favourable"]]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby("UserID")["MovieID"])
#test_not_favourable_by_users = dict((k, frozenset(v.values)) for k, v in test_not_favourable.groupby("UserID")["MovieID"])
#test_users = test_dataset["UserID"].unique()
test_dataset[:5]
UserID MovieID Rating Datetime Favorable
3 244 51 2 1997-11-27 05:02:03 False
5 298 474 4 1998-01-07 14:20:06 True
7 253 465 5 1998-04-03 18:34:27 True
8 305 451 3 1998-02-01 09:20:17 False
11 286 1014 5 1997-11-17 15:38:45 True
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in rule_confidence}
print(len(test_confidence))
5152
sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)
print(sorted_test_confidence[:5])
[((frozenset({64, 1, 7, 79, 50}), 174), 1.0), ((frozenset({64, 1, 98, 7, 79}), 174), 1.0), ((frozenset({64, 1, 7, 172, 79}), 174), 1.0), ((frozenset({64, 1, 7, 79, 181}), 174), 1.0), ((frozenset({64, 1, 172, 79, 56}), 174), 1.0)]
for index in range(10):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name))
    print(" - Train Confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1)))
    print(" - Test Confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1)))
    print("")
Rule #1
Rule: If a person recommends Silence of the Lambs, The (1991), Return of the Jedi (1983) they will also recommend Star Wars (1977)
 - Train Confidence: 1.000
 - Test Confidence: 0.936

Rule #2
Rule: If a person recommends Empire Strikes Back, The (1980), Fugitive, The (1993) they will also recommend Raiders of the Lost Ark (1981)
 - Train Confidence: 1.000
 - Test Confidence: 0.876

Rule #3
Rule: If a person recommends Contact (1997), Empire Strikes Back, The (1980) they will also recommend Raiders of the Lost Ark (1981)
 - Train Confidence: 1.000
 - Test Confidence: 0.841

Rule #4
Rule: If a person recommends Toy Story (1995), Return of the Jedi (1983), Twelve Monkeys (1995) they will also recommend Star Wars (1977)
 - Train Confidence: 1.000
 - Test Confidence: 0.932

Rule #5
Rule: If a person recommends Toy Story (1995), Empire Strikes Back, The (1980), Twelve Monkeys (1995) they will also recommend Raiders of the Lost Ark (1981)
 - Train Confidence: 1.000
 - Test Confidence: 0.903

Rule #6
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Star Wars (1977) they will also recommend Raiders of the Lost Ark (1981)
 - Train Confidence: 1.000
 - Test Confidence: 0.816

Rule #7
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Return of the Jedi (1983) they will also recommend Star Wars (1977)
 - Train Confidence: 1.000
 - Test Confidence: 0.970

Rule #8
Rule: If a person recommends Toy Story (1995), Silence of the Lambs, The (1991), Return of the Jedi (1983) they will also recommend Star Wars (1977)
 - Train Confidence: 1.000
 - Test Confidence: 0.933

Rule #9
Rule: If a person recommends Toy Story (1995), Empire Strikes Back, The (1980), Return of the Jedi (1983) they will also recommend Star Wars (1977)
 - Train Confidence: 1.000
 - Test Confidence: 0.971

Rule #10
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Shawshank Redemption, The (1994) they will also recommend Silence of the Lambs, The (1991)
 - Train Confidence: 1.000
 - Test Confidence: 0.794


Feature Extraction

Part 1

Dataset:

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/

Complete code
import os
import pandas as pd
data_folder = os.path.join(os.path.expanduser("~"), "Data", "Adult")
adult_filename = os.path.join(data_folder, "adult.data")
print(adult_filename)
C:\Users\83854\Data\Adult\adult.data
adult = pd.read_csv(adult_filename, header=None, names=["Age", "Work-Class", "fnlwgt", "Education",
                                                        "Education-Num", "Marital-Status", "Occupation",
                                                        "Relationship", "Race", "Sex", "Capital-gain",
                                                        "Capital-loss", "Hours-per-week", "Native-Country",
                                                        "Earnings-Raw"])
adult.dropna(how='all', inplace=True)
adult.columns
Index(['Age', 'Work-Class', 'fnlwgt', 'Education', 'Education-Num',
       'Marital-Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native-Country',
       'Earnings-Raw'],
      dtype='object')
adult["Hours-per-week"].describe()
count    32561.000000
mean        40.437456
std         12.347429
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: Hours-per-week, dtype: float64
adult["Education-Num"].median()
10.0
adult["Work-Class"].unique()
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)
import numpy as np
X = np.arange(30).reshape((10, 3))
X
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])
X[:,1] = 1
X
array([[ 0,  1,  2],
       [ 3,  1,  5],
       [ 6,  1,  8],
       [ 9,  1, 11],
       [12,  1, 14],
       [15,  1, 17],
       [18,  1, 20],
       [21,  1, 23],
       [24,  1, 26],
       [27,  1, 29]])
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
Xt = vt.fit_transform(X)
Xt
array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14],
       [15, 17],
       [18, 20],
       [21, 23],
       [24, 26],
       [27, 29]])
print(vt.variances_)
[27.  0. 27.]
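The middle column is dropped because it was set to a constant (variance 0). A quick way to confirm which columns survived the threshold is the transformer's boolean support mask:

# Boolean mask of the columns kept by VarianceThreshold (expected: [ True False  True])
print(vt.get_support())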
X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values
y = (adult["Earnings-Raw"] == ' >50K').values
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
transformer = SelectKBest(score_func=chi2, k=3)
Xt_chi2 = transformer.fit_transform(X, y)
print(transformer.scores_)
[8.60061182e+03 2.40142178e+03 8.21924671e+07 1.37214589e+06
 6.47640900e+03]
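SelectKBest keeps the k columns with the highest scores; to see which of the five original columns were selected by the chi-squared scores above, the support indices can be inspected:

# Indices of the columns kept by SelectKBest (k=3 as configured above)
print(transformer.get_support(indices=True))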
from scipy.stats import pearsonr

def multivariate_pearsonr(X, y):
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:,column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))
transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)
print(transformer.scores_)
[0.2340371  0.33515395 0.22332882 0.15052631 0.22968907]
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy')
scores_pearson = cross_val_score(clf, Xt_pearson, y, scoring='accuracy')
print("Chi2 performance: {0:.3f}".format(scores_chi2.mean()))
print("Pearson performance: {0:.3f}".format(scores_pearson.mean()))
Chi2 performance: 0.829
Pearson performance: 0.772
from sklearn.base import TransformerMixin
from sklearn.utils import as_float_array

class MeanDiscrete(TransformerMixin):
    def fit(self, X, y=None):
        X = as_float_array(X)
        self.mean = np.mean(X, axis=0)
        return self

    def transform(self, X):
        X = as_float_array(X)
        assert X.shape[1] == self.mean.shape[0]
        return X > self.mean
mean_discrete = MeanDiscrete()
X_mean = mean_discrete.fit_transform(X)
%%file adult_tests.py
import numpy as np
from numpy.testing import assert_array_equal

def test_meandiscrete():
    X_test = np.array([[ 0,  2],
                        [ 3,  5],
                        [ 6,  8],
                        [ 9, 11],
                        [12, 14],
                        [15, 17],
                        [18, 20],
                        [21, 23],
                        [24, 26],
                        [27, 29]])
    mean_discrete = MeanDiscrete()
    mean_discrete.fit(X_test)
    assert_array_equal(mean_discrete.mean, np.array([13.5, 15.5]))
    X_transformed = mean_discrete.transform(X_test)
    X_expected = np.array([[ 0,  0],
                            [ 0, 0],
                            [ 0, 0],
                            [ 0, 0],
                            [ 0, 0],
                            [ 1, 1],
                            [ 1, 1],
                            [ 1, 1],
                            [ 1, 1],
                            [ 1, 1]])
    assert_array_equal(X_transformed, X_expected)
Writing adult_tests.py
test_meandiscrete()
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

Cell In [41], line 1
----> 1 test_meandiscrete()


NameError: name 'test_meandiscrete' is not defined
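The NameError above happens because the %%file cell magic only writes the cell contents to adult_tests.py; it does not execute them, so test_meandiscrete is never defined in the notebook's namespace. One workaround (a sketch, assuming MeanDiscrete is already defined in the current session) is to run the saved file inside the interactive namespace before calling the test:

# Run adult_tests.py in the notebook namespace (-i) so that MeanDiscrete
# defined above stays visible and test_meandiscrete gets defined here as well
%run -i adult_tests.py
test_meandiscrete()  # passes silently if the assertions hold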
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('mean_discrete', MeanDiscrete()),
                     ('classifier', DecisionTreeClassifier(random_state=14))])
scores_mean_discrete = cross_val_score(pipeline, X, y, scoring='accuracy')
print("Mean Discrete performance: {0:.3f}".format(scores_mean_discrete.mean()))
Mean Discrete performance: 0.803

Part 2

Dataset:

http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements

Cracking CAPTCHAs with Neural Networks

import numpy as np
from PIL import Image, ImageDraw, ImageFont
from skimage import transform as tf
def create_captcha(text, shear=0, size=(100,24)):
    im = Image.new("L", size, "black")
    draw = ImageDraw.Draw(im)
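    # Assumption: Coval.otf is in the working directory; any TrueType/OpenType font file path can be substituted here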
    font = ImageFont.truetype(r"Coval.otf", 22)
    draw.text((2, 2), text, fill=1, font=font)
    image = np.array(im)
    affine_tf = tf.AffineTransform(shear=shear)
    image = tf.warp(image, affine_tf)
    return image / image.max()
%matplotlib inline
from matplotlib import pyplot as plt
image = create_captcha("GENE", shear=0.3)
plt.imshow(image, cmap="gray")
<matplotlib.image.AxesImage at 0x1b3b5c3d060>



from skimage.measure import label, regionprops

def segment_image(image):
    labeled_image = label(image > 0)
    subimages = []
    for region in regionprops(labeled_image):
        start_x, start_y, end_x, end_y = region.bbox
        subimages.append(image[start_x:end_x, start_y:end_y])
    if len(subimages) == 0:
        return [image,]
    return subimages
subimages = segment_image(image)
f, axes = plt.subplots(1, len(subimages), figsize=(10, 3))
for i in range(len(subimages)):
    axes[i].imshow(subimages[i], cmap="gray")



from sklearn.utils import check_random_state
random_state = check_random_state(14)
letters = list("ACBDEFGHIJKLMNOPQRSTUVWXYZ")
shear_values = np.arange(0, 0.5, 0.05)
def generate_sample(random_state=None):
    random_state = check_random_state(random_state)
    letter = random_state.choice(letters)
    shear = random_state.choice(shear_values)
    return create_captcha(letter, shear=shear, size=(20, 20)), letters.index(letter)
image, target = generate_sample(random_state)
plt.imshow(image, cmap="gray")
print("The target for this image is: {0}".format(target))
The target for this image is: 11

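The numeric target is just an index into the letters list defined above, so it can be mapped back to the character it represents:

# Map the class index back to its letter (for target 11 this prints 'L')
print(letters[target])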

dataset, targets = zip(*(generate_sample(random_state) for i in range(3000)))
dataset = np.array(dataset, dtype='float')
targets = np.array(targets)
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
y = onehot.fit_transform(targets.reshape(targets.shape[0], 1))
y = y.todense()
from skimage.transform import resize
dataset = np.array([resize(segment_image(sample)[0], (20, 20)) for sample in dataset])
X = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9)

from pybrain.datasets import SupervisedDataSet


training = SupervisedDataSet(X.shape[1], y.shape[1])
for i in range(X_train.shape[0]):
    training.addSample(X_train[i], y_train[i])
testing = SupervisedDataSet(X.shape[1], y.shape[1])
for i in range(X_test.shape[0]):
    testing.addSample(X_test[i], y_test[i])
import scipy
from pybrain.tools.shortcuts import buildNetwork
net = buildNetwork(X.shape[1], 100, y.shape[1], bias=True)
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(net, training, learningrate=0.01,
weightdecay=0.01)
trainer.trainEpochs(epochs=20)
predictions = trainer.testOnClassData(dataset=testing)

The averaging argument (average=...) in the line below was added to handle the multiclass scoring.

from sklearn.metrics import f1_score
print("F-score: {0:.2f}".format(f1_score(predictions, y_test.argmax(axis=1), average='weighted')))
F-score: 0.89
from sklearn.metrics import classification_report
print(classification_report(y_test.argmax(axis=1), predictions))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       0.72      1.00      0.84        13
           2       1.00      0.83      0.91        12
           3       1.00      1.00      1.00        12
           4       0.00      0.00      0.00        13
           5       0.41      1.00      0.58         9
           6       1.00      1.00      1.00        12
           7       1.00      1.00      1.00        12
           8       0.36      1.00      0.53         9
           9       0.00      0.00      0.00        10
          10       1.00      1.00      1.00        13
          11       0.33      0.14      0.20         7
          12       0.92      0.92      0.92        13
          13       0.95      1.00      0.98        20
          14       0.91      1.00      0.95        10
          15       0.90      1.00      0.95        19
          16       1.00      0.50      0.67        12
          17       1.00      1.00      1.00        13
          18       1.00      1.00      1.00        10
          19       1.00      1.00      1.00        11
          20       0.00      0.00      0.00         2
          21       1.00      0.93      0.96        14
          22       1.00      1.00      1.00        12
          23       1.00      1.00      1.00        13
          24       1.00      1.00      1.00         8
          25       1.00      1.00      1.00        13

    accuracy                           0.86       300
   macro avg       0.79      0.82      0.79       300
weighted avg       0.84      0.86      0.84       300

D:\coder\randomnumbers\venv\lib\site-packages\sklearn\metrics\_classification.py:1334: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
D:\coder\randomnumbers\venv\lib\site-packages\sklearn\metrics\_classification.py:1334: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
D:\coder\randomnumbers\venv\lib\site-packages\sklearn\metrics\_classification.py:1334: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
def predict_captcha(captcha_image, neural_network):
    subimages = segment_image(captcha_image)
    predicted_word = ""
    for subimage in subimages:
        subimage = resize(subimage, (20, 20))
        # Use the network that was passed in rather than the global one
        outputs = neural_network.activate(subimage.flatten())
        prediction = np.argmax(outputs)
        predicted_word += letters[prediction]
    return predicted_word
word = "GENE"
captcha = create_captcha(word, shear=0.2)
print(predict_captcha(captcha, net))
ANAA
def test_prediction(word, net, shear=0.2):
    captcha = create_captcha(word, shear=shear)
    prediction = predict_captcha(captcha, net)
    prediction = prediction[:4]
    return word == prediction, word, prediction
from nltk.corpus import words
import nltk



valid_words = [word.upper() for word in words.words() if len(word) == 4]
num_correct = 0
num_incorrect = 0
for word in valid_words:
    correct, word, prediction = test_prediction(word, net, shear=0.2)
    if correct:
        num_correct += 1
    else:
        num_incorrect += 1
print("Number correct is {0}".format(num_correct))
print("Number incorrect is {0}".format(num_incorrect))
Number correct is 57
Number incorrect is 5456
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(np.argmax(y_test, axis=1), predictions)
plt.figure(figsize=(20,20))
plt.imshow(cm, cmap="Blues")
<matplotlib.image.AxesImage at 0x1b3ba7a4f40>


from nltk.metrics import edit_distance
steps = edit_distance("STEP", "STOP")
print("The number of steps needed is: {0}".format(steps))
The number of steps needed is: 1
def compute_distance(prediction, word):
    return len(prediction) - sum(prediction[i] == word[i] for i in range(len(prediction)))
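compute_distance simply counts position-wise mismatches; because the predictions are truncated to four letters, only substitutions matter here, which makes it a cheap stand-in for a full edit distance. For example:

# Three of the four positions match, so the distance is 1
print(compute_distance("GENA", "GENE"))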
from operator import itemgetter
def improved_prediction(word, net, dictionary, shear=0.2):
    captcha = create_captcha(word, shear=shear)
    prediction = predict_captcha(captcha, net)
    prediction = prediction[:4]
    if prediction not in dictionary:
        distances = sorted([(word, compute_distance(prediction, word))
                            for word in dictionary],
                           key=itemgetter(1))
        best_word = distances[0]
        prediction = best_word[0]
    return word == prediction, word, prediction
num_correct = 0
num_incorrect = 0
for word in valid_words:
    correct, word, prediction = improved_prediction(word, net, valid_words, shear=0.2)
    if correct:
        num_correct += 1
    else:
        num_incorrect += 1
print("Number correct is {0}".format(num_correct))
print("Number incorrect is {0}".format(num_incorrect))
Number correct is 123
Number incorrect is 5390
