pandas 6: Handling missing values (NaN) in a DataFrame

2023-11-07

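What counts as a missing value in pandas

1. What is NaN

NaN ("Not a Number") marks a value that is missing. It is a special floating-point value that compares unequal to every value, including itself, so it has to be detected with pd.isnull / pd.notnull rather than with ==.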

from numpy import nan
NaN = nan  # the np.NaN alias was removed in NumPy 2.0, so define it locally
print(nan)
nan
print(NaN==True)
print(NaN==False)
print(NaN==0)
print(NaN=='')
print(NaN==NaN)
print(NaN==nan)
False
False
False
False
False
False
import pandas as pd
x = NaN
y = nan
n = 20
print(pd.isnull(x))
print(pd.isnull(y))
print(pd.notnull(n))
True
True
True
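
The comparisons above can be condensed into a minimal sketch. It only assumes that numpy and pandas are installed; the variable name x is reused from the example above.

```python
import math

import numpy as np
import pandas as pd

x = np.nan

# NaN is an ordinary Python float rather than a value without a type.
print(type(x))        # <class 'float'>

# NaN never compares equal, not even to itself, so == cannot detect it.
print(x == x)         # False
print(x != x)         # True

# Use a dedicated check instead.
print(math.isnan(x))  # True
print(pd.isna(x))     # True
```

pd.isna is an alias of pd.isnull, so either name works.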

2. Where missing data comes from

2.1 NaN from missing values in the source data


import pandas as pd
# Load the data

visited_file = './data/survey_visited.csv'
print(pd.read_csv(visited_file))

   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
# Keep empty fields as empty strings instead of converting them to NaN
print(pd.read_csv(visited_file,na_values=[' '],keep_default_na=False))
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3            
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
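
To try the same behaviour without the data file, here is a small sketch that reads an inline CSV string. The two data rows are a hypothetical stand-in for survey_visited.csv, and the names csv_text, default, and kept are made up for this illustration.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for survey_visited.csv.
csv_text = "ident,site,dated\n619,DR-1,1927-02-08\n752,DR-3,\n"

# Default behaviour: an empty field is parsed as NaN.
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: the empty field is kept as an empty string.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["dated"].isna().tolist())  # [False, True]
print(kept["dated"].tolist())            # ['1927-02-08', '']
```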

2.2 NaN from merging data

# Merge the data
visited = pd.read_csv('./data/survey_visited.csv')
survey = pd.read_csv('./data/survey_survey.csv')
print(visited)
print(survey)
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
    taken person quant  reading
0     619   dyer   rad     9.82
1     619   dyer   sal     0.13
2     622   dyer   rad     7.80
3     622   dyer   sal     0.09
4     734     pb   rad     8.41
5     734   lake   sal     0.05
6     734     pb  temp   -21.50
7     735     pb   rad     7.22
8     735    NaN   sal     0.06
9     735    NaN  temp   -26.00
10    751     pb   rad     4.35
11    751     pb  temp   -18.50
12    751   lake   sal     0.10
13    752   lake   rad     2.19
14    752   lake   sal     0.09
15    752   lake  temp   -16.00
16    752    roe   sal    41.60
17    837   lake   rad     1.46
18    837   lake   sal     0.21
19    837    roe   sal    22.50
20    844    roe   rad    11.25
vs = visited.merge(survey,left_on='ident',right_on='taken')
print(vs)
    ident   site       dated  taken person quant  reading
0     619   DR-1  1927-02-08    619   dyer   rad     9.82
1     619   DR-1  1927-02-08    619   dyer   sal     0.13
2     622   DR-1  1927-02-10    622   dyer   rad     7.80
3     622   DR-1  1927-02-10    622   dyer   sal     0.09
4     734   DR-3  1939-01-07    734     pb   rad     8.41
5     734   DR-3  1939-01-07    734   lake   sal     0.05
6     734   DR-3  1939-01-07    734     pb  temp   -21.50
7     735   DR-3  1930-01-12    735     pb   rad     7.22
8     735   DR-3  1930-01-12    735    NaN   sal     0.06
9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
10    751   DR-3  1930-02-26    751     pb   rad     4.35
11    751   DR-3  1930-02-26    751     pb  temp   -18.50
12    751   DR-3  1930-02-26    751   lake   sal     0.10
13    752   DR-3         NaN    752   lake   rad     2.19
14    752   DR-3         NaN    752   lake   sal     0.09
15    752   DR-3         NaN    752   lake  temp   -16.00
16    752   DR-3         NaN    752    roe   sal    41.60
17    837  MSK-4  1932-01-14    837   lake   rad     1.46
18    837  MSK-4  1932-01-14    837   lake   sal     0.21
19    837  MSK-4  1932-01-14    837    roe   sal    22.50
20    844   DR-1  1932-03-22    844    roe   rad    11.25
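
Note that the default (inner) merge above only carries over NaN values that were already in the inputs. A merge creates new NaN values when a key has no match and the unmatched row is kept, for example with how='outer'. The sketch below uses two hypothetical miniature tables (visited_small and survey_small are not part of the original data) to show this.

```python
import pandas as pd

# Hypothetical miniature tables: ident 999 has no survey rows,
# and taken 1000 has no matching visit.
visited_small = pd.DataFrame({"ident": [619, 999], "site": ["DR-1", "DR-9"]})
survey_small = pd.DataFrame({"taken": [619, 1000], "reading": [9.82, 1.5]})

# Inner merge (the default) drops unmatched rows, so no NaN is created.
inner = visited_small.merge(survey_small, left_on="ident", right_on="taken")

# Outer merge keeps unmatched rows and pads the missing side with NaN.
outer = visited_small.merge(
    survey_small, left_on="ident", right_on="taken", how="outer"
)

print(len(inner))                      # 1
print(len(outer))                      # 3
print(int(outer.isna().sum().sum()))   # 4
```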

2.3 NaN entered by the user


scientists = pd.DataFrame({
    'Name':['Bill','Mike'],
    'Occupation':['Chemist','Statist'],
})
print(scientists)
   Name Occupation
0  Bill    Chemist
1  Mike    Statist
from numpy import nan

# Assigning a scalar NaN creates a column in which every value is missing
scientists['missing'] = nan
print(scientists)
   Name Occupation  missing
0  Bill    Chemist      NaN
1  Mike    Statist      NaN
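
Once missing values have been entered by hand, it is often useful to count them per column. The following is a minimal sketch; the Age column and the None entry are hypothetical additions for illustration and do not come from the example above.

```python
import numpy as np
import pandas as pd

scientists = pd.DataFrame({
    "Name": ["Bill", "Mike"],
    "Occupation": ["Chemist", None],  # None in a text column counts as missing
    "Age": [np.nan, 45],              # NaN forces the column to float64
})

# isna() marks each missing cell; sum() counts them per column.
print(scientists.isna().sum().to_dict())  # {'Name': 0, 'Occupation': 1, 'Age': 1}
print(scientists["Age"].dtype)            # float64
```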

2.4 NaN from reindexing


gapminder = pd.read_csv('./data/gapminder.tsv',sep='\t')
print(gapminder.head(5))
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
print(life_exp)
year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

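Labels that are not present in the original index come back as NaN. Use reindex for this: in current pandas versions, passing missing labels to .loc raises a KeyError instead.

print(life_exp.reindex(range(2000,2003)))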
year
2000          NaN
2001          NaN
2002    65.694923
Name: lifeExp, dtype: float64
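
The reindexing behaviour can be reproduced without the gapminder file. This sketch uses a hypothetical two-element Series (the values 65.7 and 67.0 are rounded placeholders) and also shows the fill_value parameter of reindex. It assumes pandas 1.0 or later, where .loc rejects labels that are not in the index.

```python
import pandas as pd

# Hypothetical values standing in for the yearly means above.
s = pd.Series([65.7, 67.0], index=[2002, 2007], name="lifeExp")

# reindex inserts NaN for every label that is not in the original index.
r = s.reindex(range(2000, 2004))
print(r.isna().tolist())      # [True, True, False, True]

# fill_value replaces the NaN placeholder at reindex time.
filled = s.reindex(range(2000, 2004), fill_value=0.0)
print(filled.tolist())        # [0.0, 0.0, 65.7, 0.0]

# .loc with a missing label raises KeyError (pandas 1.0 and later).
try:
    s.loc[[2000, 2002]]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)             # True
```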

3. Handling missing values

3.1 Filling NaN

import pandas as pd
gapminder = pd.read_csv('./data/gapminder.tsv',sep='\t')
print(gapminder.head(5))
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
a = life_exp.reindex(range(2000,2010))  # .loc raises KeyError for missing labels in current pandas
print(a)
year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64

Fill with a specified value

print(a.fillna(0))
print(a.fillna('*'))
year
2000     0.000000
2001     0.000000
2002    65.694923
2003     0.000000
2004     0.000000
2005     0.000000
2006     0.000000
2007    67.007423
2008     0.000000
2009     0.000000
Name: lifeExp, dtype: float64
year
2000          *
2001          *
2002    65.6949
2003          *
2004          *
2005          *
2006          *
2007    67.0074
2008          *
2009          *
Name: lifeExp, dtype: object
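
A constant such as 0 or '*' is not always a sensible replacement. Two common alternatives are filling with a statistic of the column and, for a DataFrame, giving a different fill value per column through a dict. The sketch below uses small hypothetical data (s and df are made up for this illustration).

```python
import numpy as np
import pandas as pd

# Fill with the mean of the non-missing values: (2 + 4) / 2 = 3.
s = pd.Series([np.nan, 2.0, np.nan, 4.0])
mean_filled = s.fillna(s.mean())
print(mean_filled.tolist())   # [3.0, 2.0, 3.0, 4.0]

# Fill each DataFrame column with its own value.
df = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, "b"]})
per_column = df.fillna({"x": 0.0, "y": "unknown"})
print(per_column.to_dict("list"))  # {'x': [1.0, 0.0], 'y': ['unknown', 'b']}
```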

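Forward fill

Each NaN takes the last valid value before it, so leading NaN values stay unfilled.

print(a.ffill())  # equivalent to the older a.fillna(method='ffill'), which is deprecated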
year
2000          NaN
2001          NaN
2002    65.694923
2003    65.694923
2004    65.694923
2005    65.694923
2006    65.694923
2007    67.007423
2008    67.007423
2009    67.007423
Name: lifeExp, dtype: float64

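Backward fill

Each NaN takes the next valid value after it, so trailing NaN values stay unfilled.

print(a.bfill())  # equivalent to the older a.fillna(method='bfill'), which is deprecated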
year
2000    65.694923
2001    65.694923
2002    65.694923
2003    67.007423
2004    67.007423
2005    67.007423
2006    67.007423
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64

Backward fill, then forward fill

print(a.bfill().ffill())
year
2000    65.694923
2001    65.694923
2002    65.694923
2003    67.007423
2004    67.007423
2005    67.007423
2006    67.007423
2007    67.007423
2008    67.007423
2009    67.007423
Name: lifeExp, dtype: float64
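
Forward and backward fill can spread one observation across a long gap. The limit parameter caps how many consecutive NaN values are filled. A minimal sketch with a hypothetical five-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one NaN after each valid value; the rest of the gap stays NaN.
limited = s.ffill(limit=1)
print(limited.tolist())   # [1.0, 1.0, nan, nan, 5.0]
```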

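Linear interpolation

By default interpolate() fills forward only: leading NaN values are left alone, and trailing NaN values repeat the last valid value.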

print(a)
print(a.interpolate())
year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64
year
2000          NaN
2001          NaN
2002    65.694923
2003    65.957423
2004    66.219923
2005    66.482423
2006    66.744923
2007    67.007423
2008    67.007423
2009    67.007423
Name: lifeExp, dtype: float64
aa = pd.Series([nan,nan,2,nan,4,nan,6,nan,8,nan,nan,nan])
print(aa.interpolate())
0     NaN
1     NaN
2     2.0
3     3.0
4     4.0
5     5.0
6     6.0
7     7.0
8     8.0
9     8.0
10    8.0
11    8.0
dtype: float64
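
The handling of leading and trailing NaN values can be changed with the limit_direction and limit_area parameters of interpolate. This sketch uses a hypothetical five-element Series to compare the options.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Fill in both directions: the leading NaN takes the first valid value.
both = s.interpolate(limit_direction="both")
print(both.tolist())     # [2.0, 2.0, 3.0, 4.0, 4.0]

# Only fill gaps that are surrounded by valid values.
inside = s.interpolate(limit_area="inside")
print(inside.tolist())   # [nan, 2.0, 3.0, 4.0, nan]
```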

3.2 Dropping rows that contain NaN

print(a)
print(a.dropna())
year
2000          NaN
2001          NaN
2002    65.694923
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    67.007423
2008          NaN
2009          NaN
Name: lifeExp, dtype: float64
year
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
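
On a DataFrame, dropna() removes every row that has at least one NaN, which can discard more data than intended. The how and subset parameters narrow this down. A minimal sketch with a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
})

# Default: drop rows with any NaN.
print(df.dropna().index.tolist())               # [2]

# Drop only rows in which every value is NaN.
print(df.dropna(how="all").index.tolist())      # [0, 2, 3]

# Only look at column "a" when deciding what to drop.
print(df.dropna(subset=["a"]).index.tolist())   # [0, 2]
```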