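Some testing shows that, when datetime data is formatted with format="%H:%M:%S.%f", %f can give nanosecond resolution, provided the ninth digit after the decimal point is non-zero. When a string is formatted, a variable number of trailing zeros, from none to five, is added. The number depends on the position of the least significant non-zero digit after the decimal point, given that it is also the final digit. Here is a table from the test data, where position is the place of the least significant non-zero digit (which is also the final digit) and zeros is the number of trailing zeros added by formatting: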
position  zeros
9         0
8         1
7         2
6         0
5         1
4         2
3         3
2         4
1         5
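The table can be reproduced with a short script. This is only a minimal sketch: it assumes a pandas version whose Timestamp string form shows either six or nine fractional digits, as in the outputs in this answer, and the helper name displayed_fraction is made up for illustration.

```python
import pandas as pd

FMT = "%H:%M:%S.%f"


def displayed_fraction(position):
    """Return the fractional digits pandas displays for a time string whose
    only non-zero fractional digit is a 1 at the given position (1 to 9)."""
    raw = "10:00:00." + "0" * (position - 1) + "1"
    ts = pd.to_datetime(raw, format=FMT)
    # str(ts) has the form 'date time.fraction'; keep the part after the dot.
    return str(ts).split(".")[1]


# (position, number of trailing zeros that the display added)
rows = [(p, len(displayed_fraction(p)) - p) for p in range(9, 0, -1)]
for position, zeros in rows:
    print(position, zeros)
```

The pattern suggests that the display uses six digits when the value is a whole number of microseconds and nine digits otherwise.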
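When a column as a whole is formatted with "%H:%M:%S.%f", all of its elements end up with the same number of digits after the decimal point. This is done by adding or removing trailing zeros, even if that increases or decreases the resolution of the original data. I suppose the reason is consistency and pleasing aesthetics. It usually does not introduce undue error, because in numerical computation trailing zeros usually do not affect the immediate result, but they can affect the estimate of its error and how it should be presented (trailing zeros http://academic.umf.maine.edu/magri/PUBLIC.acd/tools/SigFigsAndRounding.html#TrailingZeros, rules for significant figures http://ccnmtl.columbia.edu/projects/mmt/frontiers/web/chapter_5/6665.html).
Below are some observations from applying the "%H:%M:%S.%f" format with pandas.to_datetime to individual strings and to a pandas.Series (a DataFrame column), and from applying pandas.DataFrame.convert_objects(convert_dates='coerce') to a DataFrame that has a column convertible to datetime.
On a string, pandas with "%H:%M:%S.%f" preserves a non-zero digit in the ninth place after the decimal point in the time conversion, and adds a date if none is provided: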
import pandas as pd
pd.to_datetime ("10:00:00.000000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000001')
pd.to_datetime ("2015-09-17 10:00:00.000000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[15]: Timestamp('2015-09-17 10:00:00.000000001')
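If the 1900-01-01 placeholder date is unwanted, one option is to keep only the time-of-day part and add it to a real date. A sketch, where the target date 2015-09-17 is just an example value chosen for illustration:

```python
import pandas as pd

ts = pd.to_datetime("10:00:00.000000001", format="%H:%M:%S.%f")

# Time elapsed since midnight of the placeholder date (1900-01-01).
time_of_day = ts - ts.normalize()

# Attach that time of day to a chosen date; the nanosecond digit is kept.
stamped = pd.Timestamp("2015-09-17") + time_of_day
print(stamped)
```

For a whole column, prepending the date to each string and parsing with "%Y-%m-%d %H:%M:%S.%f" should work just as well.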
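For positions before the ninth, in the tests where the final non-zero digit is also the final digit, it adds up to five trailing zeros after the final non-zero digit, which increases the resolution of the original data, unless the final non-zero digit is in the sixth place to the right of the decimal point: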
pd.to_datetime ("10:00:00.00000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000010')
pd.to_datetime ("2015-09-17 10:00:00.00000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[16]: Timestamp('2015-09-17 10:00:00.000000010')
pd.to_datetime ("10:00:00.0000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000100')
pd.to_datetime ("2015-09-17 10:00:00.0000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[17]: Timestamp('2015-09-17 10:00:00.000000100')
pd.to_datetime ("10:00:00.000001",format="%H:%M:%S.%f")
Out[33]: Timestamp('1900-01-01 10:00:00.000001')
pd.to_datetime ("2015-09-17 10:00:00.000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[18]: Timestamp('2015-09-17 10:00:00.000001')
pd.to_datetime ("10:00:00.00001",format="%H:%M:%S.%f")
Out[6]: Timestamp('1900-01-01 10:00:00.000010')
pd.to_datetime ("2015-09-17 10:00:00.00001",format="%Y-%m-%d %H:%M:%S.%f")
Out[19]: Timestamp('2015-09-17 10:00:00.000010')
pd.to_datetime ("10:00:00.0001",format="%H:%M:%S.%f")
Out[9]: Timestamp('1900-01-01 10:00:00.000100')
pd.to_datetime ("2015-09-17 10:00:00.0001",format="%Y-%m-%d %H:%M:%S.%f")
Out[21]: Timestamp('2015-09-17 10:00:00.000100')
pd.to_datetime ("10:00:00.001",format="%H:%M:%S.%f")
Out[10]: Timestamp('1900-01-01 10:00:00.001000')
pd.to_datetime ("2015-09-17 10:00:00.001",format="%Y-%m-%d %H:%M:%S.%f")
Out[22]: Timestamp('2015-09-17 10:00:00.001000')
pd.to_datetime ("10:00:00.01",format="%H:%M:%S.%f")
Out[12]: Timestamp('1900-01-01 10:00:00.010000')
pd.to_datetime ("2015-09-17 10:00:00.01",format="%Y-%m-%d %H:%M:%S.%f")
Out[24]: Timestamp('2015-09-17 10:00:00.010000')
pd.to_datetime ("10:00:00.1",format="%H:%M:%S.%f")
Out[13]: Timestamp('1900-01-01 10:00:00.100000')
pd.to_datetime ("2015-09-17 10:00:00.1",format="%Y-%m-%d %H:%M:%S.%f")
Out[26]: Timestamp('2015-09-17 10:00:00.100000')
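It may help to see that the trailing zeros are only a matter of display. As far as I can tell, the parsed value is the same whether or not the zeros are written in the input, which the following sketch checks by comparing two Timestamps:

```python
import pandas as pd

FMT = "%H:%M:%S.%f"

short = pd.to_datetime("10:00:00.1", format=FMT)
padded = pd.to_datetime("10:00:00.100000000", format=FMT)

# Both strings describe the same instant, so the Timestamps compare equal.
same_value = short == padded

# Fractional part of the second, expressed in nanoseconds.
fraction_ns = short.microsecond * 1000 + short.nanosecond

print(same_value, fraction_ns)
```

So the padding changes how many digits are shown, not the instant that is stored.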
Let's see how this works with a DataFrame:
!type test.csv # here type is Windows substitute for Linux cat command
date,mesg
10:00:00.000000001,one
10:00:00.00000001,two
10:00:00.0000001,three
10:00:00.000001,four
10:00:00.00001,five
10:00:00.0001,six
10:00:00.001,seven
10:00:00.01,eight
10:00:00.1,nine
10:00:00.000000001,ten
10:00:00.000000002,eleven
10:00:00.000000003,twelve
df = pd.read_csv('test.csv')
df
Out[30]:
date mesg
0 10:00:00.000000001 one
1 10:00:00.00000001 two
2 10:00:00.0000001 three
3 10:00:00.000001 four
4 10:00:00.00001 five
5 10:00:00.0001 six
6 10:00:00.001 seven
7 10:00:00.01 eight
8 10:00:00.1 nine
9 10:00:00.000000001 ten
10 10:00:00.000000002 eleven
11 10:00:00.000000003 twelve
df.dtypes
Out[31]:
date object
mesg object
dtype: object
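Datetime conversion of the DataFrame with convert_objects (which has no format option) gives microsecond resolution, even where some of the raw data has a lower or higher resolution than that, and adds today's date: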
df2 = df.convert_objects(convert_dates='coerce')
df2
Out[32]:
date mesg
0 2015-09-17 10:00:00.000000 one
1 2015-09-17 10:00:00.000000 two
2 2015-09-17 10:00:00.000000 three
3 2015-09-17 10:00:00.000001 four
4 2015-09-17 10:00:00.000010 five
5 2015-09-17 10:00:00.000100 six
6 2015-09-17 10:00:00.001000 seven
7 2015-09-17 10:00:00.010000 eight
8 2015-09-17 10:00:00.100000 nine
9 2015-09-17 10:00:00.000000 ten
10 2015-09-17 10:00:00.000000 eleven
11 2015-09-17 10:00:00.000000 twelve
df2.dtypes
Out[33]:
date datetime64[ns]
mesg object
dtype: object
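The higher resolution of the values in the DataFrame column created from the raw data (some of which is finer than microseconds) cannot be recovered once the datetime conversion has been done without an explicit format specifier (i.e., with DataFrame.convert_objects):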
df2['date'] = pd.to_datetime(df2['date'],format="%H:%M:%S.%f")
df2
Out[34]:
date mesg
0 2015-09-17 10:00:00.000000 one
1 2015-09-17 10:00:00.000000 two
2 2015-09-17 10:00:00.000000 three
3 2015-09-17 10:00:00.000001 four
4 2015-09-17 10:00:00.000010 five
5 2015-09-17 10:00:00.000100 six
6 2015-09-17 10:00:00.001000 seven
7 2015-09-17 10:00:00.010000 eight
8 2015-09-17 10:00:00.100000 nine
9 2015-09-17 10:00:00.000000 ten
10 2015-09-17 10:00:00.000000 eleven
11 2015-09-17 10:00:00.000000 twelve
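Formatting a DataFrame column with "%H:%M:%S.%f" before any datetime conversion gives nanosecond resolution when at least one element has a non-zero digit in the ninth place (as advertised in the pandas.to_datetime documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html). It also raises the resolution of raw data that is coarser than nanoseconds to that level, and adds 1900-01-01 as the date: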
df3 = df.copy(deep=True)
df3['date'] = pd.to_datetime(df3['date'],format="%H:%M:%S.%f",coerce=True)
df3
Out[35]:
date mesg
0 1900-01-01 10:00:00.000000001 one
1 1900-01-01 10:00:00.000000010 two
2 1900-01-01 10:00:00.000000100 three
3 1900-01-01 10:00:00.000001000 four
4 1900-01-01 10:00:00.000010000 five
5 1900-01-01 10:00:00.000100000 six
6 1900-01-01 10:00:00.001000000 seven
7 1900-01-01 10:00:00.010000000 eight
8 1900-01-01 10:00:00.100000000 nine
9 1900-01-01 10:00:00.000000001 ten
10 1900-01-01 10:00:00.000000002 eleven
11 1900-01-01 10:00:00.000000003 twelve
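For anyone on a newer pandas: to my knowledge the coerce=True keyword of to_datetime was replaced by errors="coerce", and convert_objects was deprecated and later removed. The sketch below repeats the conversion above with the newer keyword. It uses a shortened inline copy of test.csv so that it is self-contained; the subset of rows is my own choice.

```python
import io

import pandas as pd

# A shortened inline stand-in for test.csv.
csv_text = """date,mesg
10:00:00.000000001,one
10:00:00.00000001,two
10:00:00.0000001,three
10:00:00.000001,four
10:00:00.1,nine
10:00:00.000000003,twelve
"""

df = pd.read_csv(io.StringIO(csv_text))

# errors="coerce" turns unparseable strings into NaT instead of raising.
df["date"] = pd.to_datetime(df["date"], format="%H:%M:%S.%f", errors="coerce")

# Inspect the sub-second parts directly rather than relying on the repr.
nanos = df["date"].dt.nanosecond.tolist()
micros = df["date"].dt.microsecond.tolist()
print(df)
print(micros, nanos)
```

Reading the microsecond and nanosecond fields directly avoids any doubt about what the column display is padding or hiding.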
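Formatting a DataFrame column with "%H:%M:%S.%f" adds zeros after the datum whose least significant non-zero digit is lowest after the decimal point (across the whole column, and according to the position and zeros table above), and aligns the resolution of all the other data with it, even if doing so increases or decreases the resolution of some of the raw data: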
df4 = pd.read_csv('test2.csv')
df4
Out[36]:
date mesg
0 10:00:00.000000000 one
1 10:00:00.00000000 two
2 10:00:00.0000000 three
3 10:00:00.000000 four
4 10:00:00.00000 five
5 10:00:00.0001 six
6 10:00:00.00 seven
7 10:00:00.0 eight
8 10:00:00. nine
9 10:00:00.000000000 ten
10 10:00:00.000000000 eleven
11 10:00:00.00000000 twelve
df4['date'] = pd.to_datetime(df4['date'],format="%H:%M:%S.%f",coerce=True)
df4
Out[37]:
date mesg
0 1900-01-01 10:00:00.000000 one
1 1900-01-01 10:00:00.000000 two
2 1900-01-01 10:00:00.000000 three
3 1900-01-01 10:00:00.000000 four
4 1900-01-01 10:00:00.000000 five
5 1900-01-01 10:00:00.000100 six
6 1900-01-01 10:00:00.000000 seven
7 1900-01-01 10:00:00.000000 eight
8 NaT nine # nothing after decimal point in raw data
9 1900-01-01 10:00:00.000000 ten
10 1900-01-01 10:00:00.000000 eleven
11 1900-01-01 10:00:00.000000 twelve
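The NaT in row 8 appears because %f needs at least one digit after the decimal point. If such rows should be kept, one possible workaround is to pad a bare trailing dot with a zero before parsing. This is only a sketch: the regular expression and the sample values are my own, and errors="coerce" is the newer spelling of coerce=True.

```python
import pandas as pd

FMT = "%H:%M:%S.%f"
raw = pd.Series(["10:00:00.0001", "10:00:00.0", "10:00:00."])

# The last string has nothing after the dot, so it becomes NaT.
parsed = pd.to_datetime(raw, format=FMT, errors="coerce")

# Turn a trailing "." into ".0" so that %f has a digit to match.
cleaned = raw.str.replace(r"\.$", ".0", regex=True)
parsed_clean = pd.to_datetime(cleaned, format=FMT, errors="coerce")

print(parsed.isna().tolist())
print(parsed_clean.isna().tolist())
```

Whether padding with a zero is appropriate depends on what a bare dot means in the raw data, so treat it as a suggestion only.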
The same thing happened when this was tried with the same DataFrame but with dates included in the date column:
df25
Out[38]:
date mesg
0 2015-09-10 10:00:00.000000000 one
1 2015-09-11 10:00:00.00000000 two
2 2015-09-12 10:00:00.0000000 three
3 2015-09-13 10:00:00.000000 four
4 2015-09-14 10:00:00.00000 five
5 2015-09-15 10:00:00.0001 six
6 2015-09-16 10:00:00.00 seven
7 2015-09-17 10:00:00.0 eight
8 2015-09-18 10:00:00. nine
9 2015-09-19 10:00:00.000000000 ten
10 2015-09-20 10:00:00.000000000 eleven
11 2015-09-21 10:00:00.00000000 twelve
df25['date'] = pd.to_datetime(df25['date'],format="%Y-%m-%d %H:%M:%S.%f",coerce=True)
df25
Out[39]:
date mesg
0 2015-09-10 10:00:00.000000 one
1 2015-09-11 10:00:00.000000 two
2 2015-09-12 10:00:00.000000 three
3 2015-09-13 10:00:00.000000 four
4 2015-09-14 10:00:00.000000 five
5 2015-09-15 10:00:00.000100 six
6 2015-09-16 10:00:00.000000 seven
7 2015-09-17 10:00:00.000000 eight
8 NaT nine # nothing after decimal point in raw data
9 2015-09-19 10:00:00.000000 ten
10 2015-09-20 10:00:00.000000 eleven
11 2015-09-21 10:00:00.000000 twelve
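When none of the raw data has a non-zero significant digit after the decimal point, formatting a DataFrame column with "%H:%M:%S.%f" may uniformly drop the fractional part for all data, leaving only the two zeros of the seconds field, even if this increases or decreases the resolution of some of the raw data: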
df5 = pd.read_csv('test3.csv')
df5
Out[40]:
date mesg
0 10:00:00.000 one
1 10:00:00.0 two
2 10:00:00.000 three
3 10:00:00.000 four
4 10:00:00.00 five
5 10:00:00.000 six
6 10:00:00.00 seven
7 10:00:00.0 eight
8 10:00:00.0 nine
9 10:00:00.000000000 ten
10 10:00:00.000 eleven
11 10:00:00.000 twelve
df5['date'] = pd.to_datetime(df5['date'],format="%H:%M:%S.%f",coerce=True)
df5
Out[41]:
date mesg
0 1900-01-01 10:00:00 one
1 1900-01-01 10:00:00 two
2 1900-01-01 10:00:00 three
3 1900-01-01 10:00:00 four
4 1900-01-01 10:00:00 five
5 1900-01-01 10:00:00 six
6 1900-01-01 10:00:00 seven
7 1900-01-01 10:00:00 eight
8 1900-01-01 10:00:00 nine
9 1900-01-01 10:00:00 ten
10 1900-01-01 10:00:00 eleven
11 1900-01-01 10:00:00 twelve
The same thing happened when testing with the same DataFrame but with dates included in the date column:
df45
Out[42]:
date mesg
0 2015-09-10 10:00:00.000 one
1 2015-09-11 10:00:00.0 two
2 2015-09-12 10:00:00.000 three
3 2015-09-13 10:00:00.000 four
4 2015-09-14 10:00:00.00 five
5 2015-09-15 10:00:00.000 six
6 2015-09-16 10:00:00.00 seven
7 2015-09-17 10:00:00.0 eight
8 2015-09-18 10:00:00.0 nine
9 2015-09-19 10:00:00.000000000 ten
10 2015-09-20 10:00:00.000 eleven
11 2015-09-21 10:00:00.000 twelve
df45['date'] = pd.to_datetime(df45['date'],format="%Y-%m-%d %H:%M:%S.%f",coerce=True)
df45
Out[43]:
date mesg
0 2015-09-10 10:00:00 one
1 2015-09-11 10:00:00 two
2 2015-09-12 10:00:00 three
3 2015-09-13 10:00:00 four
4 2015-09-14 10:00:00 five
5 2015-09-15 10:00:00 six
6 2015-09-16 10:00:00 seven
7 2015-09-17 10:00:00 eight
8 2015-09-18 10:00:00 nine
9 2015-09-19 10:00:00 ten
10 2015-09-20 10:00:00 eleven
11 2015-09-21 10:00:00 twelve
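If a fixed number of displayed digits is wanted, rather than whatever the column display picks, the values can be turned back into strings explicitly. A minimal sketch with made-up sample values, using Series.dt.strftime, whose %f writes six digits:

```python
import pandas as pd

raw = pd.Series(["10:00:00.000", "10:00:00.0", "10:00:00.1"])
parsed = pd.to_datetime(raw, format="%H:%M:%S.%f")

# On output, %f always writes six digits (microseconds), regardless of
# how many digits the column display shows.
fixed = parsed.dt.strftime("%H:%M:%S.%f")
print(fixed.tolist())
```

As far as I know, %f in strftime stops at microseconds, so this is not suitable when the nanosecond digits matter.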