经过一番痛苦的调试后,我可以确认慢速程序所采取的顺序,在数据帧构造函数 https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L299 :
elif isinstance(data, (list, types.GeneratorType)):
if isinstance(data, types.GeneratorType):
data = list(data)
if len(data) > 0:
if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
if is_named_tuple(data[0]) and columns is None:
columns = data[0]._fields
arrays, columns = _to_arrays(data, columns, dtype=dtype)
这里它测试传递数据的类型,因为它类似于列表,然后尝试测试每个元素的类型,它不期望包含 np 数组的列表,所以它来到这里:
def _to_arrays(data, columns, coerce_float=False, dtype=None):
"""
Return list of arrays, columns
"""
if isinstance(data, DataFrame):
if columns is not None:
arrays = [data._ixs(i, axis=1).values
for i, col in enumerate(data.columns) if col in columns]
else:
columns = data.columns
arrays = [data._ixs(i, axis=1).values for i in range(len(columns))]
return arrays, columns
if not len(data):
if isinstance(data, np.ndarray):
columns = data.dtype.names
if columns is not None:
return [[]] * len(columns), columns
return [], [] # columns if columns is not None else []
if isinstance(data[0], (list, tuple)):
return _list_to_arrays(data, columns, coerce_float=coerce_float,
dtype=dtype)
然后在这里:
def _list_to_arrays(data, columns, coerce_float=False, dtype=None):
if len(data) > 0 and isinstance(data[0], tuple):
content = list(lib.to_object_array_tuples(data).T)
else:
# list of lists
content = list(lib.to_object_array(data).T)
return _convert_object_array(content, columns, dtype=dtype,
coerce_float=coerce_float)
最后在这里:
def _convert_object_array(content, columns, coerce_float=False, dtype=None):
if columns is None:
columns = _default_index(len(content))
else:
if len(columns) != len(content): # pragma: no cover
# caller's responsibility to check for this...
raise AssertionError('%d columns passed, passed data had %s '
'columns' % (len(columns), len(content)))
# provide soft conversion of object dtypes
def convert(arr):
if dtype != object and dtype != np.object:
arr = lib.maybe_convert_objects(arr, try_float=coerce_float)
arr = _possibly_cast_to_datetime(arr, dtype)
return arr
arrays = [convert(arr) for arr in content]
return arrays, columns
您可以看到它执行的构造没有优化,它本质上只是迭代每个元素,转换它(这将复制它)并返回一个数组列表。
对于另一条路径,由于 np 数组形状和 dtypes 对 pandas 更友好,它可以查看数据或根据需要进行复制,但它已经知道足够的信息来优化构造