From the comments, we see the function being tested is essentially:

for (int i = 0; i < N; ++i)
    D[i] = A[i] * b + C[i];

where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis we can use for a tolerance when testing this function. First, though, I would like to point out that we can design the test so that no error occurs at all. We can choose values for A[i], b, C[i], and D[i] such that all results, both final and intermediate, are exactly representable, and no rounding error occurs. Obviously, this does not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: does it execute instructions that compute the desired function? Merely choosing values that would reveal any failure to use the correct data, to add, to multiply, or to store to the correct location suffices to reveal bugs in the function. We trust the hardware to perform floating-point arithmetic correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, set each A[i] to various small integers, and set each C[i] to various small integers multiplied by b. I could detail the limits on these values more precisely if necessary. Then all results would be exact, and any need to allow a tolerance in comparisons would vanish.
That aside, let us proceed with the error analysis.

The goal is to detect errors in the implementation of the function. For this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are looking for almost always cause large errors: the wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always substantially different from the expected result.

The question then is how much error should we tolerate? Because bugs generally cause large errors, we can set the tolerance quite high. In floating-point, however, "high" is still relative. An error of one million is small compared to values in the trillions, but it is far too large to reveal bugs when the input values themselves are small. So we ought to do at least some analysis to decide on a level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2^−24, provided all the values are in the normal range of the floating-point format. (2^−24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also compute the multiply and add in IEEE-754 32-bit binary floating-point. If it did, the results would be identical to those of the function being tested, and there would be no need to allow any tolerance in the comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2^−53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2^−24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1−e2−e3−e2•e3)+c•(e1−e2−e3−e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2^−24+2^−53+2^−48)+|c|•(2•2^−24+2^−53+2^−77).

Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2^−24+|c|•2.001•2^−24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as computed by either the function under test or the test program. This means that, in general, testing with a tolerance relative to the final values computed by the function under test or the test program is wrong. The proper form of the test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
// Report error here.
In summary, this gives us a tolerance that is larger than any possible difference caused by floating-point rounding, so it will never give us a false positive (reporting that the function under test is broken when it is not). Yet it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (failing to report an actual bug).

(Note that there are also rounding errors in computing the tolerance, but they are smaller than the slack I have allowed by using 0.001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN because of a bug, any ordered comparison involving it yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)