From the comments, we see the function being tested is essentially:

for (int i = 0; i < N; ++i)
    D[i] = A[i] * b + C[i];

where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis we can use for a tolerance when testing this function. First, though, I would like to point out that we can design the test so that no error occurs at all. We can choose values for A[i], b, C[i], and D[i] such that all results, both final and intermediate, are exactly representable, and no rounding error occurs. Obviously, this does not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: does it execute instructions that compute the desired function? Merely choosing values that would reveal any failure to use the correct data, to add, to multiply, or to store to the correct location suffices to reveal bugs in the function. We trust the hardware to perform floating-point arithmetic correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, set each A[i] to various small integers, and set each C[i] to various small integers multiplied by b. I could detail the limits on these values more precisely if necessary. Then all results would be exact, and any need to allow a tolerance in comparisons would vanish.
That aside, let us proceed with the error analysis.

The goal is to detect errors in the implementation of the function. For this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are looking for almost always cause large errors: the wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always substantially different from the expected result.

The question then is how much error should we tolerate? Because bugs generally cause large errors, we can set the tolerance quite high. In floating-point, however, "high" is still relative. An error of one million is small compared to values in the trillions, but it is far too large to reveal bugs when the input values themselves are small. So we ought to do at least some analysis to decide on a level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2^−24, provided all the values are in the normal range of the floating-point format. (2^−24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also compute the multiply and add in IEEE-754 32-bit binary floating-point. If it did, the results would be identical to those of the function being tested, and there would be no need to allow any tolerance in the comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2^−53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2^−24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1−e2−e3−e2•e3)+c•(e1−e2−e3−e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2^−24+2^−53+2^−48)+|c|•(2•2^−24+2^−53+2^−77).

Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2^−24+|c|•2.001•2^−24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as computed by either the function under test or the test program. This means that, in general, testing with a tolerance relative to the final values computed by the function under test or the test program is wrong. The proper form of the test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
// Report error here.
In summary, this gives us a tolerance that is larger than any possible difference caused by floating-point rounding, so it will never give us a false positive (reporting that the function under test is broken when it is not). Yet it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (failing to report an actual bug).

(Note that there are also rounding errors in computing the tolerance, but they are smaller than the slack I have allowed by using 0.001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN because of a bug, any ordered comparison involving it yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)