I give a formula in my book (see pp. 78-79), but if you're looking for a simple one, the point at which the probability of some hash collision reaches about 50% in an n-bit hash is when you hash roughly 2^(n/2) keys. The SHA-1 hash itself is 160 bits, represented as 40 hexadecimal digits, each representing 4 of the 160 bits. Truncating that to 7 hexadecimal digits leaves 28 bits, so you will reach a 50% chance of collision at about 2^14 keys, or 16384 objects. If you constrain the objects to be only commits, that's a pretty decent number of commits, but Git places all objects (commits, trees, annotated tag objects, and blobs) in a single hash-indexed key-value store.
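As a rough check of that rule of thumb, here is a minimal Python sketch using the standard birthday-problem approximation 1 - exp(-k(k-1)/(2N)). The function names are my own, and the approximation assumes hash values are uniformly distributed. It suggests that 2^(n/2) keys actually lands a little under even odds (around 39%), with the 50% point closer to 1.18 times 2^(n/2), so "roughly" is doing some work in the simple formula.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_keys keys share a hash_bits-bit hash."""
    space = 2 ** hash_bits
    pairs = num_keys * (num_keys - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / space)


def keys_for_even_odds(hash_bits: int) -> float:
    """Key count at which the approximation above reaches 50% (treating k-1 as k)."""
    return math.sqrt(2 * math.log(2) * 2 ** hash_bits)


if __name__ == "__main__":
    bits = 7 * 4  # 7 hex digits, 4 bits each
    rule_of_thumb = 2 ** (bits // 2)
    print(rule_of_thumb)
    print(collision_probability(rule_of_thumb, bits))
    print(keys_for_even_odds(bits))
```

Either way the order of magnitude is the same: a 7-digit abbreviation starts to look risky somewhere in the tens of thousands of objects.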
The probability of the hashes of any given pair of keys colliding is just 1 in 2^n, i.e., 1 in 2^28 or 1 out of 268 million. The reason it increases so fast to 50%, as the number of keys grows, is known as the Birthday Paradox or birthday problem. 50% is of course far too scary; with 28 bits, if we want the overall probability to be below 0.1%, we should keep the number of objects below about 730. By going to 32 bits (8-character abbreviations) we quadruple this to about 2930, since the limit grows with the square root of the hash space and 4 extra bits make the space 16 times larger, but that's still not very many objects.
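To check thresholds like these for other abbreviation lengths or risk levels, the same birthday approximation can be inverted. This is only a sketch under the uniform-hash assumption; `max_keys_for_risk` and `pair_collision_probability` are hypothetical helper names, not anything Git provides.

```python
import math


def pair_collision_probability(hash_bits: int) -> float:
    """Chance that one specific pair of keys collides in a hash_bits-bit hash."""
    return 1 / 2 ** hash_bits


def max_keys_for_risk(hash_bits: int, max_probability: float) -> int:
    """Largest key count whose approximate collision probability stays within max_probability."""
    space = 2 ** hash_bits
    # Solve 1 - exp(-k(k-1) / (2N)) <= p for k, i.e. k(k-1) <= budget.
    budget = -2 * space * math.log1p(-max_probability)
    return int((1 + math.sqrt(1 + 4 * budget)) / 2)


if __name__ == "__main__":
    for hex_digits in (7, 8, 10, 12):
        bits = 4 * hex_digits
        print(hex_digits, max_keys_for_risk(bits, 0.001))
```

Each extra hex digit multiplies the hash space by 16, so the allowed number of objects for a fixed risk grows by a factor of 4 per digit.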
By the time you have 16k objects in your store, you probably should use at least 10 hexadecimal digits, giving 2^40 possible hash values and a p-bar value (the probability of no collision) of about .99987794... (about a .012% chance of collision). Nine hex digits gives only 2^36 hash values, producing a p-bar of .99804890... or about a 0.2% chance of collision, which I think is too high.
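The p-bar figures can be reproduced, and the "how many digits do I need" question automated, with a short Python sketch. The names `no_collision_probability` and `min_hex_digits` are illustrative, the 0.1% tolerance in the usage line is just one possible choice, and the lower bound of 4 digits reflects the minimum abbreviation length Git accepts.

```python
import math


def no_collision_probability(num_objects: int, hex_digits: int) -> float:
    """Approximate p-bar: chance that num_objects hashes, cut to hex_digits, are all distinct."""
    space = 16 ** hex_digits
    pairs = num_objects * (num_objects - 1) / 2
    return math.exp(-pairs / space)


def min_hex_digits(num_objects: int, max_probability: float, full_length: int = 40) -> int:
    """Shortest abbreviation whose collision probability is at most max_probability."""
    for digits in range(4, full_length + 1):
        if 1 - no_collision_probability(num_objects, digits) <= max_probability:
            return digits
    return full_length


if __name__ == "__main__":
    print(no_collision_probability(16384, 10))
    print(no_collision_probability(16384, 9))
    print(min_hex_digits(16384, 0.001))
```

With 16384 objects and a 0.1% tolerance this picks 10 digits, in line with the recommendation above.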
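The built-in default works well if you can constrain the disambiguating code to look only at commits, or commit-ish objects, which in Git means commits or annotated tags. (Git does in fact do this in many cases.) Git's internal code for computing the "right" abbreviation length is too lax, though, at least in my opinion. "Lax" because it uses the 50%-probability-of-collision square-root trick in a context where the resulting hash might be used to identify any object.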
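(As noted in the comments, internally Git always uses the full hash. It's only at the non-Git / Git interfaces, e.g., user-facing commands like `git log <hash>` or `git show <hash>`, that you can feed in an abbreviated hash or ask for abbreviated output hashes. Here Git will default to using the 50%-probability-of-collision number for how many characters to show, first estimating the number of objects in the database. If you supply the hash, you pick how much to supply. If you ask Git to supply it, you can still pick how much, using `--abbrev=number`. Note that the absolute minimum is 4: `git log abc` will not treat `abc` as a hash ID, but `git log abcd` will treat `abcd` as a hash ID abbreviation. There's also a very ancient default, of 7 characters, from the Git 1.7-or-so era.)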
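The square-root rule for picking an output length can be modelled in a few lines of Python. This is a sketch of the rule as described here, not Git's actual source: the function name is hypothetical, and the floor of 7 digits and the cap of 40 are assumptions based on the old default and the SHA-1 hash length mentioned above.

```python
def estimated_abbrev_length(object_count: int, minimum: int = 7, full_length: int = 40) -> int:
    """Hex digits at which object_count objects sit near the 2^(n/2) collision point."""
    # Bits needed to count the objects, e.g. 16384 -> 15.
    count_bits = max(object_count, 1).bit_length()
    # About 2^count_bits objects call for a hash of about 2 * count_bits bits;
    # at 4 bits per hex digit that is count_bits / 2 digits, rounded up.
    digits = (count_bits + 1) // 2
    # Assumed floor (old 7-character default) and cap (full SHA-1 length).
    return min(max(digits, minimum), full_length)


if __name__ == "__main__":
    for count in (1_000, 16_384, 1_000_000, 5_000_000):
        print(count, estimated_abbrev_length(count))
```

Because the rule targets the roughly-even-odds point rather than a small collision probability, the lengths it produces are shorter than what the 0.1%-style calculation earlier would suggest for the same store.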