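My main question is whether repacking the repository is likely to make any meaningful impact on large binary files.
It depends on their contents. For the files you've outlined specifically: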
I see .zip, tgz, and .simg files mostly.
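Zip files and tgz (gzipped tar archive) files are already compressed and have terrible (i.e., high) Shannon entropy https://en.wikipedia.org/wiki/Entropy_(information_theory) values (terrible for Git, that is) and will not compress against each other. The .simg files are probably (I have to guess here) Singularity disk image files http://singularity.lbl.gov/docs-recipes; whether and how they are compressed, I don't know, but I would assume they are. (An easy test is to feed one to a compressor, e.g., gzip, and see if it shrinks.)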
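The shrink test can be sketched in Python as a rough check; this is not anything Git itself runs. The function names and the 0.95 cutoff are assumptions chosen for illustration, and the random bytes only stand in for high-entropy file content.

```python
import gzip
import os


def gzip_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (1.0 for empty input)."""
    if not data:
        return 1.0
    return len(gzip.compress(data)) / len(data)


def looks_already_compressed(data: bytes, cutoff: float = 0.95) -> bool:
    """Guess that data is already compressed if gzip barely shrinks it.

    The 0.95 cutoff is an arbitrary, illustrative threshold.
    """
    return gzip_ratio(data) >= cutoff


# Repetitive source-like text shrinks a lot.
text = b"int main(void) { return 0; }\n" * 2000
# Random bytes stand in for an already-compressed (high-entropy) file.
random_like = os.urandom(64 * 1024)

print(f"text ratio:        {gzip_ratio(text):.3f}")
print(f"random-like ratio: {gzip_ratio(random_like):.3f}")
```

To try it on real data, read one of the .zip, tgz, or .simg files in binary mode and pass the bytes to `gzip_ratio`; a ratio close to (or above) 1.0 suggests the file is already compressed.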
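As such, I expect that the raw code will have a lot of overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats are compressed already, right?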
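Precisely. Storing them uncompressed in Git would thus, paradoxically, result in far greater compression in the end, because Git could then delta-compress the overlapping contents against each other. (But the packing could require significant amounts of memory.)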
If [this is probably futile], I'd skip them as was suggested here https://stackoverflow.com/a/8686576.
This would be my first impulse here. :-)
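I admit I don't understand the various git options discussed on the linked question too well. Nor do I really understand what the --window and --depth flags are doing in git repack.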
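The various limits are confusing (and profuse). It's also important to realize that they don't get copied on clone, since they are in .git/config, which is not a committed file, so new clones won't pick them up. A committed .gitattributes file is copied on clone, and new clones will continue to avoid packing unpackable files, so it's the better approach here.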
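As a minimal sketch of the .gitattributes approach, assuming the suggestion referred to is Git's `delta` attribute (written `-delta` to unset it, which tells Git not to attempt delta compression for matching paths), the lines could be generated like this. The helper name `gitattributes_lines` is hypothetical, and the extension list simply mirrors the file types mentioned in the question.

```python
def gitattributes_lines(extensions):
    """Build .gitattributes lines that unset the delta attribute per extension."""
    lines = []
    for ext in extensions:
        # Accept "zip", ".zip", or "*.zip" and normalize to a bare extension.
        bare = ext.lstrip("*.")
        lines.append(f"*.{bare} -delta")
    return lines


# File types taken from the question; adjust to match the repository.
content = "\n".join(gitattributes_lines([".zip", "tgz", "*.simg"])) + "\n"
print(content, end="")
```

The resulting text would go into a .gitattributes file that is committed to the repository, so that it travels with every clone.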
(If you care to dive into the details, you will find some in the Git technical documentation https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt. This does not discuss precisely what the window sizes are about, but it has to do with how much memory Git uses to memory-map object data when selecting objects that might compress well against each other. There are two: one for each individual mmap on one pack file, and one for the total aggregate mmap on all pack files. Not mentioned on your link: core.deltaBaseCacheLimit, which is how much memory will be used to hold delta bases. To understand this, though, you need to grok delta compression and delta chains (see footnote 1) and read that same technical documentation. Note that Git will default to not attempting to delta-compress any file object whose size exceeds core.bigFileThreshold. The various pack.* controls are a bit more complex: the packing is done multi-threaded to take advantage of all your CPUs if possible, and each thread can use a lot of memory. Limiting the number of threads limits total memory use: if one thread is going to use 256 MB, 8 threads are likely to use 8*256 = 2048 MB, or 2 GB. The bitmaps mainly speed up fetching from busy servers.)
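The thread arithmetic above can be written out as a small back-of-the-envelope sketch. Both function names are hypothetical, and the model (total memory equals threads times per-thread memory) is a simplifying assumption; real usage depends on the objects being packed.

```python
def estimated_pack_memory_mb(threads: int, per_thread_mb: int) -> int:
    """Rough upper estimate: every thread uses its full per-thread amount."""
    return threads * per_thread_mb


def max_threads_for_budget(budget_mb: int, per_thread_mb: int) -> int:
    """Largest thread count whose estimate fits the budget (at least 1)."""
    return max(1, budget_mb // per_thread_mb)


print(estimated_pack_memory_mb(8, 256))    # the 8 * 256 MB example from the text
print(max_threads_for_budget(1024, 256))   # threads that fit in a 1024 MB budget
```

A number obtained this way could inform a setting such as pack.threads, though it is only an estimate.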
1. They're not that complicated: a delta chain occurs when one object says "take object XYZ and apply these changes", but object XYZ itself says "take object PreXYZ and apply these changes". Object PreXYZ can in turn refer to yet another object, and so on. The delta base is the object at the bottom of this list.
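A toy model may make the chain idea concrete. This is only an illustration: Git's real delta format is different (a binary sequence of copy and insert instructions), and the `ToyObject` type, the find-and-replace "edits", and the object named "Top" are all assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyObject:
    data: bytes = b""                             # full content, used when base is None
    base: Optional[str] = None                    # object this delta is applied to
    edits: Tuple[Tuple[bytes, bytes], ...] = ()   # (old, new) replacements


def resolve(name: str, store: Dict[str, ToyObject]) -> bytes:
    """Rebuild an object's content by resolving its base first, then its edits."""
    obj = store[name]
    if obj.base is None:
        return obj.data
    content = resolve(obj.base, store)
    for old, new in obj.edits:
        content = content.replace(old, new)
    return content


def chain(name: str, store: Dict[str, ToyObject]) -> List[str]:
    """List the delta chain from the named object down to its delta base."""
    names = [name]
    while store[names[-1]].base is not None:
        names.append(store[names[-1]].base)
    return names


store = {
    "PreXYZ": ToyObject(data=b"hello world"),
    "XYZ": ToyObject(base="PreXYZ", edits=((b"world", b"there"),)),
    "Top": ToyObject(base="XYZ", edits=((b"hello", b"HELLO"),)),
}

print(chain("Top", store))    # ['Top', 'XYZ', 'PreXYZ']
print(resolve("Top", store))  # b'HELLO there'
```

The last name in the chain is the delta base; rebuilding "Top" requires rebuilding every object below it first, which is why holding bases in memory matters.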