由 40 个不同字符组成的 400 个字符可以压缩多少?

2024-01-12

我有一个 400 个字符的字符串,由 40 个可能的字符组成。这40个字符是a-z A-N。在这个例子中我有:

nnnnnnnnnnnnnnnnnnnnnnnnunnnnnnnnnnnnuuunnnuxunnnnnnnnnnuxddnnnuxduuuuuunnnnuxddnnnuxddddddunnnnuxddnuuuxdddddduuuuuuxddnnnuxduuudducuccuxddnnnuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduiuiiuxddnnnuxdddddducuccuxddnnnuxdddddduiuiiuxddnnnuxduuudducuccuxddnuuuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduuuuuuxddnnnuxxxxxxxunnnnuxxxnnnnuuuuuuuunnnnnuuunnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

我需要使用 URL 安全字符制作压缩版本。我在用a-z A-N and 0-9 - . _ ~ ( ) ' ! * : @ . ;表示原始字符串中字符重复的次数,最多 23 次。如果某个字符重复超过 23 次,则该字符和数字表示形式将添加到先前压缩的字符串中。

压缩后,结果字符串为 246 个字符:

n;un-u1n1uxun8uxd0n1uxdu4n2uxd0n1uxd4un2uxd0nu1xd4u4xd0n1uxdu1d0ucuc0uxd0n1uxdunod0uiui0uxd0n1uxduo0d0ucuc0uxd0n1uxd4uiui0uxd0n1uxd4ucuc0uxd0n1uxd4uiui0uxd0n1uxdu1d0ucuc0uxd0nu1xdunod0uiui0uxd0n1uxduo0d0ucuc0uxd0n1uxd4u4xd0n1ux5un2ux1n2u6n3u1n;n(

压缩原始字符串的Javascript函数是:

// 40 single URL-safe characters 
let singleCharList = [
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h",
"i",
"j",
"k",
"l",
"m",
"n",
"o",
"p",
"q",
"r",
"s",
"t",
"u",
"v",
"w",
"x",
"y",
"z",
"A",
"B",
"C",
"D",
"E",
"F",
"G",
"H",
"I",
"J",
"K",
"L",
"M",
"N",
]

// 23 single URL-safe characters for counting (0 is actually 1)
let singleCharCountList = [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"-",
".",
"_",
"~",
"(",
")",
"'",
"!",
"*",
":",
"@",
",",
";",
]

function compressString(stringToCompress) {
  var compressedString = ""
  var lastChar = ""
  var inARowCount = 0
  var maxInARowCount = singleCharCountList.length
  for (i = 0; i < stringToCompress.length + 1; ++i) {
let currentChar = stringToCompress.charAt(i)
if (currentChar === lastChar && inARowCount < maxInARowCount) {
  inARowCount += 1
} else {
  if (inARowCount > 0) {
    singleCharCount = singleCharCountList[inARowCount - 1]
    compressedString = compressedString + lastChar + singleCharCount
  } else {
    compressedString = compressedString + lastChar
  }
  inARowCount = 0
}
lastChar = currentChar
  }
  return compressedString
}

// To inflate the string back to its original 400 characters, I have: 

function inflateString(stringToInflate) {
  var inflatedString = ""
  var lastChar = ""
  for (i = 0; i < stringToInflate.length; ++i) {
let currentChar = stringToInflate.charAt(i)
// currentChar is part of singleCharCountList 
// and lastChar is part of singleCharList
if (singleCharCountList.indexOf(currentChar) > -1 &&
  singleCharList.indexOf(lastChar) > -1) {
  var loop = singleCharCountList.indexOf(currentChar) + 1
  for (var j = 0; j < loop; ++j) {
    inflatedString = inflatedString + lastChar
  }
}
// currentChar is part of singleCharList
// and lastChar is part of singleCharCountList
else if (singleCharList.indexOf(currentChar) > -1 &&
  singleCharCountList.indexOf(lastChar) > -1) {
  inflatedString = inflatedString + currentChar
}
// currentChar is part of singleCharList
// and lastChar is part of singleCharList
else if (singleCharList.indexOf(currentChar) > -1 &&
  singleCharList.indexOf(lastChar) > -1) {
  inflatedString = inflatedString + currentChar
}
// current char is part of singleCharList
// and lastChar is not part of singleCharList or singleCharCountList
else if (singleCharList.indexOf(currentChar) > -1 &&
  singleCharList.indexOf(lastChar) == -1 &&
  singleCharCountList.indexOf(lastChar) == -1) {
  inflatedString = inflatedString + currentChar
}
lastChar = currentChar
  }
  return inflatedString
}


const str = 'nnnnnnnnnnnnnnnnnnnnnnnnunnnnnnnnnnnnuuunnnuxunnnnnnnnnnuxddnnnuxduuuuuunnnnuxddnnnuxddddddunnnnuxddnuuuxdddddduuuuuuxddnnnuxduuudducuccuxddnnnuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduiuiiuxddnnnuxdddddducuccuxddnnnuxdddddduiuiiuxddnnnuxduuudducuccuxddnuuuxdunodduiuiiuxddnnnuxduoodducuccuxddnnnuxdddddduuuuuuxddnnnuxxxxxxxunnnnuxxxnnnnuuuuuuuunnnnnuuunnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn';

const compressed = compressString(str);
console.log(compressed.length);
const inflate = inflateString(compressed)
console.log(str.length,inflateString.length)

我可以做得更好吗?在我的示例中,我对 246 感到比较满意,但这可能会根据原始内容而有很大差异。


首先使用标准压缩工具(例如 zlib)压缩字符串。您的示例使用 zlib 从 400 字节压缩到 93 字节。然后将二进制结果编码为 75 个字符的限制集,这会将其扩展约 30%,从而产生约 120 个字节。与原来的 400 相比仍显着压缩。

这种编码可以通过多种方式完成,但最快的可能是反向霍夫曼编码,其中使用二进制来表示 75 个符号的平坦概率分布的霍夫曼代码序列。最终得到 53 个 6 位代码和 22 个 7 位代码。这给出了每个符号 6.29 位,接近每个符号的最佳 6.23 位。

开发自己的、不合标准的压缩方法,并且在游程编码方面的尝试很薄弱,这是浪费时间。

本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

由 40 个不同字符组成的 400 个字符可以压缩多少? 的相关文章

随机推荐