compression
-
hello, I have a file containing only the digits 0..9 (numbers only), without anything else, not even spaces. What is the most efficient algorithm to compress it? Thanks in advance.
At the command level, use a ZIP utility such as WinZip. When programming an application, create a binary file. Assuming all digit values have the same probability, you could:
- create a binary file with the BinaryWriter class; it will be filled with ulong values;
- the first value to write would be the number of digits to be stored;
- then have a loop that reads 19 digits, turns them into a single ulong (use ulong.Parse or TryParse), and outputs that ulong in binary;
- when reaching the end of the input data, which probably did not have a length evenly divisible by 19, just imagine the few missing digits are all zero, and generate the last ulong as you did all the others.

That is it; you have turned N characters (probably bytes in an ASCII text file) into a sequence of N/19*8 bytes, giving a compression factor of 2.37. You could do better if something was known about the distribution of the digits; otherwise that is about the best you could get. :)

Luc Pattyn [Forum Guidelines] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, improve readability, and make me actually look at the code.
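The packing scheme described above can be sketched compactly; here is a Python version (the thread's code is C#, but the idea is language-agnostic, and the helper names here are illustrative). 19 digits always fit in a 64-bit word, since 10^19 - 1 < 2^64:

```python
# Sketch of the scheme above: read 19 digits at a time, parse them as one
# 64-bit integer, and write that integer as 8 binary bytes. The first word
# stores the digit count so trailing zero-padding can be undone.
import struct

def pack_digits(digits: str) -> bytes:
    """Pack a string of decimal digits into 8-byte words, 19 digits each."""
    out = [struct.pack("<Q", len(digits))]  # header: original digit count
    padded = digits + "0" * (-len(digits) % 19)  # pad last group with zeros
    for i in range(0, len(padded), 19):
        out.append(struct.pack("<Q", int(padded[i:i + 19])))
    return b"".join(out)

def unpack_digits(blob: bytes) -> str:
    """Invert pack_digits: recover the original digit string."""
    (count,) = struct.unpack_from("<Q", blob, 0)
    digits = []
    for i in range(8, len(blob), 8):
        (word,) = struct.unpack_from("<Q", blob, i)
        digits.append(str(word).zfill(19))  # zfill restores leading zeros
    return "".join(digits)[:count]
```

Note the zfill when decoding: a group such as "0000000000000000001" parses to the integer 1, so the leading zeros must be restored explicitly.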
-
No, a compression factor of 2.37 is not good at all for that purpose. Just send a text file to a "zip compressed" folder in Windows Explorer and you get a compression ratio of about 10. And of course the text file contains not only digits but other ASCII characters as well, which means it already carries much more information per byte of storage than the digits-only file does. Hence I'd take some common zip algorithm. The "SharpZipLib" library is also available for commercial products (see http://www.sharpdevelop.net/OpenSource/SharpZipLib/). In the Unix world I sometimes see "bz2" compressed files, which are said to be extremely compressed; but I do not know of a library that handles that with .NET.
-
Incorrect. You either believe in magic, or you didn't understand my suggestion. So I created a little proof:

public void generateFiles() {
    using (TextWriter tw = File.CreateText("test.txt")) {
        using (BinaryWriter bw = new BinaryWriter(File.Create("test.dat"))) {
            int DIM = 100 * 1000;
            Random r = new Random();
            for (int i = 0; i < DIM; i++) {
                ulong n = 0;
                for (int j = 0; j < 19; j++) {
                    int d = r.Next(10);
                    tw.Write("0123456789".Substring(d, 1));
                    n = 10 * n + (ulong)d;
                }
                bw.Write(n);
            }
        }
    }
}

and I WinZipped the txt file; the resulting sizes were:
.txt = 1,856 KB
.dat = 782 KB (that is a compression factor of 2.37 indeed)
.zip = 872 KB (that is 2.13)

which tells me WinZip is wasting around 10%. While ZIP normally achieves around a factor of 10, that simply cannot be achieved when the data is really random and has only log2(10) significant bits in each 8-bit byte (if it could, it would store a digit in less than one bit, which is clearly impossible). :)
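The experiment above can be cross-checked with any DEFLATE implementation; here is a quick sketch using Python's zlib (the exact ratio varies with the data and compressor settings, so the numbers differ slightly from WinZip's, but the conclusion is the same: random digits stay far below the zip-typical factor of 10):

```python
# Cross-check: DEFLATE-compress a megabyte of uniformly random decimal
# digits and compare against the entropy limit of 8 / log2(10) ~= 2.408.
import math
import random
import zlib

random.seed(42)  # reproducible run
digits = "".join(random.choice("0123456789") for _ in range(1_000_000))
raw = digits.encode("ascii")

compressed = zlib.compress(raw, level=9)
ratio = len(raw) / len(compressed)
limit = 8 / math.log2(10)  # theoretical limit for uniform random digits

print(f"DEFLATE ratio: {ratio:.2f}, theoretical limit: {limit:.3f}")
```

No general-purpose compressor can beat the entropy limit on truly random input, which is exactly the point being made above.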
-
That is, the high compression ratio of Windows zip on texts like log files is caused by the repetitive structure of the contents, even though some 6 out of 8 bits per byte are used for the text. With log2(10) significant bits per byte, the theoretical limit would be a compression factor of 2.408, so your idea already achieves some 98.5% of it. Let's see how much more number crunching and execution time would be required to get it one percent better. :)
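The two figures above follow directly from the entropy of a uniform random digit; a quick arithmetic check (in Python, as a neutral calculator):

```python
# Arithmetic behind the figures above: a uniform random digit carries
# log2(10) bits of information but occupies an 8-bit byte.
import math

bits_per_digit = math.log2(10)   # ~3.322 significant bits per byte
limit = 8 / bits_per_digit       # theoretical compression factor, ~2.408
achieved = 19 / 8                # the 19-digits-per-ulong scheme: 2.375
fraction = achieved / limit      # how close the simple scheme gets

print(f"limit = {limit:.3f}, scheme = {achieved:.3f} ({fraction:.1%} of limit)")
```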
-
One could perform my method with a wider integer; with a 256-bit integer the compression factor would be 77/32 = 2.406. The BigInteger class could be helpful to achieve that and more. I didn't mention it earlier as I don't care about the last few percent. :)
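The wider-integer variant can be sketched the same way; Python's arbitrary-precision int stands in for .NET's BigInteger here, and the helper names are illustrative. 77 digits fit in 256 bits because 10^77 < 2^256:

```python
# The wider-integer variant: 77 decimal digits fit in a 256-bit word
# (10**77 < 2**256), so 77 input bytes become 32 output bytes,
# a factor of 77/32 = 2.40625.
assert 10**77 < 2**256 < 10**78  # 77 is the widest group that always fits

def pack77(digits: str) -> bytes:
    """Pack exactly 77 decimal digits into 32 big-endian bytes."""
    assert len(digits) == 77
    return int(digits).to_bytes(32, "big")

def unpack77(word: bytes) -> str:
    """Recover the 77 digits, restoring leading zeros."""
    return str(int.from_bytes(word, "big")).zfill(77)
```

As with the 19-digit version, a real implementation would also store the total digit count so that zero-padding of the final group can be undone.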