Binary or Non-Binary File

DotNetDominator

Is there a way in C# or C++ to determine whether a given file is binary or non-binary? Some forums suggest checking the bytes of the file and looking for a null byte. Is there any other way? Any help would be appreciated.

Judah Gabriel Himango

Perhaps this API will help: IsTextUnicode


Luc Pattyn

Hi, all files are binary; some contain text (in ASCII, Unicode, whatever), some contain an image or some other kind of data. So you might have to clarify your question. :)


DotNetDominator

Of course, all files are binary, but I want to differentiate files based on the printable characters they contain. Basically, I need it for a utility which would compare two files and write the differences between them to a third file, or update one file by comparing it to the others. I can only tell this much. Since such a comparison is meaningless for "binary" files like DLL, JAR etc., I wanted to identify them before I compare them. I can't change the utility I will use for the comparison.

I wrote the following method, which I think would work fine. Do you think it would work across all character sets? I am just reading the file byte by byte and looking for a byte which is zero. Then I know that the file is binary.

    static bool isBinary(BinaryReader binaryReader)
    {
        bool nullByteFound = false;
        long length = binaryReader.BaseStream.Length;

        // Scan from the reader's current position until a zero byte or end of stream.
        for (long i = 0; i < length; i++)
        {
            if (binaryReader.ReadByte() == 0)
            {
                nullByteFound = true;
                break;
            }
        }

        Console.WriteLine("Null= " + nullByteFound);
        return nullByteFound;
    }

The other API, IsTextUnicode, may also help in solving the problem if I retrieve the IS_TEXT_UNICODE_NULL_BYTES flag. Thanks all for your help on this.

Luc Pattyn

Hi, if a text file is encoded using ASCII, ANSI, or some other 8-bit character set, then zero-testing looks acceptable. If a text file is encoded using some 16-bit encoding scheme, then zero bytes can occur in text files (e.g. the chars 0x0100, 0x0200, etc.). You could check the first few bytes of the file; Unicode encodings (UTF-8, UTF-16) use special values here (byte order marks). If these match, you might assume it is text and skip further testing (and once in a while that assumption will be wrong); if they don't match, you could assume it is an 8-bit encoding and do the zero test. Whatever you do, since 100% confidence will not be achievable, I see no point in checking more than a few hundred bytes before deciding text/no text. :)
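
The heuristic described above could be sketched in C++ roughly as follows, since the question asked for C# or C++. This is only an illustration under stated assumptions, not a definitive detector: the names hasPrefix, startsWithBom, looksBinary and fileLooksBinary are hypothetical, and the 512-byte sample size is an arbitrary stand-in for "a few hundred bytes". A file that starts with a byte order mark is assumed to be text; otherwise any zero byte in the sample marks it as binary.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <initializer_list>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: true if `data` begins with the bytes in `sig`.
static bool hasPrefix(const std::vector<unsigned char>& data,
                      std::initializer_list<unsigned char> sig) {
    return data.size() >= sig.size() &&
           std::equal(sig.begin(), sig.end(), data.begin());
}

// True if the sample starts with a UTF-8, UTF-16 or UTF-32 byte order mark.
bool startsWithBom(const std::vector<unsigned char>& data) {
    return hasPrefix(data, {0xEF, 0xBB, 0xBF})         // UTF-8
        || hasPrefix(data, {0xFF, 0xFE})               // UTF-16 LE (also prefix of UTF-32 LE)
        || hasPrefix(data, {0xFE, 0xFF})               // UTF-16 BE
        || hasPrefix(data, {0x00, 0x00, 0xFE, 0xFF});  // UTF-32 BE
}

// Heuristic on an in-memory sample: a BOM means "assume text";
// otherwise assume an 8-bit encoding and apply the zero-byte test.
bool looksBinary(const std::vector<unsigned char>& sample) {
    if (startsWithBom(sample)) {
        return false;
    }
    return std::find(sample.begin(), sample.end(),
                     static_cast<unsigned char>(0)) != sample.end();
}

// Reads at most `sampleSize` bytes from the start of the file and classifies them.
// The default of 512 bytes is an assumption, not a recommendation from the thread.
bool fileLooksBinary(const std::string& path, std::size_t sampleSize = 512) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<unsigned char> sample(sampleSize);
    in.read(reinterpret_cast<char*>(sample.data()),
            static_cast<std::streamsize>(sample.size()));
    sample.resize(static_cast<std::size_t>(in.gcount()));  // keep only the bytes actually read
    return looksBinary(sample);
}
```

Calling fileLooksBinary on each input before handing it to the comparison utility would let the DLL and JAR cases be skipped. As noted, the result is a guess: UTF-16 text saved without a byte order mark will be reported as binary, and a binary file with no zero byte in its first 512 bytes will be reported as text.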
