count character frequency

Cool Smith

I need to examine a text file containing Arabic writing and count the frequency of each character. How do I do this in VB.NET? Do I need to detect the file's code page before reading it? Do I need to convert the file's code page before doing anything else? I am using VB.NET 2008.

Tieske8

For my 2cts: I don't think there is any way to detect the code page; you ought to know it before opening the file. Text files don't carry any metadata that tells you about their content. If you know the encoding, you can load the full file with System.IO.File.ReadAllText(path As String, encoding As Encoding) As String. From there you have a properly decoded string and you can start counting characters.
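A minimal sketch of that call in VB.NET. The path is a placeholder, and code page 1256 (Windows Arabic) is only an assumption for illustration; substitute whatever encoding the files were really written in.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module ReadArabicFile
    Sub Main()
        ' Hypothetical path; replace with the real file.
        Dim filePath As String = "C:\test.txt"

        ' Assumption: the file was saved as Windows Arabic (code page 1256).
        Dim arabic As Encoding = Encoding.GetEncoding(1256)

        ' Decode the file's bytes into a .NET string using that encoding.
        Dim content As String = File.ReadAllText(filePath, arabic)

        Console.WriteLine("Read {0} characters", content.Length)
    End Sub
End Module
```

If the wrong encoding is passed, the call still succeeds but the string holds the wrong characters, so the counts would be meaningless.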

If you want something done fast, then do it right (Grissom, CSI) Thanks for your reply, you just acknowledged my existence

Cool Smith

What I want to do is count the frequency of the characters that appear in an Arabic text.

Tieske8

Have you tried it? Or can you be more specific? A string in .NET is not just a list of bytes. In the file, every character takes one or more bytes depending on the encoding used. The method provided will read the file into a properly decoded string. All you have to do is traverse the string and count the characters.


Dave Kreskowiak

All you have to do is read the file into a String, then iterate over the characters in the string and add them to a Dictionary collection. Since Dictionary is a key/value pair collection, the "key" will be the character you are looking at. The "value" will be the count of those characters. When you go to add the character to the collection, you first see if it is already there, and if so, get its value and increment it by one. If not, add the new key with a value of 1. Move on to the next character...
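A rough VB.NET sketch of that loop, written against VB 2008 syntax. The module and function names are made up for illustration, and the sample string in Main stands in for the text read from the file.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module CharacterFrequency
    ' Counts how often each Char occurs in the given text.
    Function CountCharacters(ByVal text As String) As Dictionary(Of Char, Integer)
        Dim counts As New Dictionary(Of Char, Integer)()
        For Each c As Char In text
            Dim current As Integer
            If counts.TryGetValue(c, current) Then
                ' Already seen: increment the stored count.
                counts(c) = current + 1
            Else
                ' First occurrence: add the key with a count of 1.
                counts.Add(c, 1)
            End If
        Next
        Return counts
    End Function

    Sub Main()
        ' Sample input only; in practice pass the string read from the file.
        Dim counts As Dictionary(Of Char, Integer) = CountCharacters("سلام سلام")
        For Each entry As KeyValuePair(Of Char, Integer) In counts
            ' Print the code point, since the console may not render Arabic.
            Console.WriteLine("U+{0:X4} occurs {1} times", AscW(entry.Key), entry.Value)
        Next
    End Sub
End Module
```

Note that this counts every Char, including spaces and line breaks; filter those out in the loop if only letters matter.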

A guide to posting questions on CodeProject[^]
Dave Kreskowiak

Lost User

Cool Smith wrote:

i need to examine a text file containing Arabic writings

What format is it in? Ideally, it'd be UTF. It's important since the encoding determines how many bytes a single character takes. Download a hex editor and open the text file with it: what do the first bytes look like in hex?
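If you'd rather check those first bytes from code, something like this sketch would do. It only looks for the UTF-8 and UTF-16 byte order marks; the function name and path are placeholders, and a file without a mark can still be in any encoding.

```vbnet
Imports System
Imports System.IO

Module BomCheck
    ' Returns a description of the byte order mark found at the start of the
    ' file, or Nothing if none of the marks checked here is present.
    Function DescribeBom(ByVal filePath As String) As String
        Dim head(2) As Byte   ' three bytes: indexes 0 to 2
        Dim bytesRead As Integer
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            bytesRead = fs.Read(head, 0, head.Length)
        End Using

        If bytesRead >= 3 AndAlso head(0) = &HEF AndAlso head(1) = &HBB AndAlso head(2) = &HBF Then
            Return "UTF-8"
        ElseIf bytesRead >= 2 AndAlso head(0) = &HFF AndAlso head(1) = &HFE Then
            ' A UTF-32 little-endian file also starts with FF FE; not told apart here.
            Return "UTF-16 little-endian"
        ElseIf bytesRead >= 2 AndAlso head(0) = &HFE AndAlso head(1) = &HFF Then
            Return "UTF-16 big-endian"
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Hypothetical path; replace with the real file.
        Dim bom As String = DescribeBom("C:\test.txt")
        If bom Is Nothing Then
            Console.WriteLine("No byte order mark found; the encoding has to be known some other way.")
        Else
            Console.WriteLine("Byte order mark suggests " & bom)
        End If
    End Sub
End Module
```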

Cool Smith wrote:

Do i need to detect the file code-page before reading?

There's no way of detecting it with good precision, but Notepad can take an educated guess. If you have any say in it, then it should be UTF. If you don't, ask which code page was used to write the files. There's a difference between Windows Arabic 1256 and DOS Arabic 864.

Cool Smith wrote:

by counting each character frequency of each character, how do i do this in vb.net?

First, determine the encoding, and read the entire file as a string with that encoding. Then create a dictionary. Loop through the string by eating characters, adding each one to the dictionary as a key, or adding +1 to its value if it's already in the dictionary. When done eating, burp out the results :)

I are Troll :suss:

Cool Smith

What format is it in? Ideally, it'd be UTF. It's important since the encoding determines the length of a single character. Download a HEX-editor and open the textfile with it - what do the first bytes look like in HEX?

The thing is, the software will be examining different text files (*.txt) that all contain Arabic writing. I found code here that can detect the code page of a file, and another piece that can convert between different code pages.

First, determine the encoding, and read the file with that encoding. Then create a dictionary, read the entire file as a string. Loop through the string by eating characters, adding them to the dictionary as the key, or adding +1 to its value if it's already in the dictionary. When done eating, burp out the results

Can you give me pseudocode for this? I don't have any idea how to do it.

Lost User

Cool Smith wrote:

i found code here that can detect the code page of a file

Can you post a link to that article? I haven't read it yet :)

Cool Smith wrote:

can you give me pseudo code for this

It'd go something like this;

using System;
using System.Collections.Generic;
using System.IO;

// A dictionary, used to count the frequencies
Dictionary<char, int> characterCounter = new Dictionary<char, int>();

// we'll read the entire file into a string; pass an Encoding as the
// second argument if the file isn't UTF
string theFile = File.ReadAllText(@"C:\test.txt");

// we'll keep removing characters and process them, until the string is empty
while (theFile.Length > 0)
{
    // get the char at the end of the string
    char currentCharacter = theFile[theFile.Length - 1];

    // remove that thing from the string that holds the file
    theFile = theFile.Remove(theFile.Length - 1, 1);

    // if the dictionary contains our character
    if (characterCounter.ContainsKey(currentCharacter))
    {
        // increase the value of the int
        characterCounter[currentCharacter] = characterCounter[currentCharacter] + 1;
    }
    else
    {
        // it wasn't in the dictionary yet, so it must be the
        // first time that we encounter this character. Add it;
        characterCounter.Add(currentCharacter, 1);
    }
}

// done with counting, now show the results to the user, one line per character
foreach (KeyValuePair<char, int> entry in characterCounter)
{
    textBox1.Text += String.Format("char {0} occurs {1} times", entry.Key, entry.Value) + Environment.NewLine;
}

This could be a bit slow with large files, as it forces .NET to allocate memory each time for a new string. It'd be more efficient if it were a moving frame. That'd go something more like this;

string theFile = File.ReadAllText(@"C:\test.txt");

// this will point to the index of the character that we're processing
int currentPos = 0;

// while the current position is still inside the string;
while (currentPos < theFile.Length)
{
    // fetch the current character from the string, at the current index
    char currentCharacter = theFile[currentPos];

    // increase the index
    currentPos = currentPos + 1;

    // rest of dictionary-code here;
    ...

}


Cool Smith

Here are the links: CodePage File Converter and Detect Encoding for In- and Outgoing Text. I'll try your implementation and get back to you. Besides, I found some hex-to-string code; will it work well for recognizing single characters in a joined character?

Private Function ConvertStringToHex(ByVal MyString As String) As String
    Dim Result As String = vbNullString
    If Len(MyString) = 0 Then
        Result = vbNullString
    Else
        For i As Integer = 0 To Len(MyString.Trim) - 1
            Dim MyChar As String = Mid(MyString.Trim, i + 1, 1)
            Result = Result + Xformat(Hex(Microsoft.VisualBasic.AscW(MyChar)))
        Next
    End If
    Return Result
End Function

Private Function ConvertHexToString(ByVal MyString As String) As String
    Dim Result As String = vbNullString
    If Len(MyString) = 0 Then
        Result = vbNullString
    Else
        For i As Integer = 0 To Len(MyString.Trim) - 1 Step 4
            Dim MyChar As String = Mid(MyString.Trim, i + 1, 4)
            Result = Result + Microsoft.VisualBasic.ChrW(Convert.ToInt32(MyChar, 16))
        Next
    End If
    Return Result
End Function

Function Xformat(ByVal xin As String) As String
    Dim retval As String = xin
    Select Case Len(xin)
        Case Is = 3
            retval = "0" & xin
        Case Is = 2
            retval = "00" & xin
        Case Is = 1
            retval = "000" & xin
    End Select
    Return retval
End Function
End Class

Cool Smith

First, I am using VB.NET, not C#. I tried converting it to VB.NET using http://www.developerfusion.com/tools/convert/csharp-to-vb/ and I get many errors. Can you provide a VB.NET version?

Lost User

Cool Smith wrote:

First, I am using VB.NET, not C#. I tried converting it to VB.NET

You asked for pseudocode, and that's what it is.

Cool Smith wrote:

Can you provide vb.net version?

No, since it's not my job. You could post your code however, and people could have a look. That is, if you explain where you're stuck.
