Determining Language
-
I am writing an application in C++ to monitor chat in multilingual chat rooms. I would like to be able to determine what language the chat is in, since some rooms allow only certain languages. I was thinking of using Unicode ranges to check whether the characters in the chat are in the allowed set, but this becomes difficult with the Asian languages, as there are over 200 different language sets defined. Does anyone know of an easier method of determining what language is being used?
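For reference, the Unicode-range idea might look roughly like the sketch below. It assumes the chat arrives as UTF-8. The `Script` enum and the helper names `decodeUtf8`, `classify`, and `usesOnlyAllowedScripts` are made up for illustration, and only a handful of well-known blocks are covered, so treat it as a starting point rather than a complete table.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Coarse script buckets (hypothetical; real Unicode defines many more scripts).
enum class Script { Common, Latin, Greek, Cyrillic, Arabic, Kana, Han, Hangul, Other };

// Decodes UTF-8 into code points. Malformed bytes become U+FFFD.
// Simplification: overlong encodings and surrogates are not rejected.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        int extra = 0;          // number of continuation bytes expected
        char32_t cp = 0xFFFD;   // default for stray or invalid lead bytes
        if (b < 0x80) { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra; ++k) {
            if (i >= s.size() || (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;    // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(cp);
    }
    return out;
}

// Maps a code point to a coarse script bucket using Unicode block ranges.
Script classify(char32_t cp) {
    if ((cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z') ||
        (cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7))
        return Script::Latin;
    if (cp < 0x00C0) return Script::Common;                       // digits, spaces, ASCII/Latin-1 symbols
    if (cp >= 0x2000 && cp <= 0x206F) return Script::Common;      // General Punctuation
    if (cp >= 0x3000 && cp <= 0x303F) return Script::Common;      // CJK Symbols and Punctuation
    if (cp >= 0x0370 && cp <= 0x03FF) return Script::Greek;
    if (cp >= 0x0400 && cp <= 0x04FF) return Script::Cyrillic;
    if (cp >= 0x0600 && cp <= 0x06FF) return Script::Arabic;
    if (cp >= 0x3040 && cp <= 0x30FF) return Script::Kana;        // Hiragana and Katakana
    if (cp >= 0x4E00 && cp <= 0x9FFF) return Script::Han;         // CJK Unified Ideographs
    if (cp >= 0xAC00 && cp <= 0xD7A3) return Script::Hangul;      // Hangul syllables
    return Script::Other;                                         // everything else, emoji included
}

// True when every non-Common character belongs to one of the allowed scripts.
bool usesOnlyAllowedScripts(const std::string& utf8, const std::set<Script>& allowed) {
    for (char32_t cp : decodeUtf8(utf8)) {
        const Script s = classify(cp);
        if (s != Script::Common && allowed.count(s) == 0) return false;
    }
    return true;
}
```

A room restricted to Russian could then call `usesOnlyAllowedScripts(line, {Script::Cyrillic})`. Note that this only identifies the script, not the language: English, French, and German all come out as Latin, and Han characters are shared by Chinese and Japanese.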
-
Look up words in a tiny dictionary of commonly used language-specific words. Examples for English would be: I, you, he, she, it, a, the, have, has, am, are, is... (you won't need more than a dozen or so per language). Making a mistake is of course possible, but if you scan enough input, chances are you'll determine the language accurately. Of course, the issue must have been looked into by many philologists, so try searching the web. If you wish to avoid this method, all you can do is check character codes, unless you want to mess with checking the current keyboard layout. I've never tried the latter, but it looks like a headache and guarantees absolutely nothing...
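A minimal sketch of the dictionary lookup, to make the idea concrete. The English list is the one given above; the German and French lists, the language codes, and the function names `commonWords`, `normalize`, and `guessLanguage` are assumptions added for illustration. It only lower-cases ASCII letters, so accented words would need proper Unicode case folding.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

using WordLists = std::map<std::string, std::set<std::string>>;

// Hypothetical word lists; a real table would need tuning per language.
const WordLists& commonWords() {
    static const WordLists lists = {
        {"en", {"i", "you", "he", "she", "it", "a", "the", "have", "has", "am", "are", "is"}},
        {"de", {"ich", "du", "er", "sie", "es", "der", "die", "das", "und", "ist", "nicht", "ein"}},
        {"fr", {"je", "tu", "il", "elle", "le", "la", "les", "et", "est", "un", "une", "pas"}},
    };
    return lists;
}

// Strips ASCII punctuation from both ends of a token and lower-cases ASCII letters.
std::string normalize(std::string token) {
    auto isPunct = [](unsigned char c) { return std::ispunct(c) != 0; };
    while (!token.empty() && isPunct(static_cast<unsigned char>(token.back()))) token.pop_back();
    while (!token.empty() && isPunct(static_cast<unsigned char>(token.front()))) token.erase(token.begin());
    std::transform(token.begin(), token.end(), token.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return token;
}

// Returns the language code with the most word hits,
// or an empty string when nothing matched or the top score is tied.
std::string guessLanguage(const std::string& text) {
    std::map<std::string, int> hits;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::string word = normalize(token);
        for (const auto& entry : commonWords())
            if (entry.second.count(word)) ++hits[entry.first];
    }
    std::string best;
    int bestCount = 0;
    bool tie = false;
    for (const auto& entry : hits) {
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
            tie = false;
        } else if (entry.second == bestCount) {
            tie = true;
        }
    }
    return tie ? std::string() : best;
}
```

Returning an empty string on a tie or on no match lets the caller treat the text as undecided instead of guessing. Feeding in more text gives the counts more to work with.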
-
Thanks. I had thought of your first method, but I think certain languages use similar words, so the accuracy wouldn't be that good, especially for chat, where maybe only one of the words I am searching for would be used per chat line. Haven't had much luck on the web either :(