How to detect UTF-8-based encoded strings
-
Hi A customer of asked us to build him a multi-language based support VB6 scraper, for which we had the need to detect UTF-8 based encoded strings to decode it later for proper displaying in application UI. It's necessary to point out that this need arises based on VB6 limitations to natively support UTF-8 in its controls, contrary to what it happens in .NET where you can tell a control that it should expect UTF-8 encoding. VB6 natively supports ISO 8859-1 and/or Windows-1252 encodings only, for which textboxes, dropdowns, listview controls, others can't be defined to natively support/expect UTF-8 as you can do in .NET considering what we just explained; so we would see weird symbols such as é, è among others, making it a whole mess at the time of displaying. So, next function contains whole UTF-8 encoded punctuation marks and symbols from languages like Spanish, Italian, German, Portuguese, French and others, based on an excellent UTF-8 based list we got from this link - Ref. http://home.telfort.nl/~t876506/utf8tbl.html Basically, the function compares if each and one of the listed UTF-8 encoded sentences, separated by | (pipe) are found in our passed string making a substring search first. Whether it's not found, it makes an alternative ASCII value based search to get a match. Say, a string like "Societé" (Society in english) would return FALSE through calling isUTF8("Societé") while it would return TRUE when calling isUTF8("SocietÈ") since È is the UTF-8 encoded representation of é. Once you got it TRUE or FALSE, you can decode the string through DecodeUTF8() function for properly displaying it, a function we found somewhere else time ago and also included in this post. [code] Function isUTF8(ByVal ptstr As String) Dim tUTFencoded As String Dim tUTFencodedaux Dim tUTFencodedASCII As String Dim ptstrASCII As String Dim iaux, iaux2 As Integer Dim ffound As Boolean ffound = False ptstrASCII = "" For iaux = 1 To Len(ptstr) ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|" Next tUTFencoded = "Ä|Ã…|Ç|É|Ñ|Ö|ÃŒ|á|Ã|â|ä|ã|Ã¥|ç|é|è|ê|ë|Ã|ì|î|ï|ñ|ó|ò|ô|ö|õ|ú|ù|û|ü|â€|°|¢|£|§|•|¶|ß|®|©|â„¢|´|¨|â‰|Æ|Ø|∞|±|≤|≥|Â¥|µ|∂|∑|âˆ|Ï€|∫|ª|º|Ω|æ|ø|¿|¡|¬|√|Æ’|≈|∆|«|»|…|Â|À|Ã|Õ|Å’|Å“|–|—|“|â€|‘|’|÷|â—Š|ÿ|Ÿ|â„|€|‹|›|ï¬|fl|‡|·|‚|„|‰|Â|Ú|Ã|Ë|È|Ã|ÃŽ|Ã|ÃŒ|Ó|Ô||Ã’|Ú|Û|Ù|ı|ˆ|Ëœ|¯|˘|Ë™|Ëš|¸|Ë|Ë›|ˇ" & _
-
Hi A customer of asked us to build him a multi-language based support VB6 scraper, for which we had the need to detect UTF-8 based encoded strings to decode it later for proper displaying in application UI. It's necessary to point out that this need arises based on VB6 limitations to natively support UTF-8 in its controls, contrary to what it happens in .NET where you can tell a control that it should expect UTF-8 encoding. VB6 natively supports ISO 8859-1 and/or Windows-1252 encodings only, for which textboxes, dropdowns, listview controls, others can't be defined to natively support/expect UTF-8 as you can do in .NET considering what we just explained; so we would see weird symbols such as é, è among others, making it a whole mess at the time of displaying. So, next function contains whole UTF-8 encoded punctuation marks and symbols from languages like Spanish, Italian, German, Portuguese, French and others, based on an excellent UTF-8 based list we got from this link - Ref. http://home.telfort.nl/~t876506/utf8tbl.html Basically, the function compares if each and one of the listed UTF-8 encoded sentences, separated by | (pipe) are found in our passed string making a substring search first. Whether it's not found, it makes an alternative ASCII value based search to get a match. Say, a string like "Societé" (Society in english) would return FALSE through calling isUTF8("Societé") while it would return TRUE when calling isUTF8("SocietÈ") since È is the UTF-8 encoded representation of é. Once you got it TRUE or FALSE, you can decode the string through DecodeUTF8() function for properly displaying it, a function we found somewhere else time ago and also included in this post. [code] Function isUTF8(ByVal ptstr As String) Dim tUTFencoded As String Dim tUTFencodedaux Dim tUTFencodedASCII As String Dim ptstrASCII As String Dim iaux, iaux2 As Integer Dim ffound As Boolean ffound = False ptstrASCII = "" For iaux = 1 To Len(ptstr) ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|" Next tUTFencoded = "Ä|Ã…|Ç|É|Ñ|Ö|ÃŒ|á|Ã|â|ä|ã|Ã¥|ç|é|è|ê|ë|Ã|ì|î|ï|ñ|ó|ò|ô|ö|õ|ú|ù|û|ü|â€|°|¢|£|§|•|¶|ß|®|©|â„¢|´|¨|â‰|Æ|Ø|∞|±|≤|≥|Â¥|µ|∂|∑|âˆ|Ï€|∫|ª|º|Ω|æ|ø|¿|¡|¬|√|Æ’|≈|∆|«|»|…|Â|À|Ã|Õ|Å’|Å“|–|—|“|â€|‘|’|÷|â—Š|ÿ|Ÿ|â„|€|‹|›|ï¬|fl|‡|·|‚|„|‰|Â|Ú|Ã|Ë|È|Ã|ÃŽ|Ã|ÃŒ|Ó|Ô||Ã’|Ú|Û|Ù|ı|ˆ|Ëœ|¯|˘|Ë™|Ëš|¸|Ë|Ë›|ˇ" & _
This forum is for asking questions - hence the big red button labelled "Ask a Question". Your post looks more like it should be a tip/trick[^]. Also, posting your email address in a public forum is a great way to have your inbox flooded with spam.
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
-
This forum is for asking questions - hence the big red button labelled "Ask a Question". Your post looks more like it should be a tip/trick[^]. Also, posting your email address in a public forum is a great way to have your inbox flooded with spam.
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
Hi Richard, maybe we didn't know there was such a tip/tricks section?
-
Hi Richard, maybe we didn't know there was such a tip/tricks section?
-
Great, I've posted it there now. Didn't know such a section existed in the site
-
Great, I've posted it there now. Didn't know such a section existed in the site