utf-8 no BOM

Questions on using RText should go here.

Moderator: robert

Postby Adam » Fri May 04, 2012 10:12 am

How do I open a file whose encoding is UTF-8 without a BOM?
Adam
 

Re: utf-8 no BOM

Postby Adam » Fri May 04, 2012 10:25 am

I know how to do it now.
If only RText could recognize it automatically!
Adam
 

Re: utf-8 no BOM

Postby robert » Fri May 04, 2012 2:18 pm

Hi Adam,

Yes, RText only recognizes Unicode files as such if they contain a BOM. Recognizing a UTF-8 file without a BOM is a little tricky because it involves heuristics. I think the idea is something like this:

  • Scan the file's contents for bytes outside 7-bit ASCII. If the file is huge, scanning a sizable chunk (maybe 32K or 64K?) is likely sufficient.
  • If only 7-bit ASCII is found, assume ASCII (or the system default encoding). If not, but only valid UTF-8 sequences are found, assume UTF-8. If invalid UTF-8 sequences are found, assume the system default encoding (?).
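The two steps above could be sketched in Java roughly as follows. This is only an illustration of the heuristic, not RText's code: the class name EncodingGuesser and the method guess are made up for this example, and the caller is assumed to have already read the sample (say, the first 64K of the file) into a byte array and to pass in whatever it considers the system default encoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of RText.
final class EncodingGuesser {

    /** Guesses the charset of a BOM-less sample taken from the start of a file. */
    static Charset guess(byte[] sample, Charset systemDefault) {
        // Step 1: look for any byte with the high bit set (i.e. not 7-bit ASCII).
        boolean sevenBitOnly = true;
        for (byte b : sample) {
            if (b < 0) {
                sevenBitOnly = false;
                break;
            }
        }
        if (sevenBitOnly) {
            return systemDefault; // plain ASCII: nothing to distinguish
        }

        // Step 2: strict UTF-8 validation. REPORT makes the decoder return an
        // error result instead of silently substituting U+FFFD.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // UTF-8 never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(sample.length);
        // endOfInput=false: a multi-byte sequence cut off at the end of the
        // sample is treated as "needs more input", not as an error.
        CoderResult result = strict.decode(ByteBuffer.wrap(sample), out, false);
        return result.isError() ? systemDefault : StandardCharsets.UTF_8;
    }
}
```

A caller would use something like EncodingGuesser.guess(firstChunk, Charset.defaultCharset()). Note that this only answers "is it valid UTF-8?"; it does nothing to tell apart the various non-UTF-8 encodings, which is exactly the open question for the fallback case.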

I have reservations about implementing this. What if the system default encoding is not a 7-bit one? Some (all?) Linux distributions default to UTF-8, I believe (in which case this is all moot, but still). I think the big question mark would be Asian locales: what are the system default encodings in China, Japan, Korea, etc., and how do they compare with ASCII, Windows ANSI, and UTF-8?

One pretty foolproof thing that could be done when no BOM is found: if the file is HTML, search for a charset specified as part of the Content-Type, and if the file is XML, grab the encoding specified in the XML declaration. That would be helpful for these very common file types. If you'd like to see this implemented in a future release, please fill out a Feature Request on RText's SourceForge tracker.
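To give an idea of what that lookup might involve, here is a rough Java sketch. Again, this is an assumption-laden illustration rather than anything in RText: the class DeclaredCharset, its find method, and the two regular expressions are invented for this example, and the regexes are deliberately simple (they do not implement the full HTML or XML encoding-sniffing rules). The head of the file is decoded as ISO-8859-1 purely so the ASCII markup can be searched without knowing the real encoding yet.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of RText.
final class DeclaredCharset {

    // e.g. <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECL = Pattern.compile(
            "^\\s*<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    // e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    // or   <meta charset="utf-8">
    private static final Pattern HTML_META = Pattern.compile(
            "<meta[^>]*?\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the first bytes of an HTML or XML file, if any. */
    static Optional<Charset> find(byte[] head, boolean isXml) {
        // ISO-8859-1 maps every byte to a char, so this decoding cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = (isXml ? XML_DECL : HTML_META).matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "not declared".
            return Optional.empty();
        }
    }
}
```

If find returns an empty Optional, the editor would simply fall back to whatever it does today for BOM-less files.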
robert
 
Posts: 788
Joined: Sat May 10, 2008 5:16 pm

Re: utf-8 no BOM

Postby Adam » Sat May 05, 2012 2:26 am

I have added it.
Adam
 

