Yes, RText only recognizes Unicode files as such if they contain a BOM. Recognizing a UTF-8 file without a BOM would be a little tricky; it involves heuristics. I think the idea is something like this:
- Scan the file's contents for non-7-bit ASCII characters. If the file is huge, scanning a sizable chunk of it is likely sufficient (maybe 32K or 64K?).
- If only 7-bit ASCII is found, assume ASCII (or the system default encoding). If non-ASCII bytes are found but they all form valid UTF-8 sequences, assume UTF-8. If invalid UTF-8 sequences are found, fall back to the system default encoding (?).
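The heuristic above could be sketched roughly like this in Java. This is just a sketch, not anything RText actually does; the class and method names are made up, and Java's CharsetDecoder does the "valid UTF-8?" check for us when told to REPORT malformed input instead of silently replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingSniffer {

    // Hypothetical helper: given (up to) the first 32K-64K of a file with no
    // BOM, guess an encoding using the heuristic described above.
    static Charset sniff(byte[] chunk) {
        // Step 1: look for any byte with the high bit set (non-7-bit ASCII).
        boolean sevenBitOnly = true;
        for (byte b : chunk) {
            if ((b & 0x80) != 0) {
                sevenBitOnly = false;
                break;
            }
        }
        if (sevenBitOnly) {
            // Pure ASCII: the system default encoding is a safe answer.
            return Charset.defaultCharset();
        }
        // Step 2: non-ASCII bytes present - accept UTF-8 only if every
        // sequence decodes cleanly.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(chunk));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Step 3: invalid UTF-8 - fall back to the system default.
            return Charset.defaultCharset();
        }
    }
}
```

One caveat with the chunk-based approach: a 64K cutoff can split a multi-byte UTF-8 sequence in half, so a real implementation would want to trim any incomplete trailing sequence before validating.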
I have reservations about implementing this. What if the system default encoding is not 7-bit ASCII-compatible? Some (all?) Linuxes use UTF-8 as the default, I believe (in which case this is all moot, but still). I think the big question mark would be Asian locales: what are the system default encodings in China, Japan, Korea, etc., and how do they compare with ASCII, Windows ANSI, and UTF-8?
One pretty foolproof thing that could be done is this: if no BOM is found and the file type is HTML, search for a charset specified as part of the Content-Type meta tag; if the file is XML, grab the encoding specified in the declaration. That would be helpful for these very common file types. If you'd like to see this implemented in a future release, please fill out a Feature Request on RText's SourceForge tracker.
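For what it's worth, the HTML/XML case really is simple; something like the following regex-based sketch would cover it. Again, the class name and patterns here are my own illustration, not RText code, and a production version would want to handle more meta-tag variations:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclaredEncoding {

    // Matches e.g. <meta http-equiv="Content-Type"
    //                    content="text/html; charset=utf-8">
    private static final Pattern HTML_CHARSET = Pattern.compile(
            "<meta[^>]+charset=[\"']?([A-Za-z0-9._-]+)",
            Pattern.CASE_INSENSITIVE);

    // Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    private static final Pattern XML_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding=[\"']([A-Za-z0-9._-]+)[\"']");

    // Returns the declared charset name from the start of an HTML file,
    // or null if none is declared.
    static String fromHtml(String head) {
        Matcher m = HTML_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Returns the encoding from an XML declaration, or null if absent.
    static String fromXml(String head) {
        Matcher m = XML_ENCODING.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

Only the first kilobyte or two of the file would need to be searched, since both declarations have to appear near the top of the document.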