If the registry provided an unambiguous, stable definition of each charset identifier in terms of an explicit, available mapping to Unicode/10646 (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just a difference in format, not content), it would indeed be useful. However, I suspect quite strongly that it is a futile task. There are a number of problems with the current registry.
1. Poor registrations (minor)
There are some registered charset names that are not syntactically compliant with the spec.
2. Incomplete (more important)
There are many charsets (such as some Windows charsets) that are not in the registry, but that are in *far* more widespread use than the majority of the charsets in the registry. Attempted registrations have simply been left hanging; cf.
http://mail.apps.ietf.org/ietf/charsets/msg01510.html
3. Ill-defined registrations (crucial)
a) There are registered names that have useless (inaccessible or unstable) references; there is no practical way to figure out what the charset definition is.
b) There are other registrations that are defined by reference to an available chart, but when you actually test what the vendor's APIs map to, they *use* a different definition: for example, the chart may say that 0x80 is undefined, but the API actually maps it to U+0080.
c) The RFC itself does not settle important issues of identity among charsets. If a new mapping is added to a charset converter, is that a different charset (and thus in need of a different registration) or not? Does that go for any superset? etc. We've raised these issues before, but with no resolution (or even an attempt at one). Cf.
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html
As a product of the above problems, the actual results obtained by using the IANA charset names on any given platform* may vary wildly. For example, among the IANA-registry-named charsets, there were over a million mapping differences in total between Sun's and IBM's Java.
* "platform" speaking broadly -- the results may vary by OS (Mac vs Windows vs Linux...), by programming language (Java), by version of programming language runtime (IBM vs Sun's Java), or even by product (database version).
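To make "mapping differences" concrete, here is a minimal Python sketch of how two converters can be compared byte by byte. It is only an illustration: it uses two of Python's own codecs (latin-1 and cp1252) as stand-ins for two platforms' converters, it covers single-byte input only, and the function names are hypothetical. It is not how the Sun/IBM comparison was actually done.

```python
def byte_mapping(codec):
    """Return {byte value: decoded string, or None if unmapped} for 0x00..0xFF."""
    table = {}
    for b in range(256):
        try:
            table[b] = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            table[b] = None  # this converter treats the byte as unmapped
    return table


def mapping_differences(codec_a, codec_b):
    """Return {byte value: (result_a, result_b)} for every byte that differs."""
    a = byte_mapping(codec_a)
    b = byte_mapping(codec_b)
    return {k: (a[k], b[k]) for k in range(256) if a[k] != b[k]}


if __name__ == "__main__":
    diffs = mapping_differences("latin-1", "cp1252")
    for byte, (left, right) in sorted(diffs.items()):
        show = lambda s: "unmapped" if s is None else f"U+{ord(s):04X}"
        print(f"0x{byte:02X}: {show(left)} vs {show(right)}")
```

Even for this pair, every byte from 0x80 to 0x9F comes out differently; summing that kind of difference over all registry names and multi-byte charsets is how the totals get large.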
In ICU, for example, our requirement was to be able to reproduce the actual, observable character conversions in effect on any platform. With that goal, we basically had to give up trying to use the IANA registry at all. We compose mappings by scraping: calling the APIs on those platforms to do conversions, collecting the results, and providing a different internal identifier for any differing mapping. We then have a separate name mapping that goes from each platform's name for each charset (the name according to that platform) to the unique identifier. Cf.
http://icu.sourceforge.net/charts/charset/.
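The scraping approach can be sketched roughly as follows. This is not ICU's actual tooling, just a small Python illustration under stated assumptions: Python's built-in codecs stand in for a platform's conversion API, only single-byte codecs are handled, and the identifier scheme (a truncated SHA-256 of the observed table) and all function names are made up for the example.

```python
import hashlib


def scrape_mapping(codec):
    """Call the converter on every byte and record what it actually does."""
    rows = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
            rows.append(f"{b:02X}=U+{ord(ch):04X}")
        except UnicodeDecodeError:
            rows.append(f"{b:02X}=unmapped")
    return rows


def mapping_id(rows):
    """Derive an internal identifier from the observed mapping (hypothetical scheme)."""
    digest = hashlib.sha256("\n".join(rows).encode("ascii")).hexdigest()
    return "map-" + digest[:12]


def build_name_map(names):
    """Map each platform-side name to the identifier of the mapping it really uses."""
    return {name: mapping_id(scrape_mapping(name)) for name in names}


if __name__ == "__main__":
    for name, ident in build_name_map(["latin-1", "iso-8859-1", "cp1252"]).items():
        print(f"{name:12s} -> {ident}")
```

Names that behave identically (here latin-1 and iso-8859-1) end up with the same identifier, while any observable difference, however small, yields a new one. The name only serves as a lookup key; the identity of the charset comes from the measured behavior.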
And based on work here at Google, it is pretty clear that -- at least in terms of web pages -- little reliance can be placed on the charset information. As imprecise as heuristic charset detection is, it is more accurate than relying on the charset tags in the HTML meta element (and what is in the HTML meta element is more accurate than what is communicated by the HTTP protocol).
So while I applaud your goal, I suspect that it would be a huge amount of effort for very little return.