Markus Kuhn | 31 May 2007 20:23

Conversion-free switching between binary and character strings in Perl

Let's say I live in a completely ISO 8859/etc.-free world, that I don't
care about the existence of any character representation other than
UTF-8, and that I am therefore absolutely not interested in any form of
character encoding conversion function.

How can I then switch between a "byte string" and a "character string"
in Perl without ever actually touching the stored bytes of the string?
All I want to change is the UTF-8 flag associated with a string, the one
that tells the regular expression engine, for example, whether /./ matches
just a single byte or an entire UTF-8 character.

It seems the low-level Perl functions utf8::upgrade(),
utf8::downgrade(), utf8::encode(), and utf8::decode() (see "man 3 utf8")
are not usable, because they interpret and convert any binary string as
if it were an ISO 8859-1 string. I don't want to load any huge encoding
packages such as "use encoding 'utf8';" or "use Encode;", because I
neither need nor want any character encoding conversion functions. All I
want to change is a simple flag. Unfortunately, the documentation is far
from clear on how to do this, and my experimentation leads to strange
results that look like strings going through several ISO 8859-1 to UTF-8
conversion steps (whereas I want zero of these).
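
One way to check what these functions actually do is a small experiment
like the sketch below. It is only an illustration, not part of the original
question; the variable names are made up here. It applies utf8::upgrade()
and utf8::decode() to the same two bytes and compares the results. As far
as the perlfunc/utf8 documentation goes, upgrade() reads the bytes as
ISO 8859-1 characters, whereas decode() reads them as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# utf8::upgrade: the two bytes are taken as two Latin-1 characters
# (U+00C2, U+00A9); the character count stays at 2.
my $up = pack("C2", 0xc2, 0xa9);
utf8::upgrade($up);
my $up_len   = length($up);
my $up_first = ord($up);

# utf8::decode: the two bytes are taken as one UTF-8 sequence;
# on success the string becomes one character (U+00A9).
my $dec = pack("C2", 0xc2, 0xa9);
my $ok  = utf8::decode($dec);
my $dec_len   = length($dec);
my $dec_first = ord($dec);

printf "upgrade: length=%d first=U+%04X\n", $up_len,  $up_first;
printf "decode:  length=%d first=U+%04X\n", $dec_len, $dec_first;
```

The utf8:: functions are built into perl and need no "use" line, so this
loads no encoding module at all.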

Any help?

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain


Egmont Koblinger | 31 May 2007 20:35

Re: Conversion-free switching between binary and character strings in Perl

> How can I then switch between a "byte string" and a "character string"

I guess you're looking for Encode::_utf8_{on,off}
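
A minimal sketch of how these two functions might be used, assuming the
string really does hold valid UTF-8 (the Encode documentation marks them as
internal and notes that _utf8_on() does no validity check). The variable
names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Two raw bytes: the UTF-8 encoding of U+00A9 COPYRIGHT SIGN.
my $s = pack("C2", 0xc2, 0xa9);
my $len_bytes = length($s);            # flag off: counted as bytes

Encode::_utf8_on($s);                  # set the flag only, bytes untouched
my $len_chars = length($s);            # flag on: counted as characters
my $codepoint = sprintf("U+%04X", ord($s));

Encode::_utf8_off($s);                 # clear the flag again
my $len_again = length($s);

print "bytes=$len_bytes chars=$len_chars cp=$codepoint bytes_again=$len_again\n";
# prints: bytes=2 chars=1 cp=U+00A9 bytes_again=2
```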

-- 
Egmont

Markus Kuhn | 31 May 2007 21:43

Re: Conversion-free switching between binary and character strings in Perl

Egmont Koblinger wrote on 2007-05-31 18:35 UTC:
> > How can I then switch between a "byte string" and a "character string"
> 
> I guess you're looking for Encode::_utf8_{on,off}

Looks good, but I can't get this to work either:

#!/usr/bin/perl
use Encode;
$s = pack("C2", 0xc2, 0xa9); # binary string containing COPYRIGHT SIGN
print "length=", length($s),"\n"; # gives 2
print "utf8=", Encode::is_utf8($s),"\n"; # gives false
# Convert non-ASCII UTF-8 into XML numeric character reference
$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/Encode::_utf8_on($1),sprintf("&#x%02X;", ord($1))/ge;
print "$s\n"; # we want to see here: &#xA9;

$ ./test.pl
length=2
utf8=
&#xC2;

Is there something special about $1 inside a s/.../.../ge expression
that prevents the application of Encode::_utf8_on($1)?

Seems so, since

$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/$a = $1,Encode::_utf8_on($a),sprintf("&#x%02X;", ord($a))/ge;

does the trick.

Larry Wall | 31 May 2007 22:22

Re: Conversion-free switching between binary and character strings in Perl

On Thu, May 31, 2007 at 08:43:13PM +0100, Markus Kuhn wrote:
: Is there something special about $1 inside a s/.../.../ge expression
: that prevents the application of Encode::_utf8_on($1)?
: 
: Seems so, since
: 
: $s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/$a = $1,Encode::_utf8_on($a),sprintf("&#x%02X;", ord($a))/ge;
: 
: does the trick.

Yes, in Perl 5 a magical variable like $1 is essentially a tied
reference into the middle of another string, and not a real value
in its own right, so when you read its value it copies out the
substring and ignores any flags you might have set on the original
scalar variable, since it thinks $1 is a read-only variable.  (And,
in fact, assigning to $1 complains about what it sees as an attempt
to modify a read-only variable, but _utf8_on() is not checking to
see if the scalar is considered writeable.)  But if it didn't simply
ignore the flag when copying out the value, you would have succeeded
in setting the utf8 flag for *all* $1 in your program, because Perl 5
only has one global $1 variable that interrogates the "current match"
every time you read it.
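
The copy-first workaround can be sketched a little more readably as
follows. This is only an illustration under the same assumption as before
(the matched bytes are valid UTF-8); the names $utf8_seq and $c are made up
here, and a lexical $c stands in for the global $a used earlier in the
thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Byte string: "A", the UTF-8 bytes for U+00A9, "B".
my $s = pack("C*", 0x41, 0xc2, 0xa9, 0x42);

# 2-, 3- and 4-byte UTF-8 sequences, same pattern as in the thread.
my $utf8_seq = qr/[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}/;

$s =~ s{($utf8_seq)}{
    my $c = $1;                      # copy the match into an ordinary scalar
    Encode::_utf8_on($c);            # flag the copy, not the magical $1
    sprintf("&#x%02X;", ord($c));    # ord() now sees one character
}ge;

print "$s\n";    # prints: A&#xA9;B
```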

In theory this should all work better in Perl 6, where match variables
are properly lexically scoped, and $1 is just an alias into the list of
matches contained in the current match variable, so the identity of
each match can be preserved.  (Along with the fact that Perl 6 treats
byte strings and character strings as fundamentally different types
that must not be confused with each other.)