Mysterious behavior


CarlSeiler
09-24-2007, 06:44 PM
Lately I've been finding more and more sites that display strange stray characters. In the past, I could usually chalk this up to people dumping pages from desktop publishing or word processing programs that include such weird things as curly quotes, etc. Recently, however, I can't explain what I see.

Let's take this page from Scientific American (http://sciam.com/article.cfm?chanID=sa013&articleID=7750A576-E7F2-99DF-3824E0B1C2540D47&pageNumber=3&catID=2). I'm reading along in Firefox under Linux, and in place of each end quote, instead of the usual " mark, I see a not-equals mark ≠ (not sure if that will show up, but it's Unicode char 8800, I think). I think it's odd that they'd have this character for an end quote, so I look at the page in Firefox and IE 6 under Windows, and guess what: there are no end quotes visible. I think, "What's going on here?" So I take a peek at the source code, and there don't seem to be any end quotes in the HTML source when I look at it in Windows. When I look at the source under Linux, the ≠ character is there in the text file, plain as day. Weird! The question is: am I the only one who has problems like this, and if not, why don't major web sites see their problems?

CarlSeiler
09-24-2007, 07:02 PM
I also noted in Firefox some upside-down question marks ( ¿ ) sprinkled throughout the article, including in the middle of the word "formal" near the top of the first page of the article. To make it even weirder, this shows up as a Korean character in IE 6. I'll attach a few screen captures.

I don't think Windows XP and IE6 are all that uncommon, and neither is Firefox under Windows. What are they doing at these major sites that they can't see they're picking weird characters or formatting? Am I a gnat for asking?

iamback
09-24-2007, 10:41 PM
I've noticed recently that increasingly I'm finding that sites are displaying strange stray characters. In the past, I could usually chalk this up to people dumping pages from desktop publishing or word processing programs that include such weird things as curly quotes, etc. Recently, however, I can't explain what I see.

Looking at request and response headers, and the source of the page, I note that neither the page nor the server defines a character set. At all. Boo!

The result is that your browser will have to guess, and (most likely) assume that it's getting the content in what it has configured as the preferred character set.

For me, that's UTF-8, using Firefox - and I see no "strange" characters, but I do see missing end quotes. I'm guessing they're probably not using UTF-8; high-ASCII characters may be used, possibly even characters that are "undefined" in HTML (soft hyphen as a character rather than an entity? MS "fancy" quotes?). Anybody's guess how that's going to be interpreted, but I'm sure it depends on your browser's setting for preferred character set.

ktinkel
09-25-2007, 05:28 AM
Looking at request and response headers, and the source of the page, I note that neither the page, nor the server, defines a character set. At all. Boo!

The result is that your browser will have to guess, and (most likely) assume that it's getting the content in what it has configured as the preferred character set.

For me, that's UTF-8, using Firefox - and I see no "strange" characters, but I do see missing end quotes. I'm guessing they're probably not using UTF-8; high-ASCII characters may be used, possibly even characters that are "undefined" in HTML (soft hyphen as a character rather than an entity? MS "fancy" quotes?). Anybody's guess how that's going to be interpreted, but I'm sure it depends on your browser's setting for preferred character set.

I copied the first couple of paragraphs, in which I saw odd characters in Safari, and pasted them into TextEdit, and everything was fixed.

Except the missing closing quote, which is evidently just not there. Sloppy — I’d expect something better from Scientific American.

Michael Rowley
09-25-2007, 06:32 AM
Marjolein:

MS "fancy" quotes

What can you mean by that? Any Microsoft program gives 'curly quotes' (inverted commas) that are in the upper ASCII set and in Unicode.

iamback
09-25-2007, 08:12 AM
What can you mean by that? Any Microsoft program gives 'curly quotes' (inverted commas) that are in the upper ASCII set and in Unicode.

When they do not specify a character set, it's a pretty safe assumption that they're not using Unicode in any form (for a human, that is, not for a browser). But if they are using characters in the upper ASCII range without telling the browser that they are doing so, then for a browser that expects UTF-8 those characters are either invalid by themselves or will be interpreted as part of a double-byte character - which explains Carl's observation of the upside-down question mark and the Korean character.

The "fancy" quotes as I call them (curly quotes) are just an example: anything in the upper ASCII is not valid (by itself) as a character in UTF-8.
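
To make that concrete, here is a minimal Python sketch. The sample word and the helper name `try_decode` are assumptions for illustration only; nothing here is taken from the actual Scientific American page. It shows curly quotes stored as single Windows-1252 bytes being rejected by a strict UTF-8 decoder, while genuine UTF-8 spends three bytes on each quote.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical sample: the word "formal" in curly quotes, saved as Windows-1252.
sample = "\u201cformal\u201d".encode("cp1252")

print(sample)                                # the raw bytes a server would send
print(try_decode(sample, "cp1252"))          # readable once the charset is known
print(try_decode(sample, "utf-8"))           # None: 0x93 and 0x94 are invalid on their own
print("\u201cformal\u201d".encode("utf-8"))  # real UTF-8 uses three bytes per curly quote
```

A browser is usually more forgiving than this strict decoder, so what it actually shows for the bad bytes varies.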

dthomsen8
09-25-2007, 08:52 AM
...
Sloppy — I’d expect something better from Scientific American.

It would seem that Marjolein is onto the problem, no character set specified in the source. Can we somehow complain to Scientific American? If enough of us complain, maybe they might shape up.

iamback
09-25-2007, 08:57 AM
It would seem that Marjolein is onto the problem, no character set specified in the source.

Specifying it in the source is just a fix you can use when you can't do it on the server (as when you're using shared hosting and can't specify the HTTP headers yourself).

They're simply being extremely sloppy in not specifying what headers the pages are sent out with, when they are certainly in control of their own server - no excuse: it doesn't need to be in their source, it should be in the HTTP headers.

dthomsen8
09-25-2007, 08:58 AM
I wrote to webmaster@sciam.com as follows:

"Your website does not specify a character set, which means that the browser has to guess, sometimes with unhappy results. Please have your technical people look into specifying a character set in the source code."

So, everybody please join in!

iamback
09-25-2007, 09:00 AM
I wrote to webmaster@sciam.com as follows:

"Your website does not specify a character set, which means that the browser has to guess, sometimes with unhappy results. Please have your technical people look into specifying a character set in the source code."
Not bad. :) But why not point them to this thread?

Michael Rowley
09-25-2007, 10:47 AM
Marjolein:

The "fancy" quotes as I call them (curly quotes) are just an example: anything in the upper ASCII is not valid (by itself) as a character in UTF-8

The 'fancy' double opening quote, “, is 0147 in ANSI, because I can't remember the Unicode number or put it in here; Microsoft says it's U+201C, the 'Left Double Quotation Mark', because Windows 2000 and later use Unicode, though they still respond to the ANSI decimal code. I think if a browser is not set to use Unicode, it will try to find the ANSI equivalent, if there is one. If you haven't got the same Code Page as the writer of the Web page, you probably will get funny results.

iamback
09-25-2007, 12:28 PM
All I'm saying is if the browser is set to expect UTF-8 (like mine is) then anything in the upper ASCII range (including the "undefined" range) will result in unpredictable results, with "funny" characters. Just like Carl observed.

If you use any upper ASCII, you must define the charset. Apart from UTF-8 (where it is invalid), other character sets may have different interpretations for those characters.

Michael Rowley
09-25-2007, 01:57 PM
Marjolein:

Apart from UTF-8 (where it is invalid), other character sets may have different interpretations for those characters

All the letters that have decimal codes up to 255 in the Western Latin code page have identical decimal codes in Unicode; only the punctuation codes are different, with the exception of the few marks in the lower ASCII range. Therefore, it is too sweeping to claim that codes used in any character set are 'invalid' in UTF-8. Some are, some aren't.

I wonder how characters are interpreted in this forum: we often (well, fairly often) enter ANSI character codes that need interpretation in terms of Unicode.

iamback
09-25-2007, 10:24 PM
All the letters that have decimal codes up to 255 in the Western Latin code page have identical decimal codes in Unicode; only the punctuation codes are different, with the exception of the few marks in the lower ASCII range. Therefore, it is too sweeping to claim that codes used in any character set are 'invalid' in UTF-8. Some are, some aren't.

The equivalents will not be invalid but the actual bytes with values 128-255 are invalid as characters in UTF-8. Only some of them can occur as part of a two-byte character - and then the following byte has to make up a valid character.

So equivalents don't matter - the site doesn't say what it uses so the browser, set to UTF-8, interprets it that way (it sees only bytes, not characters); it encounters invalid bytes, and an occasional one it can interpret as part of a two-byte character, leading to strange things like a Korean character in the middle of a word.

I wonder how characters are interpreted in this forum: we often (well, fairly often) enter ANSI character codes that need interpretation in terms of Unicode.

As far as I can tell they're simply converted to their equivalents in Unicode - not that hard to do. But of course this site knows it's using UTF-8 and proclaims the fact as well. Big difference.

Michael Rowley
09-26-2007, 05:26 AM
Marjolein:

The equivalents will not be invalid but the actual bytes with values 128-255 are invalid as characters in UTF-8

I'm afraid you've lost me. Take an example: Alt+0214, if typed here (I'm using Windows), is converted by the time it reaches my computer & monitor (or computer & printer) into the succession of bits that is interpreted as the glyph Ö, which looks like a representation of Latin capital letter O with diaeresis. If I could convey the same instruction represented as U+00D6, that would give the same succession of bits, with the same result. I say that if I am using Code Page 1252, Alt+0214 and U+00D6 give the same succession of bits & bytes that result in Ö. You tell me that they're only 'equivalent'. That is what I don't understand, and am not likely to.

My point about this forum is that some ANSI codes are identically equal (not equivalent) to Unicode code points, others are not, if I'm using Code Page 1252; but whether they're equal or only equivalent all are correctly interpreted, providing the forum is interpreting between Code Page 1252 and Unicode.

iamback
09-26-2007, 07:11 AM
I'm afraid you've lost me. Take an example: Alt+0214 if typed here (I'm using Windows) converts when it reaches my computer & monitor (or computer & printer) into the succession of bits that are interpreted as the glyph Ö, which looks like a representation of Latin capital letter O with diaeresis.
Sure, but a browser retrieving a file from a server is not typing anything, nor is the server typing anything.

A server does not send glyphs: a server just sends a stream of bytes and is required to tell the browser how to interpret that stream of bytes by specifying the charset (AKA encoding) used for that stream of bytes.

The server we're discussing here, the one Scientific American is hosted on, sends a stream of bytes but does not in any way tell the browser what the charset is.

Now whether that stream of bytes actually was encoded as charset ISO-8859-1 (it probably is), the browser has no way to tell, and if it's set to prefer UTF-8 and isn't told otherwise by what it receives, then it will interpret that stream of bytes as UTF-8. Unfortunately, that stream of bytes is - in this case - not valid UTF-8 (no upper-ASCII values are valid UTF-8 characters by themselves) but the browser does its best, interpreting a 2-byte combination as a character when that happens to match.
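
A small Python sketch of that "does its best" behaviour, under stated assumptions: the helper name `decode_leniently` is made up, and Python's `errors="replace"` mode only stands in for whatever a real browser does with bad bytes. It takes the letter Ö stored as a single ISO-8859-1 byte and reads it as UTF-8, once followed by a plain space and once followed by a no-break space.

```python
def decode_leniently(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for any byte that does not fit."""
    return data.decode("utf-8", errors="replace")


# "Ö" followed by an ordinary space, stored as ISO-8859-1: bytes D6 20.
lonely = "\u00d6 ".encode("iso-8859-1")
# "Ö" followed by a no-break space, stored as ISO-8859-1: bytes D6 A0.
unlucky = "\u00d6\u00a0".encode("iso-8859-1")

print(repr(decode_leniently(lonely)))   # a replacement character, then the space
print(repr(decode_leniently(unlucky)))  # both bytes swallowed into one unrelated character
```

In the second case the two bytes happen to form a valid two-byte UTF-8 sequence, so a single foreign-looking character appears where two Latin-1 characters were meant.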

My point about this forum is that some ANSI codes are identically equal (not equivalent) to Unicode code points, others are not, if I'm using Code Page 1252; but whether they're equal or only equivalent all are correctly interpreted, providing the forum is interpreting between Code Page 1252 and Unicode.

Sure, I already said that what you type is converted to UTF-8 because this forum specifically uses UTF-8 and always advertises the fact.

But what happens on the forum is irrelevant here, since the server this forum is hosted on always specifies the charset in the HTTP headers, and the forum software does it again in the source code it sends; which is as it should be and which doesn't cause any problem.

Scientific American's website does not, not in the HTTP headers (as it should), and not in the page source (as it could), and that's the problem exposed here.

dthomsen8
09-26-2007, 07:48 AM
Not bad. :) But why not point them to this thread?

Everybody should complain directly, to: webmaster@sciam.com and perhaps that will produce results.

Whether they correct it in the source, or in a supposed server when I don't know if they have a server, is irrelevant to the user. Can the user tell if the character set is specified with HTTP when it is not in (X)HTML?

ElyseC
09-26-2007, 10:20 AM
I copied the first couple of paragraphs, in which I saw odd characters in Safari, and pasted them into TextEdit, and everything was fixed.

Except the missing closing quote, which is evidently just not there. Sloppy — I’d expect something better from Scientific American.

If you paste text copied from web pages and email into BBEdit or TextWrangler and show invisible characters, you might see gray bullets. I see it a lot when processing text sent in the body of an email.

I'm grateful for TextWrangler's "zap gremlins" feature which removes such crud. TextEdit keeps it all and doesn't show it to you, but BareBones' products let you see and get rid of it.

ktinkel
09-26-2007, 11:48 AM
If you paste text copied from web pages and email into BBEdit or TextWrangler and show invisible characters, you might see gray bullets. I see it a lot when processing text sent in the body of an email.

I'm grateful for TextWrangler's "zap gremlins" feature which removes such crud. TextEdit keeps it all and doesn't show it to you, but BareBones' products let you see and get rid of it.

Why don’t you just set TextWrangler’s default encoding to UTF-8? That generally solves the problem.

Michael Rowley
09-26-2007, 12:07 PM
Marjolein:

A server does not send glyphs: a server just sends a stream of bytes

It is apparent that your idée fixe is that ANSI codes between 128 and 255 are never the same as Unicode code points; the fact that some of them are (if the code page is 1252) and some of them aren't will have no effect on it. But nor do I believe that anything sends glyphs; I don't believe there are fairies at the bottom of my garden either.

As the Scientific American is written entirely in English, and most of the Web is too, I would expect a browser to assume Code Page 1252 if it is given no other instructions. Like KT, I didn't see any mark in place of the obviously missing ” (with FF or IE), but there did seem to be space for it.

iamback
09-26-2007, 07:00 PM
Whether they correct it in the source, or in a supposed server when I don't know if they have a server, is irrelevant to the user.

It may be irrelevant to the user but in my experience it doesn't hurt to tell a webmaster exactly what's wrong. ;)

And yes, they have their own server: their IP address is 216.68.232.11 which is part of netblock 216.68.0.0 - 216.68.255.255 that belongs to Zimmerman Communications, Inc. (http://www.zimcom.net/) - and while they offer shared hosting that's not their main business. Don't expect a site with 326,813 U.S. visitors per month (http://whois.domaintools.com/sciam.com) to use shared hosting. ;)


Can the user tell if the character set is specified with HTTP when it is not in (X)HTML?
Sure a (knowledgeable) user can tell - I did, didn't I? All you need is a tool to look at HTTP headers, and every web developer should have one! (I use the Proxomitron for that; I think there's also a Firefox plugin, and of course there are more.)
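
For anyone without such a tool, the check itself is simple. A minimal Python sketch, assuming you have already copied the `Content-Type` header value out of a response; the function name `declared_charset` and the sample header values are hypothetical, not captured from sciam.com.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset named in a Content-Type header value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() parses the charset parameter and lowercases it.
    return msg.get_content_charset()


# Hypothetical header values, for illustration only.
print(declared_charset("text/html; charset=UTF-8"))  # a server that declares its encoding
print(declared_charset("text/html"))                 # a server that leaves the browser guessing
```

With the standard library's `urllib.request.urlopen(url)`, the same method is available on the response's `headers` object, which saves copying the header by hand.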

iamback
09-26-2007, 07:19 PM
It is apparent that your idée fixe is that ANSI codes between 128 and 255 are never the same as Unicode code points; the fact that some of them are (if the code page is 1252) and some of them aren't will have no effect on it.

They may be "the same as" Unicode code points but they are not UTF-8 code points. And your "if the code page is" is the crux: No character set is defined so the browser has no way to know what the "code page" is. It asked for preferably UTF-8 and gets back a stream of bytes without a specification: what does it make of that? All it can do is assume it gets what it asked for because it isn't told otherwise.

As the Scientific American is written entirely in English, and most of the Web is too, I would expect a browser to assume Code Page 1252 if it is given no other instructions.

The default document character set for HTML is Unicode. But if, as in this case, the browser has specified it prefers UTF-8 encoding and does not receive a response that tells it what it gets is different from that (not only is the character set not specified anywhere, the language isn't either!), it has absolutely no reason to "assume" otherwise. Your expectation is wrong.

Steve Rindsberg
09-26-2007, 07:56 PM
Michael, the point is essentially this:

07 8D 3F 44 DD CF 09 42

There's your stream of bytes.

Now. Tell me what that spells.

You can't, because I haven't told you the encoding system. It might be ASCII, might be EBCDIC, JIS, Unicode ... could be Klingon.

You don't know, so you've no way of determining how the bytes are to be interpreted or even how many bytes equate to a single character.

More or less the same situation as the web site Marjolein's describing.
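
The same eight bytes can be run through a handful of decoders to see the point. This is only a sketch: the helper `interpretations` and the particular list of encodings are my own choices, not anything the original bytes imply.

```python
STREAM = bytes.fromhex("07 8D 3F 44 DD CF 09 42")


def interpretations(data: bytes, encodings):
    """Decode the same bytes under each encoding; None marks a failed decode."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


readings = interpretations(
    STREAM, ["iso-8859-1", "mac_roman", "cp037", "utf-16-le", "utf-8"]
)
for enc, text in readings.items():
    print(f"{enc:>10}: {text!r}")
```

The single-byte encodings each yield eight characters, but different ones; UTF-16 pairs the bytes up into four characters; and strict UTF-8 rejects the stream outright. Nothing in the bytes says which reading was intended.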

iamback
09-26-2007, 11:20 PM
(if the code page is 1252)

Since neither the site nor the page tells us what the character set actually is, I did a few experiments.

Follow after me with Firefox and go to this page (http://sciam.com/article.cfm?chanID=sa013&articleID=7750A576-E7F2-99DF-3824E0B1C2540D47&pageNumber=4&catID=2) (the page after the one Carl referred to). Tell your browser to search for "Tilman" in the page, and you'll likely see one of the cited problems: an "upside-down question mark" in the middle of one of the names. I see that with my default preference, UTF-8:

Tilman Becker, Michael Carter and J¿rg Naeve, all ...

Clearly, that isn't the actual charset used, so what would it be? We can find out (we, as humans, having expectations and some knowledge of what languages look like, not the browser by itself).

The most likely, of course, for a website in English, would be to use ISO-8859-1. So use the browser to choose: View -> Character encoding -> more encodings -> West European -> Western (ISO-8859-1). Same "upside-down question mark" so that's not the actual charset:

Tilman Becker, Michael Carter and J¿rg Naeve, all ...

The next most likely choice is your code page 1252 (which is a Windows character set - a very common mistake found on the web because it may result in mis-interpreted characters by browsers running on other OSs than Windows). Same procedure to switch, ending up with Western (Windows-1252). Guess what? Same "upside-down question mark" again so that's not the actual charset either!

Tilman Becker, Michael Carter and J¿rg Naeve, all ...

What then? Third time lucky, I choose Western (MacRoman). Now suddenly the last name in the sentence yields a name that looks like a real name; thankfully, this forum will convert it for me to UTF-8:

Tilman Becker, Michael Carter and Jørg Naeve, all ...

So what happened? They prepared their copy on a Mac and did not even think to publish it in a character set that will be correctly interpreted by most browsers (ISO-8859-1 or UTF-8), and also did not tell the browser that it actually is MacRoman.

There's still a fair chance a modern browser - like my Firefox - will be able to interpret that correctly, as long as it's told. But it isn't.
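
The experiment can be replayed in a few lines of Python. One assumption is baked in: that the troublesome byte in the name is 0xBF, which is inferred from the ¿ shown under ISO-8859-1 rather than read from the page source. The helper name `decode_with_each` is made up.

```python
# Assumed raw bytes for the name: 0xBF is inferred from the "¿" seen under ISO-8859-1.
RAW_NAME = b"J\xbfrg"

CANDIDATES = ("iso-8859-1", "cp1252", "mac_roman")


def decode_with_each(data: bytes, encodings):
    """Map each candidate encoding to the text it produces for the same bytes."""
    return {enc: data.decode(enc) for enc in encodings}


for enc, text in decode_with_each(RAW_NAME, CANDIDATES).items():
    print(f"{enc:>10}: {text}")
```

Under that assumption, ISO-8859-1 and Windows-1252 both give the upside-down question mark, and only MacRoman turns the byte into ø, which matches what the menu switching showed.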

CarlSeiler
09-27-2007, 02:54 AM
Except the missing closing quote, which is evidently just not there. Sloppy — I’d expect something better from Scientific American.

Something is there for the closing quotes. As I said, I can see the unicode 8800 character ≠ in Linux with Firefox.

iamback
09-27-2007, 03:41 AM
Something is there for the closing quotes. As I said, I can see the unicode 8800 character ≠ in Linux with Firefox.

I don't see them. What have you set in Firefox as your default charset? What do you see when you try different ones? (See above.)

CarlSeiler
09-27-2007, 03:46 AM
I don't see them. What have you set in Firefox as your default charset? What do you see when you try different ones? (See above.)


In my Linux, it seems to be set on Western ISO-8859-1, but I changed it to UTF-8, and I see the same thing. In fact, I changed it to all sorts of things, and it still shows up as the not-equals character. There's clearly something there, even though I can't see it when using Windows.

iamback
09-27-2007, 04:01 AM
In my Linux, it seems to be set on Western ISO-8859-1, but I changed it to UTF-8, and I see the same thing. In fact, I changed it to all sorts of things, and it still shows up as the not-equals character. There's clearly something there, even though I can't see it when using Windows.

Weird - did you try MacRoman?

CarlSeiler
09-27-2007, 04:29 AM
Weird - did you try MacRoman?


Yep, MacRoman did the same thing.

ktinkel
09-27-2007, 06:43 AM
Something is there for the closing quotes. As I said, I can see the unicode 8800 character ≠ in Linux with Firefox.

Weird. No matter what encoding I choose in Safari, I continue to get the hollow box. No matter what in Firefox and Netscape, I see nothing at the end of the quoted word. Or any other quoted word for that matter.

Michael Rowley
09-27-2007, 06:52 AM
Steve:

07 8D 3F 44 DD CF 09 42.
There's your stream of bytes. Now. Tell me what that spells

You're trying to blind me with pseudo-science, and on the whole you've succeeded. The only thing I understand (I think) is that Unicode strings are displayed in groups of four 8-bit bytes, and that the highest decimal number in the Windows (or the Mac) character set is 255, which would be U+00FF in Unicode. All letters in the (Windows) Code Page 1252 have codes that correspond, one-for-one, with the Unicode code points. Now, the original page from the Scientific American had a “, whose code point in Unicode is not the same as either 147 (in Windows) or 210 (in the old Mac): but it displays correctly! But the ” that should follow it is missing (as seen by Firefox or Internet Explorer); if the explanation is that the browser cannot know what code is being used, I would expect both the codes for “ and ” to be liable to misinterpretation—or neither of them.

About the strange ¿, which appears unexpectedly on another page: that has the same code in Windows (191) and in Unicode (U+00BF); in the (old) Mac code (which is probably the same in OS 10), it is 192. Now, how could ¿ appear because the Mac code 191 (a Greek phi) had been misinterpreted as either a Windows code or a Unicode code point?

Now: your bytes. Are they meant to be a succession of eight Windows or Mac codes? If so, I would have expected 0007, 008D, 003F, etc. On the other hand, if they are meant to be the hexadecimal numbers 078D, 3F44, DDCF, 0942, they clearly aren't Windows or Mac codes (which only go to 00FF).

I don't know, and don't need to know, how Unicode code points or decimal codes are presented, but it does seem to me that none of Marjolein's theories is tenable.

Steve Rindsberg
09-27-2007, 07:40 AM
Steve:

You're trying to blind me with pseudo-science, and on the whole you've succeeded. The only thing I understand (I think) is that Unicode strings are displayed in groups of four 8-bit bytes, and that the highest decimal number in the Windows (or the Mac) character set is 255, which would be U+00FF in Unicode. All letters in the (Windows) Code Page 1252 have codes that correspond, one-for-one, with the Unicode code points. Now, the original page from the Scientific American had a “, whose code point in Unicode is not the same as either 147 (in Windows) or 210 (in the old Mac): but it displays correctly! But the ” that should follow it is missing (as seen by Firefox or Internet Explorer); if the explanation is that the browser cannot know what code is being used, I would expect both the codes for “ and ” to be liable to misinterpretation—or neither of them.

About the strange ¿, which appears unexpectedly on another page: that has the same code in Windows (191) and in Unicode (U+00BF); in the (old) Mac code (which is probably the same in OS 10), it is 192. Now, how could ¿ appear because the Mac code 191 (a Greek phi) had been misinterpreted as either a Windows code or a Unicode code point?

Now: your bytes. Are they meant to be a succession of eight Windows or Mac codes? If so, I would have expected 0007, 008D, 003F, etc. On the other hand, if they are meant to be the hexadecimal numbers 078D, 3F44, DDCF, 0942, they clearly aren't Windows or Mac codes (which only go to 00FF).

I don't know, and don't need to know, how Unicode code points or decimal codes are presented, but it does seem to me that none of Marjolein's theories is tenable.
Trying to blind you? Pseudo-science? If that's the way you're going to look at it, I've nothing more to say.

iamback
09-27-2007, 12:48 PM
Weird. No matter what encoding I choose in Safari, I continue to get the hollow box. No matter what in Firefox and Netscape, I see nothing at the end of the quoted word. Or any other quoted word for that matter.

Yes, it looks like all quoted strings consistently are "missing" (or hiding) their end quotes. That feels like a conversion error to me - maybe there were curly quotes to begin with, where the opening quote (in MacRoman) was converted to a normal quote while the closing one wasn't?

Can you tell us what the numerical value of (curly) opening and closing (double) quotes are in MacRoman? That might give a clue.
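
Python's bundled codecs can answer that directly. A minimal sketch; the helper name `byte_value` is made up, and the codec names are Python's own labels for these encodings.

```python
QUOTES = {"opening double quote": "\u201c", "closing double quote": "\u201d"}


def byte_value(char: str, encoding: str):
    """Return the single-byte value of char in the encoding, or None if it has none."""
    try:
        encoded = char.encode(encoding)
    except UnicodeEncodeError:
        return None
    return encoded[0] if len(encoded) == 1 else None


for label, char in QUOTES.items():
    for enc in ("mac_roman", "cp1252", "iso-8859-1"):
        print(f"{label:>21} in {enc:<10}: {byte_value(char, enc)}")
```

MacRoman puts the pair at 210 and 211, Windows-1252 at 147 and 148, and ISO-8859-1 has no curly quotes at all, so the three encodings disagree completely about these two characters.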

iamback
09-27-2007, 12:50 PM
All letters in the (Windows) Code Page 1252 have codes that correspond, one-for-one, with the Unicode code points.

But browsers don't see letters - they see only bytes and have to interpret them as letters. In order to do that, they need to know what encoding was used to create the stream of bytes. So where does "the (Windows) Code Page 1252" come in? Exactly nowhere since neither the site nor the page gives any clue what the "code page" (encoding, actually) is supposed to be.

My theories are quite tenable - I've provided plenty of proof already.

I'll stop trying to explain it to you now. Sorry, I give up.

Michael Rowley
09-27-2007, 02:49 PM
Steve:

If that's the way you're going to look at it, I've nothing more to say

I'm sorry if I have offended you: it was not intended. But I was attempting to convey that there is no obvious way that misinterpreting the encoding can result in some characters being right (presumably) and others being wrong. I have since learned that there are fairly simple algorithms for checking that the encoding is UTF-8.

CarlSeiler
09-27-2007, 06:53 PM
Weird. No matter what encoding I choose in Safari, I continue to get the hollow box. No matter what in Firefox and Netscape, I see nothing at the end of the quoted word. Or any other quoted word for that matter.

I think there must be something being sent at the end of each quoted word, otherwise, how would Firefox under Linux know to display that extra not-equals character?

Any way you look at it, though, it doesn't appear that their articles are looked at on the web by editors who care, or maybe they're all using the same browser/OS combination and to them it looks correct?

(Attaching screen shot from Puppy Linux virtual machine with Mozilla 1.89b with character encoding set to ISO-8859-1, but I get the same thing with MacRoman here, too)

Steve Rindsberg
09-27-2007, 10:17 PM
Steve:



I'm sorry if I have offended you: it was not intended. But I was attempting to convey that there is no obvious way that misinterpreting the encoding can result in some characters being right (presumably) and others being wrong. I have since learned that there are fairly simple algorithms for checking that the encoding is UTF-8.
Let it go then.

At the level we're talking about, the browser sees a string of numbers represented as strings of on and off bits. No letters, no glyphs, just numbers.

Until an agreed upon pattern of numbers at the very beginning of the stream of data tells it how to interpret the numbers that follow, it has to make assumptions which may or may not be correct.

Possibly the numbers should be taken one by one, each mapped to a specific character. Or two by two, four by four or even in some cases different numbers of bytes at a time, depending on whether it first sees certain "escape" sequences of numbers.

No way to know for sure w/o that standard part at the beginning of the data that reveals the encoding. That, in effect, provides the "look-up table" the browser can use to decide which character to display for each number. Or two. Or four. Or ....

Michael Rowley
09-28-2007, 05:34 AM
Steve:

the browser can use to decide which character to display for each number

Yes, but the browsers seem to be choosing sometimes one encoding, sometimes another: confusing.

By the way, I had not realized (until yesterday) that UTF-8 encoding results in something that looks like a string of ANSI etc. codes.

iamback
09-28-2007, 08:08 AM
I'm sorry if I have offended you: it was not intended. But I was attempting to convey that there is no obvious way that misinterpreting the encoding can result in some characters being right (presumably) and others being wrong. I have since learned that there are fairly simple algorithms for checking that the encoding is UTF-8.

Yes, there are - and it isn't. There are NO simple algorithms to check what encoding it is supposed to be instead (apart from humans doing trial and error and looking if the text looks readable).

The browser just cannot tell and it isn't told - period.

Michael Rowley
09-28-2007, 12:43 PM
Marjolein:

Yes, there are - and it isn't

What 'isn't'?

There are bytes that cannot occur in UTF-8 (I've read), and if one is found, the encoding can't be UTF-8. Of course, one might not be found, although the encoding isn't UTF-8.
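
That check, and its blind spot, fit in a few lines of Python. A minimal sketch: the function name `is_valid_utf8` and the sample byte strings are assumptions for illustration, not data from the page under discussion.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form well-formed UTF-8; says nothing about the author's intent."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"J\xbfrg"))                  # a stray high byte: cannot be UTF-8
print(is_valid_utf8("J\u00f8rg".encode("utf-8")))  # properly encoded UTF-8 passes

# Two ISO-8859-1 characters whose bytes also happen to be one valid UTF-8 character.
ambiguous = "\u00c3\u00a9".encode("iso-8859-1")
print(is_valid_utf8(ambiguous), repr(ambiguous.decode("utf-8")))
```

The last case is the "one might not be found" situation: the bytes pass the test even though they were written as ISO-8859-1, so passing the check does not prove the text is UTF-8.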

ElyseC
09-28-2007, 03:24 PM
Why don’t you just set TextWrangler’s default encoding to UTF-8? That generally solves the problem.

I don't want gremlins to hide, I want to see them. Helps me troubleshoot. But I don't even see that option in Text Encoding.

ktinkel
09-28-2007, 05:13 PM
I don't want gremlins to hide, I want to see them. Helps me troubleshoot. But I don't even see that option in Text Encoding.

The main Text Encoding pane lets you choose the collection of text encodings that will appear in the Encodings popups (when you save a file, for example).

All the unicode settings are pre-set (in BBEdit, anyway) and thus greyed out.

What I was suggesting is that you use the same encoding for files (HTML, CSS, etc.) as they use in real life: UTF-8. (Take a file, select Save As, then look at Options at the bottom of the screen.)

If you copy e-mail or other downloads that use unicode encoding into a text file set up with some other encoding you are likely to see odd characters, what you call crud. But in many cases that crud is of your own devising.

ElyseC
09-28-2007, 09:17 PM
If you copy e-mail or other downloads that use unicode encoding into a text file set up with some other encoding you are likely to see odd characters, what you call crud. But in many cases that crud is of your own devising.

I don't recall seeing the gremlins when I paste into an open TextWrangler document, but I often see them when I open email that I've saved to text file. I notice the gremlins in email from specific people, but not all email saved to text file has them.

That's why I'm skeptical that the gremlins are of my own devising.

iamback
09-29-2007, 01:39 AM
What 'isn't'?

The page isn't UTF-8.

There are bytes that cannot occur in UTF-8 (I've read), and if one is found, the encoding can't be UTF-8. Of course, one might not be found, although the encoding isn't UTF-8.

Yes, and those bytes include all "upper-ASCII" as I've stated before. But that just makes them "illegal characters" while some of them may be part of a valid two-byte character. But the browser isn't told it's not UTF-8 and it isn't told what it is instead.