Lately, I’ve been getting more and more into Unicode and localization, so when a bug came in at work regarding Hebrew text in our web panel, I jumped at the opportunity to take a stab at it.
Unfortunately, the deeper I dig, the more I realize that handling it properly will require sweeping changes to our web panel codebase and practices. Or I could just fix it at a lower level, and everything above it will be happy.
The main problem I’m running into is the inconsistency in how Unicode characters and HTML special characters are encoded when they are sent as form data. Take the ampersand (&), for example. When sent in a form by itself, it always comes across literally as the ampersand character. Less-than and greater-than characters are the same: they come across literally. The problem arises when you start sending non-ASCII characters. The following copy/pasted text is sent across as the HTML numeric entities that represent its Unicode code points:
היישוב היהודי בחברון הוכרז שטח צבאי סגור
(&#1492;&#1497;&#1497;&#1513;&#1493;&#1489;, etc.)
This is all fine and dandy. The way we have normally been doing things is to take whatever comes in and store it verbatim. We sanitize on the display end in most cases, using HTML::Entities to convert &, <, and > characters to their HTML entity counterparts (&amp;, &lt;, and &gt;). However, you’ll notice that there are & characters in the text being sent to us, so now the browser is encoding it, and we’re encoding it again! So it displays improperly for users (showing the literal &#1234; instead of the character they expected), and it is also stored incorrectly in our database (which matters for non-HTML usage, like automated emails). However, if we simply decode the *entire* string, then we accidentally decode any literal &amp;, &lt;, etc. that people typed in as form input, and when we encode it again later, it doesn’t come back the way they entered it. In either situation, all or nothing, we munge information. At least when we do *no* decoding, we don’t *lose* information, but it’s not correct, either.
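To make the double-encoding concrete, here is a minimal sketch. The `escape_html` function is a hypothetical stand-in for our display-side sanitizing (it is not HTML::Entities itself, just the three replacements described above), and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the display-side sanitizing step:
# escape only the three characters mentioned above.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first, or it would re-escape the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# A literal ampersand typed by the user, followed by a Hebrew letter
# that the browser already turned into a numeric entity.
my $submitted = 'A & B &#1492;';

# The browser's entity gets escaped a second time.
print escape_html($submitted), "\n";   # A &amp; B &amp;#1492;
```

The browser then renders `&amp;#1492;` as the literal text `&#1492;` rather than as the Hebrew letter, which is the symptom from the bug report.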
The solution I’ve come up with is to decode only the numeric Unicode entities (&#1234;, etc.) and leave all of the other entities alone. This is something I can easily wedge into our lower-level POST/GET processing and make transparent to the developer, but I’m unsure if it’s the correct approach, or if it will even work. I’m definitely going to need to play around with it some more before I make anything live, but I’ll be talking about my experiences with it all right here, on this very blog, for the world to see. Unfortunately, there don’t seem to be any pre-made Perl libraries to do this for me. HTML::Entities looks promising, but its decode method doesn’t seem to be able to take exceptions. I’ll probably end up stealing a bunch of code from it and piecing it together.
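As a rough sketch of that idea, and not what will necessarily go live, something like the following could work. The name `decode_numeric_entities` is hypothetical, the digit limits are my own assumption to keep out absurdly large values, and it uses only core Perl rather than anything borrowed from HTML::Entities.

```perl
use strict;
use warnings;

# Hypothetical helper: decode only numeric character references
# (decimal like &#1492; or hex like &#x5D4;) and leave named
# entities such as &amp; and &lt; untouched.
sub decode_numeric_entities {
    my ($text) = @_;
    # One pass over the string, so text produced by a replacement is
    # never scanned (and decoded) a second time.
    $text =~ s/&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');

# Made-up form input: named entities typed by the user, plus two
# Hebrew letters the browser sent as numeric entities.
my $submitted = 'Tom &amp; Jerry &lt;3 &#1492;&#1497;';
print decode_numeric_entities($submitted), "\n";
```

One caveat with this approach: if a user literally types `&#1492;` into the form, it arrives looking exactly like what the browser generates, so it gets decoded too. I don’t see a way to tell the two apart on the server side.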
As a quick side note: WordPress (at least as of the version I’m running, which is 2.7-rc1) handles this all excellently. Not entirely sure if it’s just because of the ‘visual’ editor, or if it handles it properly on the server side, or what, but I’ll definitely be looking more into this as well.
HTML Entities and form handling.. fun times!