
Check out the beta of the new RI Nexus site, full of news and resources about information technology and digital media in Rhode Island.

I wrote some new Drupal modules to support it. Feedback and suggestions are welcome!


We have a new location and a guest speaker for our April meetup. Nate Abele of the CakePHP project will be here to show off the rapid web development framework and answer questions. We'll make time for discussion too.

The new location is a really nice conference room at the Johnson & Wales Academic Center, with everything that implies (i.e. a projector).

All programming skill levels welcome. If you're going to be in the Providence, RI area and can make it, please see here for more details and to RSVP:

Providence PHP April Meetup [meetup.com]


Lots of geeky news. I took over officially as organizer of the Providence PHP meetup this month, and our next event is at 729 Hope St on Tuesday, February 6 at 7 PM. So join us for coffee, pastry, a wide-ranging, informal discussion of anything related to programming with PHP, or all three.

This time, we’ll probably share some of the projects we’re working on, so bring some screenshots or a quick demo if you’d like. (If this starts to run long, we can always go into more depth next month.)

Please RSVP.


If your usual lunch crowd doesn’t talk enough about computers for your taste, escape with us to the monthly Providence Web Developers Lunch Hour. (Chris, the usual organizer, won’t be able to make it, so I’m hosting in his place.) Please RSVP here.


If you’re interested in web development topics, work in the Providence area, and eat food, you’d be perfect for the lunch hour meetup event happening downtown today (Thursday, Nov. 9) at noon. I’ll be filling in for the regular organizer.

For more information and to RSVP, please visit the event’s Meetup page.


Why does Windows still suck?

The most surprising figure in the article—that 91% of PCs are infected (or maybe the word should be “infested”)—sounds high, but gets some anecdotal support in comments in Brent Simmons’ weblog.


Broken Windows: With viruses, worms, and vulnerabilities in the news, John Gruber wrote an excellent piece. “Here’s a billion-dollar question: Why are Windows users besieged by security exploits, but Mac users are not?”

And, like clockwork, here comes the latest Windows vulnerability:

Internet Explorer Carved Up By Zero-Day Hole:

“Two new vulnerabilities have been discovered in Internet Explorer which allow a complete bypass of security and provide system access to a computer, including the installation of files on someone’s hard disk without their knowledge, through a single click.

Worse, the holes have been discovered from analysis of an existing link on the Internet and a fully functional demonstration of the exploit have been produced and been shown to affect even fully patched versions of Explorer.

It has been rated ‘extremely critical’ by security company Secunia, and the only advice is to disable Active Scripting support for all but trusted websites.”

The article goes on to say that the code exploits three holes in Internet Explorer for Windows, including one that has been known since August 2003, and there’s no patch available for any of them. (You could turn off Active Scripting, which breaks functionality on many sites, or stop browsing web sites you don’t trust completely. If that’s not acceptable, you have to switch to another browser such as Mozilla, or switch to a Mac.)


Macintouch has some interesting commentary on anti-counterfeiting measures that Adobe quietly slipped into Photoshop CS. The program now detects images containing currency and prevents you from working with them, even though doing so is perfectly legal, as long as you don’t then make a printout that’s double-sided or very close in size to the original.

[Tim Wright] It would be fairly easy to create other documents which would mistrigger this pattern [described in eurion.pdf].

Now the cat is out of the bag, I fully expect this to start appearing on magazine page backgrounds, books, any documents considered "sensitive", grocery coupons, etc, which will rapidly render colour photocopiers pretty useless until they disable this feature.

For more amusement, why not put it onto t-shirts or baseball caps, which will neatly prevent people from printing (or editing) photos of you? I’m sure more inventive people will be able to think of plenty of other uses, like car decorations, wallpaper, badges and so on...


"Be conservative in what you do, be liberal in what you accept from others."

This law is making the rounds again, with arguments both pro and con. Here are my thoughts.

Postel’s Law is a great, useful principle for writing programs that communicate. However, the law is so elegant and successful that it’s easy to regard it as an absolute. And then, because be liberal in what you accept is such an open-ended goal, people go too far. Here’s an analysis of the problem, followed by a suggestion.

The first half of Postel’s Law, be conservative in what you transmit, is a well-specified rule with a clearly defined goal. The tools to achieve it are specs and validators. But the vague goal of the other half, to be liberal in what you accept, can turn into a bottomless hole. There’s hardly any limit to how loose an interpretation of the spec can get, how cleverly the code can guess at the sender’s intent, and how much code for special cases you can write to fix invalid data. Because such code can provide an immediate user benefit and a market advantage, it turns into an arms race. Often, the code ends up violating the spec itself, intentionally or unintentionally, as we’ll see below.

The Growing Hole

Plenty has been written praising be liberal in what you accept. So I won’t repeat it. Here are some of the problems:

It enlarges the spec. Every additional error condition fixed by a market leader becomes an (undocumented) part of the spec. Senders come to rely on it. The senders probably don’t even realize that their output is wrong because of the way software is written.

In the edit-run-debug cycle of the typical software development process, testing is often done just by trying the program out, not through any mathematical process or formal validation suite. HTML authoring tends to be done the same way. Though modern XP [Extreme Programming] test-first practices call for a thorough suite of test cases to be written before the actual code, most software still doesn’t have this advantage. HTML is an easy case for validation, with scores of easily accessible validators already written, much easier to test than most program code, yet the bulk of new pages in the world have probably never been through an HTML validator.

The problem is that, even after removing all obvious bugs, the product of this run-test-debug cycle can only run at the "seems-to-work" level. There’s no guarantee that it’s really working, and specifically no proof that the program or web page is being conservative in what it sends. If it’s a program that communicates with other types of programs, the developer will test it with real examples of those programs. So, when a developer writing program Z needs to interoperate with programs such as A and B, and A and B are silently fixing errors in the output of program Z, Z’s developer will declare the code "working" (because to all appearances, it is), and say "ship it!". And everything will be fine until an edge case comes along that program A or program B either can’t fix or interpret differently. Or until someone tries program Z with program C, which didn’t get the memo about all the particular types of errors that programs A and B fix. All this because Z had a latent bug, due to the second half of Postel’s Law, because:

It hides violations of the other half of Postel’s Law. In other words, by being more liberal on the receiver, it becomes more difficult to find bugs in the sender.

As an example, Microsoft Internet Explorer sports what some have called a "ridiculous tolerance for errors in HTML markup". Microsoft FrontPage has a well-known tendency to silently create invalid HTML markup. (One of the bugs: FrontPage 98 and 2000 will occasionally go through a valid page with spacer images and replace all of their alt="" attributes with the lone word alt, which is invalid HTML. A developer familiar with the SGML foundations of HTML might think the fix is to parse this as a boolean attribute, alt="alt", but IE and other browsers choose to interpret it as alt="".) Though I doubt that any such bugs are intentional, the tendencies of the two products feed on each other. If the developers of FrontPage were testing with a browser that flagged such errors, it’s likely that the bugs wouldn’t have made it to release.

The bind here is that Postel’s Law tries to make things work as often as possible for users, but people trying to test other programs are users too, and errors are also covered up for them. One way out of this would be some kind of Postel Kill Switch, a strict mode intended for interoperability testing. (Turning off the other half of the law at the same time, causing the program to send out data malformed in various ways, would be harder to switch on programmatically.) Though the strict mode might do some good, it has some drawbacks: it would require a different code path, making it prudent to test both modes; and even without the extra work that would entail, testers might not bother turning the feature on every time in the first place.

Market Forces

Even though it’s usually more work to be more liberal, developers with time or money on their hands will still do it. They are often motivated just to provide convenience for their users, but with competitors in the same market, it has a predictable effect:

It increases the cost of entry. Accepting everything is a greedy strategy. It rewards the incumbents, and makes more work for newcomers. Not only do the newcomers have to catch up with all the error-fixing logic that the market leaders have been writing since the beginning, they have to somehow figure out what all those error conditions are. They’re not in the spec, and it’s almost certain that they’re not publicly documented anywhere. Even if the types of errors to be fixed were known, the new programs would have to fix them exactly the same way as the old ones, even in the face of multiple overlapping errors or ambiguous edge cases. And in some cases, this may require disregarding the spec, deliberately misinterpreting a valid document to match an overzealous fix.

Safety

This leads to one of the most damning consequences:

It makes software unreliable. Even the safest-looking fix can have unexpected consequences once others depend on it. (Which they will, and, unless the fix was added purely on speculation, already do.)

For instance, if you’re writing an HTML parser, and you see a lone ampersand (technically illegal; it should be encoded as &amp;), the liberally accepting thing to do is to display an ampersand, just as if it had been encoded properly. Which is fine, at that moment. If the users knew what had happened, they would probably thank you for soldiering on through the rest of the document and not giving up right there. But in reality they don’t even know it happened, and as the years go by, they will keep turning out pages with unencoded ampersands. (It’s the testers-are-also-users problem again.) New high-end content management systems will be deployed without anyone working with the system even knowing that they’re entering raw HTML into some of the text fields, and that they have to be careful with ampersands (yes, this already happens). A validator may catch the problem if it happens to crop up on the page at the time it’s checked, but most likely, no one will notice until the unlucky day that someone writes a classified for an electric guitar setup saying "For Sale: guitar&amp $200." Then the amp will just mysteriously disappear on the post, putting a guitar and $200 on sale. (If you think the example is contrived, note that in another attempt to apply Postel’s Law, real-world browsers end up expanding the error domain even further: "guitars&amplifiers" will have three letters dropped out of it, because the first browsers judged that to be most likely what the author intended. However, if you added spaces around the punctuation, whole words would show up. This is the kind of bizarre behaviour that makes people distrust computers.)
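
If you want to see this behaviour without digging up a period browser, Python’s html.unescape follows the HTML 5 character-reference rules, including the legacy forms with no trailing semicolon that grew out of exactly this kind of guessing. A quick demonstration (mine, not anything from the original post):

    # html.unescape applies the liberal HTML 5 entity rules, so it
    # reproduces the disappearing-amp behaviour described above.
    from html import unescape

    print(unescape("For Sale: guitar&amp $200"))   # -> For Sale: guitar& $200
    print(unescape("guitars&amplifiers"))          # -> guitars&lifiers
    print(unescape("guitars & amplifiers"))        # -> unchanged; the spaces save it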

At its root, the ampersand problem is really just confusion over a weakly specified input format. (You can find similar examples on display in comment forums across the web, which often treat visitors to the spectacle of a web developer repeatedly trying to describe an HTML tag, only to have the tag itself disappear.) However, in this case being liberally accepting didn’t fix the problem; it just made its symptoms more rare, and therefore the real problem harder to find, more capricious, and more puzzling.

In an effort to do the right thing, some programs intentionally go against the spec. Internet Explorer (and therefore Outlook, when opening HTML mail) will disbelieve the content type specified by the web server, and choose a different type itself based on heuristics, a behaviour which is even documented. An XHTML document might not be rendered if it starts with a comment that’s too long, or a plain text file might be parsed as HTML because it contained a tag-like sequence of characters. The HTTP spec specifically forbids browsers to second-guess the content type provided by the server, but IE does it anyway. This makes IE compatible with many badly-configured web servers. It also frustrates the owners of well-configured web servers for whom IE always guesses wrongly.
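
To make the hazard concrete, here is a toy content sniffer in Python. It is emphatically not IE’s actual algorithm, just an illustration of why any heuristic of this shape overrides a correct server configuration as readily as a broken one:

    def sniffed_type(declared_type, body):
        """Toy sniffer: trusts its own guess over the server's Content-Type header."""
        head = body[:512].lower()
        if b"<html" in head or b"<body" in head:
            return "text/html"          # second-guesses even a correct text/plain
        return declared_type

    print(sniffed_type("text/plain", b"Notes that happen to mention the <body> tag."))
    # prints "text/html": the well-configured server loses the argument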

In certain cases, outright bugs in complex code designed to tolerate many errors have the ironic effect of limiting the spec. For example, RSS is based on XML, but because of the existence of RSS feeds with invalid XML, liberal RSS parsers can’t be based on real XML parsers, even though real XML parsers are thoroughly tested and widely deployed. Instead, the developers have to roll their own quasi-XML parsers (increasing the barrier to entry). The chance of getting some part of the XML spec wrong is high (making the software unreliable). This in turn has made feed developers reluctant at various times to begin using any XML features that don’t already appear in the most common feeds, such as CDATA blocks in the description element, namespaces, and XML comments, because they might break regexp-based parsers. (Mark Pilgrim’s Ultra Liberal Feed Parser is a solution for Python programmers, and while it gets everything right as far as I know, it still doesn’t much help developers in other languages.)

In this example, XML is special, because the XML spec itself violates Postel’s Law. It calls for clients to terminate parsing entirely when they encounter malformed content. While it may have been better if this decision hadn’t been made, that’s the current reality of XML parsers. Replacing them all with less flighty ones would be nice. (Any takers?)

Security

Finally, security. A whole class of security vulnerabilities results from automatically fixing errors in input data. Because the set of errors to be fixed is ill-defined, software downstream can take a radically different action than what the software upstream thought possible. Malicious users can exploit this.

Think of the difficulty just of reliably filtering out dangerous HTML tags and attributes from a comment left on a web site. The browser is working as hard as possible to be liberal in its definition of an HTML tag, working by unknown rules to fix almost-tags. Can the author of such a filter ever be truly certain that nothing gets through? (Thinking about this, the only sure way around it without writing an entire HTML validator would be to fully parse the HTML input into an intermediate HTML-free representation, then write it back out as guaranteed-valid HTML code. The only thing left to worry about: an overzealous fix that would cause the valid code to be misinterpreted.)
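
For concreteness, here is a minimal sketch of that parse-and-reserialize idea in Python (my own illustration, with an assumed tag whitelist, not a production filter): the comment is parsed once, reduced to plain text plus a few known tags with their attributes dropped, and written back out as fully escaped HTML, so nothing a browser might want to "fix" ever reaches the page.

    from html import escape
    from html.parser import HTMLParser

    ALLOWED = {"b", "i", "em", "strong", "p", "br"}   # assumed whitelist
    SKIP = {"script", "style"}                        # drop these and their contents

    class CommentFilter(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.out = []
            self.skipping = 0

        def handle_starttag(self, tag, attrs):
            if tag in SKIP:
                self.skipping += 1
            elif tag in ALLOWED and not self.skipping:
                self.out.append("<%s>" % tag)         # keep the tag, drop every attribute

        def handle_endtag(self, tag):
            if tag in SKIP and self.skipping:
                self.skipping -= 1
            elif tag in ALLOWED and not self.skipping:
                self.out.append("</%s>" % tag)

        def handle_data(self, data):
            if not self.skipping:
                self.out.append(escape(data))         # re-escape all text, ampersands included

    def clean_comment(raw):
        f = CommentFilter()
        f.feed(raw)
        f.close()
        return "".join(f.out)

    print(clean_comment('<b onclick="evil()">Nice post</b> & <script>bad()</script>'))
    # prints "<b>Nice post</b> &amp; " (attributes and script gone, text re-escaped)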

The rule: Arbitrary fixes to bad input data will thwart any previous filtering or security checking of the data.

What to do?

A Suggestion

Future specs could require implementations to report whenever they encounter and correct errors, with an interface that could be as simple and non-intrusive as an exclamation point icon. (A newsreader, for instance, would place it next to a suspect newsfeed and link it to the Feed Validator.) There’s nothing particularly new about this kind of interface; several products, such as Opera, already do something similar. The trick would be that it would be required by the spec. The market leaders would be compelled to adopt it, not just the smaller products.

This behavior wouldn’t hamper a program’s ability to accept liberally; it would just let testers and other interested users know that the data had not been sent conservatively. It would thus remove the conflict of interest between the two parts of the law. The feature would be on by default, so testers wouldn’t need to activate it, but it wouldn’t be so annoying that users wanted it off (as a modal alert box would be).
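
In code, the requirement might look something like this rough sketch (the function name and the particular repairs are mine, not from any spec): the parser stays as liberal as ever, but every silent repair is recorded, and the caller decides how quietly to surface it.

    import xml.etree.ElementTree as ET

    def parse_feed_liberally(raw_bytes):
        """Accept liberally, but record every repair so the UI can warn about it."""
        corrections = []
        data = raw_bytes

        if b"& " in data:                     # bare ampersands are invalid XML
            data = data.replace(b"& ", b"&amp; ")
            corrections.append("escaped bare ampersands")

        feed = ET.fromstring(data)            # the repaired feed should now parse strictly
        return feed, corrections

    # The newsreader's UI layer (show_warning_icon is hypothetical) then does:
    #     feed, corrections = parse_feed_liberally(data)
    #     if corrections:
    #         show_warning_icon(feed, details=corrections)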

This doesn’t mean that each implementation has to have a full-fledged validator aboard. Only errors detectable by reasonably straightforward means and cases where the implementation goes to extra lengths to make sense of the input would have to be flagged. That does give implementations some wiggle room.

It’s important that this minor error display mechanism be required in order to comply with the spec. It can’t be voluntary on the part of the implementors. There’s nothing in it for them, at least not directly. To record the error as it’s fixed and display the fact takes extra code, albeit not much. Considering that the benefit goes mainly to future implementors as well as users of less liberal implementations that don’t know how to handle the same error, implementors will tend not to write that code unless nudged.

And the developers can be nudged, even for specs without trademarks or an official logo program. Having the requirement enshrined in the spec at least provides some social pressure for implementors to comply.

And some other things that seem to make sense right now:

  • Developers should also take great care to hold back and not misinterpret technically valid input in an attempt to do the right thing. Internet Explorer’s habit of second-guessing the Content-Type header is the kind of thing to avoid.
  • By the same token, to provide tolerant XML parsing, use a real standards-compliant XML parser first, and fall back to a handcoded quasi-XML parser only when that fails; a sketch of this arrangement follows the list. (Or, if you can absolutely guarantee that the result will be identical, use the quasi-XML parser alone, but that guarantee is hard to make.)
  • To avoid unintentionally thwarting security filters, all heroic fixes to input should be made as far upstream in the call chain as possible. If there’s still a danger the downstream code will try to outsmart the upstream code, the upstream code could rewrite the input to be canonical and unmisinterpretable.
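
Here is a sketch of that fallback arrangement (my own illustration; the element name and the regexp are only examples): well-formed feeds go through the real, well-tested XML parser, and the hand-rolled extraction only runs when strict parsing genuinely fails.

    import re
    import xml.etree.ElementTree as ET

    def titles_from_feed(data):
        """data: the raw feed as bytes."""
        try:
            root = ET.fromstring(data)        # strict, standards-compliant parse first
            return [t.text or "" for t in root.iter("title")]
        except ET.ParseError:
            # Last resort for malformed feeds: the crude, regexp-based
            # quasi-XML parsing described above, demoted to a fallback.
            return [t.decode("utf-8", "replace")
                    for t in re.findall(rb"<title>(.*?)</title>", data, re.DOTALL)]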

Some good quotes here from a recent speech.

