Here’s a way to back up iPhoto’s image comments into an easy-to-read flat directory structure. (Translation: one big folder.) You’d want to do this when archiving your photos to CD or DVD, or when trying to merge photo libraries, or when leaving iPhoto for another program, or at any other time you want your comments saved in a non-proprietary, easily readable format.
As you may have read last week, when I upgraded to iPhoto 4, all the image descriptions temporarily disappeared from my online photo albums. (I caught the problem on my own staging server before it appeared on this site.) The culprit was a change in the way iPhoto stores photo comments. Comments are now entirely gone from the easy-to-parse AlbumData.xml file; iPhoto now stores them in a binary format that appears to be proprietary.
AppleScript to the rescue. Last week’s script saved the comments to text files and generated a directory structure that exactly paralleled iPhoto’s library, with one text file for each comment. These files were in folders for each day, which were in turn inside folders for each month, etc., guaranteeing there would be no name conflicts. I had rejected using the internal ID of each picture (which would have allowed a flat conflict-free directory structure) because the ID wasn’t user-visible anywhere in the iPhoto interface, making comment files named for the ID difficult to map back to the original pictures.
One of the comments on that post asked for a version that generated the comment files in one folder, based on the image’s filename. That was a good idea. Though the filename is not guaranteed to be unique, it often is in practice. Most digital cameras save unique serial numbers for each picture as part of the filename. So this is enough for most people. (The exceptions would be if you have more than one digital camera using a similar naming convention, or if your camera is configured to reset its numbering between rolls.)
If you like guaranteed accuracy, use my original script; if you like simplicity, use the following alternate script. If photo filenames are duplicated, it will save only one of the conflicting comments. Dropping the parallel folder structure also simplified the script, since this version doesn’t need to employ any POSIX path manipulation.
Copy the following into Script Editor and run. Tested with iPhoto 4.0 on Mac OS X 10.3. (It may also work with earlier versions; drop me a comment below if you’ve tried it.)
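For the curious, the naming scheme boils down to something like this rough Python sketch. It isn’t the AppleScript itself, and it assumes you’ve already pulled each image’s path and comment out of iPhoto by some other means; it just illustrates the flat layout: one comment file per image filename, with a later duplicate overwriting an earlier one.

import os

def saveCommentsFlat(photoComments, outputDir, suffix=".comment.txt"):
    # photoComments: iterable of (imagePath, comment) pairs gathered elsewhere
    # (e.g. by the AppleScript described above). Purely illustrative.
    if not os.path.isdir(outputDir):
        os.makedirs(outputDir)
    for imagePath, comment in photoComments:
        if not comment:
            continue
        # Name the comment file after the image's own filename, e.g.
        # "IMG_1234.JPG.comment.txt". Duplicate filenames collide here,
        # so only one of the conflicting comments survives.
        commentPath = os.path.join(outputDir, os.path.basename(imagePath) + suffix)
        f = open(commentPath, 'w')
        f.write(comment)
        f.close()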
I’m now using the WordPress 1.0 release to generate the content area of this weblog. (The headers, footers, site navigation, and subscription list are generated by ShearerSite.)
In many ways, it’s going from one extreme to the other. My own system is based on static rendering without a database, to the point that the original data itself is kept in RSS-compliant XML files on the site, and HTML files are generated from those. So there’s no programmatic server overhead for retrieval, but there is for authoring, since all the dependent pages have to be re-rendered on the spot. I’m still a fan of this type of system, but I wanted to try something different. WordPress is about as different as you can get: by default, it runs a battery of regular expressions--dozens upon dozens of them--over each post to format it at retrieval time. (Some kind of static caching may be on its way, though, judging from hints in the database schema.) The administration interface is mostly very good; it makes tasks such as adding new categories much easier than my homegrown config-file-based system did.
Pros of WordPress: very hackable (the good way, by the site owner); terrific setup routines; good navigation controls, easy to set up; well-rounded feature set.
Cons: frequently passes HTML through finicky regular expressions; too much use of addslashes() for my taste, including some double applications; a few bugs in 1.0 (though, to be fair, 1.0.1 final is imminent).
Some changes I made to my own copy include:
I bought the upgrade to Apple’s iLife suite, released on Friday. Here’s a gotcha for developers who parse iPhoto’s AlbumData.xml file, though it doesn’t directly affect most users. It affects me, because my own code parses AlbumData.xml to generate my web-based photo albums (such as the England trip pictures I just posted).
Though the overall format of iPhoto’s XML file stays the same (and my script had no trouble reading it), the Comments and Date fields are gone! Well, the Date field is really just renamed and stored in a different format, which is no problem to work around, because the image file’s embedded EXIF data contains the date as well. The missing Comments field is a different story.
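Before getting to the comments, a quick aside on the date workaround: pulling the date out of EXIF is straightforward. Here’s a minimal Python sketch, assuming the Python Imaging Library is available; the EXIF field of interest is DateTimeOriginal.

from PIL import Image
from PIL.ExifTags import TAGS

def exifDate(imagePath):
    # Return the DateTimeOriginal string from a JPEG's EXIF data,
    # e.g. "2004:01:17 12:34:56", or None if the tag isn't present.
    exif = Image.open(imagePath)._getexif() or {}
    for tag, value in exif.items():
        if TAGS.get(tag) == 'DateTimeOriginal':
            return value
    return None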
From my quick inspection, the comment data seems to be stored only in a newly introduced iPhoto.db file, which is in some binary format. The rationale for this is presumably performance, but that doesn’t completely make sense, since the photo title is still stored in the XML file and may be changed just as often.
In any case, here’s a workaround that uses AppleScript to write a parallel folder structure holding just the comments, one per text file. Paste the following into a Script Editor window and run. Use this anytime you’d like to protect your comments from the vagaries of software or platform transitions or upgrades. (The parallel folder structure helps with this; the script could have used iPhoto’s internal IDs and generated all the files in a single folder, but that wouldn’t have been as forward-compatible.) GPL-licensed.
import os

commentCommonBaseDir = os.path.expanduser("~/Pictures/")
commentOrigDir = os.path.join(commentCommonBaseDir, "iPhoto Library")
commentParallelDir = os.path.join(commentCommonBaseDir, "iPhoto Library - My Comments Cache")
commentFileSuffix = ".comment.txt"

def getCommentForFile(imagePath):
    # Given the path of an image inside the iPhoto Library, find the matching
    # text file in the parallel comments cache and return its contents
    # (or '' if no comment file exists).
    if not imagePath.lower().startswith(commentOrigDir.lower()):
        raise Exception(('Error: image does not appear to be in iPhoto Library; '
                         'cannot compute comment path. Image: "%s". Library: "%s".')
                        % (imagePath, commentOrigDir))
    commentPath = os.path.join(commentParallelDir,
                               imagePath[len(commentOrigDir) + 1:]) + commentFileSuffix
    if os.path.isfile(commentPath):
        print "Read comment for " + imagePath
        return open(commentPath, 'r').read()
    return ''
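For context, here’s a hypothetical snippet showing how album-generation code might call this helper; the example path is made up for illustration.

# Hypothetical usage of getCommentForFile(); the path is invented.
examplePath = os.path.join(commentOrigDir, "2004", "01", "17", "IMG_1234.JPG")
caption = getCommentForFile(examplePath)
if caption:
    print "Caption: " + caption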
Here are pictures of the scenery in Dartmouth taken during my trip to England over the New Year. Uploading has been slow due to the sudden death of my cable modem. Family pictures are coming next, and are semi-private: you’ll need to enter "family" as the username and my mother’s maiden name in lowercase as the password. The aunts, uncles, and cousins involved should have no problem figuring that one out.
Macintouch has some interesting commentary on anti-counterfeiting measures that Adobe quietly slipped into Photoshop CS. The program now detects images containing currency and prevents you from working with them, even though doing so is perfectly legal, as long as you don’t then make a printout that’s double-sided or very close in size to the original.
[Tim Wright] It would be fairly easy to create other documents which would mistrigger this pattern [described in eurion.pdf].
Now the cat is out of the bag, I fully expect this to start appearing on magazine page backgrounds, books, any documents considered "sensitive", grocery coupons, etc, which will rapidly render colour photocopiers pretty useless until they disable this feature.
For more amusement, why not put it onto t-shirts or baseball caps, which will neatly prevent people from printing (or editing) photos of you? I’m sure more inventive people will be able to think of plenty of other uses, like car decorations, wallpaper, badges and so on...
"Be conservative in what you do, be liberal in what you accept from others."
This law is making the rounds again, with arguments both pro and con. Here are my thoughts.
Postel’s Law is a great, useful principle for writing programs that communicate. However, the law is so elegant and successful that it’s easy to regard it as an absolute. And then, because be liberal in what you accept is such an open-ended goal, people go too far. Here’s an analysis of the problem, followed by a suggestion.
The first half of Postel’s Law, be conservative in what you transmit, is a well-specified rule with a clearly defined goal. The tools to achieve it are specs and validators. But the vague goal of the other half, to be liberal in what you accept, can turn into a bottomless hole. There’s hardly any limit to how loose an interpretation of the spec can get, how cleverly the code can guess at the sender’s intent, and how much code for special cases you can write to fix invalid data. Because such code can provide an immediate user benefit and a market advantage, it turns into an arms race. Often, the code ends up violating the spec itself, intentionally or unintentionally, which we’ll see below.
Plenty has been written praising be liberal in what you accept. So I won’t repeat it. Here are some of the problems:
It enlarges the spec. Every additional error condition fixed by a market leader becomes an (undocumented) part of the spec. Senders come to rely on it. The senders probably don’t even realize that their output is wrong because of the way software is written.
In the edit-run-debug cycle of the typical software development process, testing is often done just by trying the program out, not through any mathematical process or formal validation suite. HTML authoring tends to be done the same way. Though modern XP [Extreme Programming] test-first practices call for a thorough suite of test cases to be written before the actual code, most software still doesn’t have this advantage. HTML is an easy case for validation, with scores of easily accessible validators already written, much easier to test than most program code, yet the bulk of new pages in the world have probably never been through an HTML validator.
The problem is that, even after removing all obvious bugs, the product of this run-test-debug cycle can only run at the "seems-to-work" level. There’s no guarantee that it’s really working, and specifically no proof that the program or web page is being conservative in what it sends. If it’s a program that communicates with other types of programs, the developer will test it with real examples of those programs. So, when a developer writing program Z needs to interoperate with programs such as A and B, and A and B are silently fixing errors in the output of program Z, Z’s developer will declare the code "working" (because to all appearances, it is), and say "ship it!". And everything will be fine until an edge case comes along that program A or program B either can’t fix or interprets differently. Or until someone tries program Z with program C, which didn’t get the memo about all the particular types of errors that programs A and B fix. All this because Z had a latent bug, due to the second half of Postel’s Law, because:
It hides violations of the other half of Postel’s Law. In other words, by being more liberal on the receiver, it becomes more difficult to find bugs in the sender.
As an example, Microsoft Internet Explorer sports what some have called a "ridiculous tolerance for errors in HTML markup". Microsoft FrontPage has a well-known tendency to silently create invalid HTML markup. (One of the bugs: FrontPage 98 and 2000 will occasionally go through a valid page with spacer images and replace all of their alt="" attributes with the lone word alt, which is invalid HTML. A developer familiar with the SGML foundations of HTML might think the fix is to parse this as a boolean attribute, alt="alt", but IE and other browsers choose to interpret it as alt="".) Though I doubt that any such bugs are intentional, the tendencies of the two products feed on each other. If the developers of FrontPage were testing with a browser that flagged such errors, it’s likely that the bugs wouldn’t have made it to release.
The bind here is that Postel’s Law tries to make things work as often as possible for users, but people trying to test other programs are users too, and errors are also covered up for them. One way out of this would be some kind of Postel Kill Switch, a strict mode intended for interoperability testing. (Turning off the other half of the law at the same time, causing the program to send out data malformed in various ways, would be harder to switch on programmatically.) Though the strict mode might do some good, it has some drawbacks: it would require a different code path, making it prudent to test both modes; and even without the extra work that would entail, testers might not bother turning the feature on every time in the first place.
Even though it’s usually more work to be more liberal, developers with time or money on their hands will still do it. They are often motivated just to provide convenience for their users, but with competitors in the same market, it has a predictable effect:
It increases the cost of entry. Accepting everything is a greedy strategy. It rewards the incumbents, and makes more work for newcomers. Not only do the newcomers have to catch up with all the error-fixing logic that the market leaders have been writing since the beginning, they have to somehow figure out what all those error conditions are. They’re not in the spec, and it’s almost certain that they’re not publicly documented anywhere. Even if the types of errors to be fixed were known, the new programs would have to fix them exactly the same way as the old ones, even in the face of multiple overlapping errors or ambiguous edge cases. And in some cases, this may require disregarding the spec, deliberately misinterpreting a valid document to match an overzealous fix.
This leads to one of the most damning consequences:
It makes software unreliable. Even the safest-looking fix can have unexpected consequences once others depend on it. (Which they will, and, unless the fix was added purely on speculation, already do.)
For instance, if you’re writing an HTML parser, and you see a lone ampersand (technically illegal--it should be encoded as &amp;), the liberally accepting thing to do is to display an ampersand, just as if it had been encoded properly. Which is fine, at that moment. If the users knew what had happened, they would probably thank you for soldiering on through the rest of the document and not giving up right there. But in reality they don’t even know it happened, and as the years go by, they will keep turning out pages with unencoded ampersands. (It’s the testers-are-also-users problem again.) New high-end content management systems will be deployed without anyone working with the system even knowing that they’re entering raw HTML into some of the text fields, and that they have to be careful with ampersands (yes, this already happens). A validator may catch the problem if it happens to crop up on the page at the time it’s checked, but most likely, no one will notice until the unlucky day that someone writes a classified for an electric guitar setup saying "For Sale: guitar&amp $200." Then the amp will just mysteriously disappear on the post, putting a guitar and $200 on sale. (If you think the example is contrived, note that in another attempt to apply Postel’s Law, real-world browsers end up expanding the error domain even further: "guitars&amplifiers" will have three letters dropped out of it, displaying as "guitars&lifiers", because the first browsers judged that to be most likely what the author intended. However, if you added spaces around the punctuation, whole words would show up. This is the kind of bizarre behaviour that makes people distrust computers.)
At its root, the ampersand problem is really just confusion over a weakly specified input format. (You can find similar examples on display in comment forums across the web, which often treat visitors to the spectacle of a web developer repeatedly trying to describe an HTML tag, only to have the tag itself disappear.) However, in this case being liberally accepting didn’t fix the problem; it just made its symptoms more rare, and therefore the real problem harder to find, more capricious, and more puzzling.
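To see the mechanics in miniature, here’s a toy Python sketch that mimics the lenient entity handling described above. It illustrates the behaviour, not any particular browser’s actual parsing code.

import re

def liberalUnescape(text):
    # Treat "&amp" as "&" even without the closing semicolon, the way the
    # lenient browsers described above do; adjacent letters get swallowed.
    return re.sub(r'&amp;?', '&', text)

def strictUnescape(text):
    # Recognize only the correctly terminated entity.
    return text.replace('&amp;', '&')

print liberalUnescape("guitars&amplifiers")   # -> "guitars&lifiers"
print strictUnescape("guitars&amplifiers")    # -> "guitars&amplifiers", untouched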
In an effort to do the right thing, some programs intentionally go against the spec. Internet Explorer (and therefore Outlook, when opening HTML mail) will disbelieve the content type specified by the web server, and choose a different type itself based on heuristics, a behaviour which is even documented. An XHTML document might not be rendered if it starts with a comment that’s too long, or a plain text file might be parsed as HTML because it contained a tag-like sequence of characters. The HTTP spec specifically forbids browsers to second-guess the content type provided by the server, but IE does it anyway. This makes IE compatible with many badly-configured web servers. It also frustrates the owners of well-configured web servers for whom IE always guesses wrongly.
In certain cases, outright bugs in complex code designed to tolerate many errors have the ironic effect of limiting the spec. For example, RSS is based on XML, but because of the existence of RSS feeds with invalid XML, liberal RSS parsers can’t be based on real XML parsers, which are thoroughly tested and widely deployed. Instead, the developers have to roll their own quasi-XML parsers (increasing the barrier to entry), and the chance of getting some part of the XML spec wrong is high (making the software unreliable). This in turn has made feed developers reluctant at various times to begin using any XML features that don’t already appear in the most common feeds, such as CDATA blocks in the description element, namespaces, and XML comments, because they might break regexp-based parsers. (Mark Pilgrim’s Ultra Liberal Feed Parser is a solution for Python programmers, and while it gets everything right as far as I know, it still doesn’t much help developers in other languages.)
In this example, XML is special, because the XML spec itself violates Postel’s Law. It calls for clients to terminate parsing entirely when they encounter malformed content. While it may have been better if this decision hadn’t been made, that’s the current reality of XML parsers. Replacing them all with less flighty ones would be nice. (Any takers?)
Finally, security. A whole class of security vulnerabilities results from automatically fixing errors in input data. Because the set of errors to be fixed is ill-defined, software downstream can take a radically different action than what the software upstream thought possible. Malicious users can exploit this.
Think of the difficulty just of reliably filtering out dangerous HTML tags and attributes from a comment left on a web site. The browser is working as hard as possible to be liberal in its definition of an HTML tag, working by unknown rules to fix almost-tags. Can the author of such a filter ever be truly certain that nothing gets through? (Thinking about this, the only sure way around it without writing an entire HTML validator would be to fully parse the HTML input into an intermediate HTML-free representation, then write it back out as guaranteed-valid HTML code. The only thing left to worry about: an overzealous fix that would cause the valid code to be misinterpreted.)
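Here’s a rough sketch of that parse-then-re-emit idea in Python, using the standard library’s HTMLParser module. The tag whitelist is just an example, and this is an illustration of the approach rather than a hardened sanitizer.

from HTMLParser import HTMLParser
import cgi

ALLOWED_TAGS = set(['b', 'i', 'em', 'strong', 'p', 'br'])   # example whitelist

class CommentSanitizer(HTMLParser):
    # Parse untrusted HTML into events, then re-emit only whitelisted tags
    # (with all attributes dropped) and properly escaped text.
    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append('<%s>' % tag)
    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append('</%s>' % tag)
    def handle_data(self, data):
        self.out.append(cgi.escape(data))
    def handle_entityref(self, name):
        self.out.append('&%s;' % name)
    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

def sanitize(html):
    p = CommentSanitizer()
    p.feed(html)
    p.close()
    return ''.join(p.out)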
The rule: Arbitrary fixes to bad input data will thwart any previous filtering or security checking of the data.
What to do?
Future specs could require implementations to report whenever they encounter and correct errors, with an interface that could be as simple and non-intrusive as an exclamation point icon. (A newsreader, for instance, would place it next to a suspect newsfeed and link it to the Feed Validator.) There’s nothing particularly new about this kind of interface; several products, such as Opera, already do something similar. The trick is that it would be required by the spec. The market leaders would be compelled to adopt it, not just the smaller products.
This behavior wouldn’t hamper a program’s ability to accept liberally; it would just let testers and other interested users know that the data had not been sent conservatively. It would thus remove the conflict of interest between the two parts of the law. The feature would be on by default, so testers wouldn’t need to activate it, but it wouldn’t be so annoying that users wanted it off (as a modal alert box would be).
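To make the idea concrete, a newsreader’s fetch step might look something like this small, hypothetical Python sketch: accept the feed either way, but hand back warnings the interface can surface as that exclamation-point indicator. Nothing here comes from any real newsreader, and the fix-up shown is deliberately crude.

import re
from xml.dom.minidom import parseString

def parseFeedLiberally(data):
    # Returns (document, warnings). A non-empty warnings list is what would
    # drive the error indicator next to the subscription.
    warnings = []
    try:
        return parseString(data), warnings
    except Exception, err:
        warnings.append('Feed is not well-formed XML (%s); applying fix-ups.' % err)
        # Crude example fix-up: escape bare ampersands that aren't part of
        # an entity reference, then try again.
        fixed = re.sub(r'&(?!(?:[a-zA-Z]+|#[0-9]+);)', '&amp;', data)
        return parseString(fixed), warnings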
This doesn’t mean that each implementation has to have a full-fledged validator aboard. Only errors detectable by reasonably straightforward means, and cases where the implementation goes to extra lengths to make sense of the input, would have to be flagged. That does give implementations some wiggle room.
It’s important that this minor error display mechanism be required in order to comply with the spec. It can’t be voluntary on the part of the implementors. There’s nothing in it for them, at least not directly. To record the error as it’s fixed and display the fact takes extra code, albeit not much. Considering that the benefit goes mainly to future implementors, and to users of less liberal implementations that don’t know how to handle the same errors, implementors will tend not to write that code unless nudged.
And the developers can be nudged, even for specs without trademarks or an official logo program. Having the requirement enshrined in the spec at least provides some social pressure for implementors to comply.
And some other things that seem to make sense right now:
Got back recently from visiting family in the mild weather of Dartmouth, England, where it’s not cold, and it’s not warm, and where the camera flash usually goes off outside at noon. Some pictures to come later.
Random notes for Rhode Islanders:
90.3 FM (WRIU) makes a great change from 95.5 WBRU. Even after listening for more than a year, 90.3 still plays lots I haven’t heard before, and their unvarnished DJs are the polar opposites of the annoying ones that used to try way too hard at 99.7 "the edge" (now defunct). Unfortunately, being a small college station, they switch off completely during the summer.
I can’t quite say that you won’t catch them repeating anything, because one evening last week, they left the same CD shuffling in the player for hours. I heard the exact same songs (Erin McKeown, Distillation) on three separate short car trips over a two-and-a-half-hour period, starting at 6:30 PM. When I was listening at home after 9, a DJ came on and apologized for being a little late.
artinruins.com has lots of pictures and background info on the Providence buildings that will be torn down, or have already come down, as a result of the I-195 relocation project. I walk by Thurston Manufacturing (their link) and the former Providence Machine Company (their link, my photo) almost every weekday on the way to lunch.
I’m giving WordPress a spin, replacing my own experimental statically-generated weblog publishing tool. The homegrown system worked well, but I wanted to add more dynamic features such as comments and trackbacks, and there’s so much other work going on with weblogging tools that it wasn’t a good use of time to implement those myself.
So I made some changes to WordPress to make it fit my publishing system, all of which are to be contributed back to the project.