Archive for the ‘nut and bolts’ Category

Automatically identifying (human) languages with code

Monday, June 8th, 2009

For a recent project, we needed to automatically tell the difference between text that was written in French, German and English inside Word documents.

The simplest way of doing this is by checking the language attribute that’s been set on the style inside Word; unfortunately, very few Word users use the language value for styles correctly, or even use styles at all.

So, since we couldn’t trust the styles, we needed a mechanism that worked on the text alone. The first thing we tried was to identify some characteristic French, English and German words (like “des”, “und”, “für”, “and”), and check the text to see if it contained those words. Whichever language’s distinctive words appear most often in a text is likely to be the language of that text.
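As a rough illustration of the word-counting idea, here is a minimal Python sketch. It is not the code we used: the word lists are hypothetical (only “des”, “und”, “für” and “and” come from the description above), and the function name `guess_language` is made up for this example.

```python
import re
from collections import Counter

# Hypothetical marker-word lists, chosen for illustration only.
# Only "des", "und", "für" and "and" are taken from the post itself.
MARKER_WORDS = {
    "french": {"des", "les", "est", "une"},
    "german": {"und", "für", "der", "nicht"},
    "english": {"and", "the", "of", "is"},
}


def guess_language(text):
    """Return the language whose marker words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for word in words:
        for language, markers in MARKER_WORDS.items():
            if word in markers:
                counts[language] += 1
    if not counts:
        # None of the marker words appeared, so we cannot decide.
        return None
    return counts.most_common(1)[0][0]
```

The `None` case is the weakness of this approach: a short text may contain none of the marker words at all.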

This worked well, but we couldn’t be sure that the words would always appear in the text we were analyzing. So, we switched to an n-gram approach, as described in the thesis Evaluation of Language Identification Methods. This works by creating a “fingerprint” for the text, based on the occurrence of bigrams (“un”, “an”) and trigrams (“und”). It then compares this fingerprint to standard fingerprints for the various languages to find the one that it most resembles.

This gives better results when there is not much text available for analysis.
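To make the fingerprint idea concrete, here is a small Python sketch of one common way to do it: rank the most frequent bigrams and trigrams, then compare rankings with an “out-of-place” distance. This is an assumption about the general technique, not a port of our Scala utility, and the reference sentences, the `top` cut-off and all the function names are hypothetical. Real language fingerprints need to be built from much larger samples of text.

```python
from collections import Counter


def fingerprint(text, top=300):
    """Return the most frequent bigrams and trigrams, most frequent first."""
    # Normalise case and whitespace; spaces are kept so that n-grams
    # can capture the starts and ends of words.
    cleaned = " ".join(text.lower().split())
    counts = Counter()
    for n in (2, 3):
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def distance(doc_fp, lang_fp):
    """Sum of rank differences, with a fixed penalty for unknown n-grams."""
    positions = {gram: rank for rank, gram in enumerate(lang_fp)}
    penalty = len(lang_fp)
    return sum(
        abs(rank - positions[gram]) if gram in positions else penalty
        for rank, gram in enumerate(doc_fp)
    )


def identify(text, profiles):
    """Return the language whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(profiles, key=lambda lang: distance(doc_fp, profiles[lang]))


# Toy reference texts (hypothetical); far too short for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and runs away",
    "german": "der schnelle braune fuchs springt über den faulen hund und läuft weg",
    "french": "le renard brun rapide saute par-dessus le chien paresseux et s'enfuit",
}
PROFILES = {lang: fingerprint(text) for lang, text in SAMPLES.items()}
```

Calling `identify(some_text, PROFILES)` returns the key of the closest profile. Because every text produces some n-grams, this always gives an answer, even when none of the characteristic words are present.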

We’ve released the source code for this utility under an Open Source license. It’s written in Scala, an object-functional language that compiles to Java-compatible bytecode.

Increasing accessibility of radio buttons and checkboxes on forms

Saturday, February 21st, 2009

If you’ve ever tried to use the keyboard to navigate around a form on a webpage, you may have noticed that it’s often very hard to see which form item currently has the focus. With most form elements, this isn’t too hard to fix - you can add a border around the currently focused textbox with CSS, for example. Radio buttons and checkboxes, however, are much harder to highlight visibly.

Following on from some accessibility work that we’ve been doing for a client, we’ve developed a jQuery plugin that helps fix this problem, and helps make web forms more accessible. We’ve released this as Open Source, and we’ve called it the JQuery labelFocus plugin.

Microsoft takes on GWT

Thursday, December 6th, 2007

It’s been a problem for a while that developers of web applications need to use a language like JavaScript on the web client, and another language like Java or C# or Python on the server. One popular attempt to fix this is Google’s GWT, and there have been other less mainstream options like ParenScript for Lisp and Links.

Now, Microsoft is launching another contender in the same space: Volta.

The announcement post is somewhat obscure, but Volta is essentially a beta version of a GWT competitor for .NET. You use annotations to mark chunks of code to be run on the client side or the server side; the client-side code is compiled behind the scenes to JavaScript and deployed. There’s a debugger and profiler for the client-side code too.

An interesting feature is that it works on MSIL (the .NET bytecode) rather than on the language syntax (as GWT does). Therefore, you should be able to use the more functional .NET languages with it - F#, for instance, is an ML implementation for .NET that appears well supported by MS. For that matter, C# 3 is already among the most functional mainstream languages.

The beta version is only available for Visual Studio 2008, which you can currently get if you have an MSDN subscription, but which is not yet available for purchase.

Setting up a website using WordPress

Friday, August 17th, 2007

I’ve set the 67 Bricks website up using WordPress as a content management system. Previously, I’ve used either simple static HTML pages, or a traditional, full-featured CMS. I decided to use WordPress here because it is easy to use, and has an ecology of templates written for it making site design much simpler. It also describes itself as a “Semantic Publishing Platform”, and for a knowledge-management company such as our own, semantic publishing is important.

Setting up WordPress requires very little technical knowledge: my ISP, Heart Internet, allows you to do it with a few clicks in their script library, but even without that, it’s still simple. The changes that I’ve made to make it work as a CMS are:

  • Installing the Navigo plugin, which makes it easy to create a menu
  • Installing Search Everything, so pages can be searched as well as posts
  • Installing WP Last Posts, to put the text of the last few posts on the home page

Then, I created the bulk of the site using WordPress “pages”, and the news items as “posts”. I set the front page to be a static page under the Options | Reading menu.

To create a design for the site, I looked through the library of templates on the WordPress site, downloaded four or five that looked interesting, and tried them out locally. Having found a template with a good layout, clean HTML code, and a reasonable license, I then customized it to remove the sections I didn’t like, and to change the graphics - a much simpler process than building a web site design from scratch.

That’s really all there is to it.

So, what did I learn?

Setting up a website with WordPress is very quick. Installation of WordPress took a few minutes, installing plugins took maybe a quarter of an hour, choosing and customizing a theme took about an hour, and the rest of the time was spent creating content.

The WordPress way of separating content from presentation using PHP functions works well. This is more powerful than the pure CSS approach of CSS Zen Garden, since it allows different templates to display different content. An alternative would be to use XQuery, which would probably be more versatile, but XQuery hasn’t been widely taken up by the PHP developer community, which seems generally sceptical of XML.

(thanks to twenty3design for help with choosing WordPress plugins)