eBay, Misspellings and Google Meta Tags
Misspellings are far more common on the internet than most people think. They result from typographical errors and from typing English words as they are pronounced rather than as they are spelled (something that can be compounded by regional accents). Then there are the differences between the international variants of English, and confusions between homophones (words that sound the same, such as dire and dyer). Some errors arise from hitting a neighbouring key (where wealth becomes wealyh); others arise because certain sounds in English are similar (b and v, for example, or even r and l). Still others come from character transposition (herd becoming hred), from characters being dropped (search becoming serch) and from characters being added (passed becoming passsed).
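The mechanical error types described above can be sketched in a few lines. This is only an illustration in Python, not the Perl code behind the Celtnet tools; the function names and the two small lookup tables (`NEIGHBOUR_KEYS` and `SIMILAR_SOUNDS`) are assumptions made up for the example, and a real tool would need far fuller tables.

```python
# Hypothetical, deliberately partial tables: a real tool would cover
# the whole keyboard and many more sound confusions.
NEIGHBOUR_KEYS = {"t": "ry", "e": "wr", "a": "qsz", "s": "ad", "h": "gj"}
SIMILAR_SOUNDS = {"b": "v", "v": "b", "r": "l", "l": "r"}

def substitutions(word, table):
    """Replace one character at a time using a confusion table."""
    out = set()
    for i, ch in enumerate(word):
        for alt in table.get(ch, ""):
            out.add(word[:i] + alt + word[i + 1:])
    return out

def transpositions(word):
    """Swap each pair of adjacent, non-identical characters."""
    out = set()
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return out

def deletions(word):
    """Drop one character at a time."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def doubled_letters(word):
    """Add a character by repeating each letter in turn."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}

def misspellings(word):
    """Union of all the error types above, excluding the word itself."""
    variants = (
        substitutions(word, NEIGHBOUR_KEYS)
        | substitutions(word, SIMILAR_SOUNDS)
        | transpositions(word)
        | deletions(word)
        | doubled_letters(word)
    )
    variants.discard(word)
    return variants
```

Calling `misspellings("herd")` with these tables yields variants such as hred (transposition), hrd (dropped character) and held (r/l confusion). Homophones and regional spelling differences are not covered here, since they need word lists rather than character-level rules.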
Many of the common misspellings can be derived from the rules of language (which is where being an etymologist helps). Others can be derived from lists of, for example, the differences between American and British English. Such lists are freely available (if rather hard to find). Using these, it's quite possible to produce an application that derives the errors 'on the fly'. Indeed, it is this application that drives the Celtnet eBay Misspelling Tool. The same code forms the basis for the 'on the fly' calculations performed in the Celtnet MisspellingSearch tool, which finds misspellings of words and short phrases. That tool can be used to create misspellings of a site's keywords, and it outputs a pre-formatted <META> tag to plug straight into your website.
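To give a rough idea of what producing a pre-formatted keywords tag involves, here is a minimal sketch in Python. It is not the MisspellingSearch code: `keywords_meta_tag`, `drop_one_letter` and the `limit` parameter are hypothetical names, and the stand-in generator only drops single letters.

```python
from html import escape

def drop_one_letter(word):
    """Stand-in misspelling generator: drop each character in turn."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def keywords_meta_tag(keywords, misspell=drop_one_letter, limit=20):
    """Build a keywords meta tag from keywords plus their misspellings."""
    terms = []
    for keyword in keywords:
        terms.append(keyword)
        terms.extend(sorted(misspell(keyword)))
    # Remove duplicates while keeping the original order.
    unique = list(dict.fromkeys(terms))
    content = ", ".join(unique[:limit])
    return '<meta name="keywords" content="%s">' % escape(content, quote=True)

print(keywords_meta_tag(["celt"]))
# prints: <meta name="keywords" content="celt, cel, cet, clt, elt">
```

The `limit` cap is an assumption for the example; it simply keeps the tag from growing without bound when a keyword has many variants.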
Beyond the basic algorithm (implemented in Perl, with a PHP interface) comes the hard slog. There are a number of misspellings 'in the wild' that essentially defy the standard linguistic laws (in the main these are caused by local variants in spoken English or by mistypings due to right-hand and left-hand dominance). The only way to capture these, and to create rules for generating them, is to search for them. This necessitates writing a spider that looks for web pages containing significant paragraphs of English text (and requires that the pages crawled be 'balanced' between technical pages and pages written in formal and colloquial English). Once the pages are brought back and the HTML tags have been stripped, it's a question of passing the pages through a spell-checker (I'm using Aspell, which has a Perl interface and handles international English variants) so that the misspellings can be identified. Each misspelled word is placed in one of two files: one where the intended word is obvious and one where there are several possible interpretations of the misspelling. The first is amenable to automated analysis whilst the second needs manual intervention. Over many weeks this builds up a view of the 'real' misspellings on the internet. Some of these misspellings are fed to a database whilst others are built into the Perl script that auto-generates misspellings. (The database I've constructed currently has 1,356,781 words with almost 2 billion misspellings, and it's almost 100 MB in size.)
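The strip, check and sort steps described above might look something like the following Python sketch. It is an illustration under several assumptions rather than the actual pipeline: a plain set of known words stands in for Aspell, candidate corrections are found by a simple one-edit search, and the two 'files' are returned as dictionaries. All the names (`TextExtractor`, `strip_tags`, `single_edits`, `classify`) are hypothetical.

```python
import re
from html.parser import HTMLParser

LETTERS = "abcdefghijklmnopqrstuvwxyz"

class TextExtractor(HTMLParser):
    """Collect page text, ignoring tags and script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def single_edits(word):
    """Every string one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in LETTERS}
    inserts = {a + c + b for a, b in splits for c in LETTERS}
    return deletes | swaps | replaces | inserts

def classify(html, dictionary, min_length=3):
    """Sort unknown words into obvious and ambiguous misspellings."""
    obvious, ambiguous = {}, {}
    for word in re.findall(r"[a-z]+", strip_tags(html).lower()):
        if len(word) < min_length or word in dictionary:
            continue
        candidates = sorted(single_edits(word) & dictionary)
        if len(candidates) == 1:
            obvious[word] = candidates[0]    # suitable for automated analysis
        elif candidates:
            ambiguous[word] = candidates     # needs a manual look
    return obvious, ambiguous
```

With a toy dictionary of `{"search", "herd", "hard", "the", "for"}`, the page `<p>Serch for the <b>hrd</b></p>` gives serch as an obvious misspelling of search, while hrd lands in the ambiguous group because it could be herd or hard. Unknown words with no nearby dictionary entry (names, jargon) are simply skipped in this sketch.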
I've now reached the 'diminishing returns' stage for these analyses, as it's taking a trawl of almost 30 sites to find a new misspelling (though the other sites are providing useful data on misspelling frequency). As a result, it's time to make the various misspelling systems available for others to use. There will probably be updates over the next few weeks, but these will be incremental improvements rather than major revolutions in functionality.
If you want to give these systems a whirl (and they're completely free to use) you can find them here:
Celtnet eBay Misspelling Tool: finds misspelled eBay listings on the eBay sites for the US, UK, Australia, India and Ireland.
Celtnet MisspellingSearch: a generalized misspelling tool to aid with website meta-tag optimization.