FIND SIMilar

Its uses

This findsim.com instance of my search engine searches music titles and their authors' names. Other instances search phone books and address lists. The engine is very well suited to phone book searches and to cleaning up faulty addresses and databases.

What it does

Guessing how a name is spelled after hearing it is often an impossible task. It is even more difficult when it was heard through a bad phone line, shouted through a noisy party, or read from some nearly indecipherable notes.

The same goes for any text, though names are typically the most difficult, especially if they are in an unfamiliar language.

How it does it

I made the search engine to solve this problem. It corrects probabilistically. It knows which letters sound similar, and how similar they actually sound. The same goes for letters that look similar, and for letters sitting close together on small phone keyboards. It also knows the kind of spelling mistakes dyslexics make, and the mixing up of word parts that we all do.

All this is used to calculate the probability P that each row in the database matches the search query. With time it will also learn, through statistics, to fix the spelling mistakes people make even better.
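As a rough illustration only, and not the engine's actual code, the idea of scoring each row by how likely it is to be the intended text can be sketched in Python as below. The confusion table, all the probability values, and the function names are made up for this example. The score is the log-probability of the single most likely letter-by-letter alignment, which is a simplification of a full probability P.

```python
import math

# Hypothetical probabilities that one letter is written when another
# was meant. All numbers here are invented for illustration.
CONFUSION = {
    ("c", "k"): 0.20,  # sound alike
    ("i", "y"): 0.15,  # sound alike
    ("m", "n"): 0.10,  # sound and look alike
}
P_MATCH = 0.90         # letter written correctly
P_OTHER_SUBST = 0.01   # any other wrong letter
P_INDEL = 0.02         # a missing or an extra letter


def subst_prob(a, b):
    """Probability of seeing letter a where letter b was meant."""
    if a == b:
        return P_MATCH
    return CONFUSION.get((a, b)) or CONFUSION.get((b, a)) or P_OTHER_SUBST


def match_log_prob(query, row):
    """Log-probability of the most likely alignment of query and row."""
    n, m = len(query), len(row)
    indel = math.log(P_INDEL)
    # best[i][j] = best log-probability aligning query[:i] with row[:j]
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # extra letter in the query
                best[i][j] = max(best[i][j], best[i - 1][j] + indel)
            if j > 0:  # letter missing from the query
                best[i][j] = max(best[i][j], best[i][j - 1] + indel)
            if i > 0 and j > 0:  # same or substituted letter
                step = math.log(subst_prob(query[i - 1], row[j - 1]))
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + step)
    return best[n][m]


def rank(query, rows):
    """Rows sorted from most to least likely match."""
    scored = [(match_log_prob(query, row), row) for row in rows]
    return [row for _, row in sorted(scored, reverse=True)]
```

With this table, a query of "kim" ranks "cim" above "tim", because c and k are listed as easily confused while t and k are not.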

Difficulty of doing it

There were 2 main difficulties in making this search engine.

The first was getting it to work at all. Names and similar text can be distorted in as many ways as there are atoms in the Universe, and the probabilities have to be right.

The second problem was to get it fast enough. It is now a very parallel system, using all available cores and processors efficiently. Decreasing the search time is mainly a question of getting more and bigger servers and faster memory. It currently runs on just a Phenom X4 2.2 GHz processor with 16 GB RAM. This is actually too little RAM, so I had to cut away about 1/3 of this public domain database from MusicBrainz. The database is expanded by a factor of about 50 with precalculations to make the search faster. This takes well under a minute, so startup is quite fast, and the database can be updated just as quickly.
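To give a feel for how precalculation can trade memory for speed, here is a small Python sketch of one well-known technique, a deletion index. This text does not say which precalculations the engine actually uses, so this is only an assumed stand-in, and the function names are invented for the example. Every row is stored under all the strings made by deleting up to one letter from it, so a misspelled query finds its candidates by plain dictionary lookups instead of a scan of the whole database.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(word, max_deletes=1):
    """The word itself plus every string made by deleting up to
    max_deletes letters from it."""
    variants = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(rows, max_deletes=1):
    """Precalculation step: map every variant to the rows it came from.
    This is where the database grows to many times its original size."""
    index = defaultdict(set)
    for row in rows:
        for variant in deletion_variants(row, max_deletes):
            index[variant].add(row)
    return index


def candidates(query, index, max_deletes=1):
    """Rows that share at least one deletion variant with the query."""
    found = set()
    for variant in deletion_variants(query, max_deletes):
        found |= index.get(variant, set())
    return found
```

A query such as "betles" then retrieves both "beatles" and "beetles" from an index built over those rows, and only those few candidates need a full probability calculation.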

Doing similar stuff

I also have similar systems for correcting numbers and codes, like those on packages and bills. I also make codes that are extra tough against errors. One such code system corrects any single error, and almost all double errors as well, including both missing and extra digits, and discovers 99.999% of all errors. These codes are also compatible with existing error detection schemes, and have adjustable toughness.

If I get the time, I have plans for a similar search among images.
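For readers unfamiliar with error-correcting codes for digits, the following Python sketch shows a standard textbook scheme, not one of my own codes and far weaker than them: two check symbols computed modulo 11 that allow any single wrong digit to be located and repaired. It handles neither missing nor extra digits. The function names are invented for the example, the code word is limited to 10 symbols, and a check symbol may take the value 10, which in print would need an extra character such as the X used in ISBN-10.

```python
MOD = 11  # a prime, so every non-zero value has an inverse


def encode(data):
    """Prepend two check symbols so that the plain sum and the
    position-weighted sum of the code word are both 0 modulo 11.
    Positions count from 1; the data sits at positions 3 and up."""
    if len(data) > MOD - 3:
        raise ValueError("at most 8 data digits in this sketch")
    plain = sum(data) % MOD
    weighted = sum(pos * d for pos, d in enumerate(data, start=3)) % MOD
    check1 = (weighted - 2 * plain) % MOD
    check2 = (plain - weighted) % MOD
    return [check1, check2] + list(data)


def correct(received):
    """Return the code word with a single wrong symbol repaired.
    Raises ValueError when the error cannot be a single substitution."""
    s0 = sum(received) % MOD
    s1 = sum(pos * r for pos, r in enumerate(received, start=1)) % MOD
    if s0 == 0 and s1 == 0:
        return list(received)          # no error detected
    if s0 == 0 or s1 == 0:
        raise ValueError("error detected but not correctable")
    # One wrong symbol off by e at position p gives s0 = e, s1 = p * e.
    pos = (s1 * pow(s0, -1, MOD)) % MOD
    if pos > len(received):
        raise ValueError("error detected but not correctable")
    fixed = list(received)
    fixed[pos - 1] = (fixed[pos - 1] - s0) % MOD
    return fixed
```

Encoding the digits 4 7 9 0 0 1, changing any one symbol of the result, and passing it to correct() gives back the original code word.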

Buying it

I sell these search services. I can be reached by email, as shown below, or by phone at +47 9001 4425.

Kim Øyhus, M.Sc. Physics
Company: Øyhus Information Technology.

Contact: