Fuzzy word search

Fuzzy word search in a database

In 1993, my colleague and I were asked to write, within six weeks, a C subroutine that finds all similar surnames in a database. In those six weeks we delivered:
- a method for matching phonetically similar words;
- a description of the phonetic rules of a dozen natural languages;
- a software toolkit, which I wrote in Prolog, for designing and adjusting the descriptions of natural languages;
- the C routine itself, whose descriptive part (the phonetic rules) was generated by the toolkit from those descriptions.
The method was later published in the computer science journal of the Russian State Academy of Sciences; see below (an English translation is available from the authors).

The following is a fragment of the rules for the Russian language, with German as the target:

l l ll
k k ck h ch g c q
s s z ts ss zz
d t tt dt
1 sch sh tsch schtsch schtsh shtsch shtsh
1 tsh tch dsch dsh dch
1 stsch ztsch
?I i j y
?U @ $ # % these characters replace umlauts
?V ae oe ue
a j jA jI jU jV
b p pp
a Aj Ij Uj Vj
x ks gz kz gs cks chs hs
?J i j y
?A a e o u i y
?W U V
a AJA AJW WJA WJW AJJ WJJ JJA JJW
2 ksch cksch hsch chsch gsch cksh hsh chsh gsh ksh
1 tk tck th dk dck dh
?A a e o u
?I i j y
?H h ch g gh
a A I U AA AI IA UA UI IU Ah Uh ij ji yi iy aeue
a AAA AIA IAA AAI IIA UAA -aaa -eee -ooo -uuu
a AII -Aii -Ajj -Ayy
a AhA AAh IAh AhU UhA UhI AhI AAhI AAhA AhAA IAhA
a Ih IUh IhA UIU UIA UU jAhi iAhi ieh ihe IhI IAjA
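
The rule syntax is not documented here, so the following is only a minimal sketch of how rules of this kind could drive matching, not the original Prolog toolkit or the generated C routine. It assumes that the first token on a line is a class code and the remaining tokens are spellings that reduce to that code, and it uses only a few of the literal lines from the fragment (the `?X` class definitions and the patterns with uppercase placeholders are left out). The names `build_table`, `phonetic_key` and `similar` are hypothetical, and longest-match-first replacement is an assumed strategy.

```python
# A few literal rule lines copied from the fragment above.
# Assumed reading: first token = class code, other tokens = spelling variants.
RULES = """\
l l ll
k k ck h ch g c q
s s z ts ss zz
d t tt dt
b p pp
x ks gz kz gs cks chs hs
1 sch sh tsch schtsch schtsh shtsch shtsh
"""


def build_table(rules_text):
    """Map every spelling variant to the class code of its rule line."""
    table = {}
    for line in rules_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        code, variants = tokens[0], tokens[1:]
        for variant in variants:
            table[variant] = code
    return table


def phonetic_key(word, table):
    """Reduce a word to a key by replacing the longest matching variant first."""
    word = word.lower()
    max_len = max(len(variant) for variant in table)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first, then shorter ones.
        for n in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No rule applies: keep the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


def similar(a, b, table):
    """Two words are treated as similar when their keys are equal."""
    return phonetic_key(a, table) == phonetic_key(b, table)


TABLE = build_table(RULES)
```

Under these assumptions, `similar("Schmidt", "Shmitt", TABLE)` and `similar("Becker", "Peker", TABLE)` both hold, while `similar("Schmidt", "Becker", TABLE)` does not. Comparing precomputed keys rather than raw spellings is one way such a method can stay independent of the database that stores the names.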


One approach to similar word matching

S. Diev, A. Rubin. "Programmirovanie", Moscow, Russia, 1994, No. 6, pp. 61-70.

Abstract
A rule-based method for phonetic matching of words is described that makes it possible to overcome inexact spelling. The method can be used to match words with different or erroneous spellings (such as surnames, geographic names, and firm names) when searching a database.
The proposed technology 1) is database-independent, 2) is open with respect to natural languages, and 3) can be used by an expert with no knowledge of programming.
