src - OpenBSD base system

author	Marc Espie <espie@cvs.openbsd.org>	2012-05-22 09:02:52 +0000
committer	Marc Espie <espie@cvs.openbsd.org>	2012-05-22 09:02:52 +0000
commit	8566c86bc2d81150b6046c563336cdb53206c52d
tree	ca45538284a9864341fd0d0011cea0dbde04a2a9
parent	7f0ff55c1e37fab526b9871d068724a989f58ca8

import sqlite 3.7.12 (tested by landry@)

Diffstat

-rw-r--r--	lib/libsqlite3/src/test_spellfix.c	1891

1 files changed, 506 insertions, 1385 deletions

diff --git a/lib/libsqlite3/src/test_spellfix.c b/lib/libsqlite3/src/test_spellfix.c
index 3f21d732b68..5a221e0b1b0 100644
--- a/lib/libsqlite3/src/test_spellfix.c
+++ b/lib/libsqlite3/src/test_spellfix.c

@@ -10,9 +10,271 @@

*************************************************************************

-** This module implements the spellfix1 VIRTUAL TABLE that can be used

-** to search a large vocabulary for close matches. See separate

-** documentation files (spellfix1.wiki and editdist3.wiki) for details.

+** This module implements a VIRTUAL TABLE that can be used to search

+** a large vocabulary for close matches. For example, this virtual

+** table can be used to suggest corrections to misspelled words. Or,

+** it could be used with FTS4 to do full-text search using potentially

+** misspelled words.

+**

+** Create an instance of the virtual table this way:

+**

+** CREATE VIRTUAL TABLE demo USING spellfix1;

+**

+** The "spellfix1" term is the name of this module. The "demo" is the

+** name of the virtual table you will be creating. The table is initially

+** empty. You have to populate it with your vocabulary. Suppose you

+** have a list of words in a table named "big_vocabulary". Then do this:

+**

+** INSERT INTO demo(word) SELECT word FROM big_vocabulary;

+**

+** If you intend to use this virtual table in cooperation with an FTS4

+** table (for spelling correction of search terms) then you can extract

+** the vocabulary using an fts4aux table:

+**

+** INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';

+**

+** You can also provide the virtual table with a "rank" for each word.

+** The "rank" is an estimate of how common the word is. Larger numbers

+** mean the word is more common. If you omit the rank when populating

+** the table, then a rank of 1 is assumed. But if you have rank

+** information, you can supply it and the virtual table will show a

+** slight preference for selecting more commonly used terms. To

+** populate the rank from an fts4aux table "search_aux" do something

+** like this:

+**

+** INSERT INTO demo(word,rank)

+** SELECT term, documents FROM search_aux WHERE col='*';

+**

+** To query the virtual table, include a MATCH operator in the WHERE

+** clause. For example:

+**

+** SELECT word FROM demo WHERE word MATCH 'kennasaw';

+**

+** Using a dataset of American place names (derived from

+** http://geonames.usgs.gov/domestic/download_data.htm) the query above

+** returns 20 results beginning with:

+**

+** kennesaw

+** kenosha

+** kenesaw

+** kenaga

+** keanak

+**

+** If you append the character '*' to the end of the pattern, then

+** a prefix search is performed. For example:

+**

+** SELECT word FROM demo WHERE word MATCH 'kennes*';

+**

+** Yields 20 results beginning with:

+**

+** kennesaw

+** kennestone

+** kenneson

+** kenneys

+** keanes

+** keenes

+**

+** The virtual table actually has a unique rowid with five columns plus four

+** extra hidden columns. The columns are as follows:

+**

+** rowid A unique integer number associated with each

+** vocabulary item in the table. This can be used

+** as a foreign key on other tables in the database.

+**

+** word The text of the word that matches the pattern.

+** Both word and pattern can contain Unicode characters

+** and can be mixed case.

+**

+** rank This is the rank of the word, as specified in the

+** original INSERT statement.

+**

+** distance This is an edit distance or Levenshtein distance going

+** from the pattern to the word.

+**

+** langid This is the language-id of the word. All queries are

+** against a single language-id, which defaults to 0.

+** For any given query this value is the same on all rows.

+**

+** score The score is a combination of rank and distance. The

+** idea is that a lower score is better. The virtual table

+** attempts to find words with the lowest score and

+** by default (unless overridden by ORDER BY) returns

+** results in order of increasing score.

+**

+** top (HIDDEN) For any query, this value is the same on all

+** rows. It is an integer which is the maximum number of

+** rows that will be output. The actual number of rows

+** output might be less than this number, but it will never

+** be greater. The default value for top is 20, but that

+** can be changed for each query by including a term of

+** the form "top=N" in the WHERE clause of the query.

+**

+** scope (HIDDEN) For any query, this value is the same on all

+** rows. The scope is a measure of how widely the virtual

+** table looks for matching words. Smaller values of

+** scope cause a broader search. The scope is normally

+** chosen automatically and is capped at 4. Applications

+** can change the scope by including a term of the form

+** "scope=N" in the WHERE clause of the query. Increasing

+** the scope will make the query run faster, but will reduce

+** the possible corrections.

+**

+** srchcnt (HIDDEN) For any query, this value is the same on all

+** rows. This value is an integer which is the number of

+** words examined using the edit-distance algorithm to

+** find the top matches that are ultimately displayed. This

+** value is for diagnostic use only.

+**

+** soundslike (HIDDEN) When inserting vocabulary entries, this field

+** can be set to a spelling that matches what the word

+** sounds like. See the DEALING WITH UNUSUAL AND DIFFICULT

+** SPELLINGS section below for details.

+**

+** When inserting into or updating the virtual table, only the rowid, word,

+** rank, and langid may be changed. Any attempt to set or modify the values

+** of distance, score, top, scope, or srchcnt is silently ignored.

+**

+** ALGORITHM

+**

+** A shadow table named "%_vocab" (where the % is replaced by the name of

+** the virtual table; Ex: "demo_vocab" for the "demo" virtual table) is

+** constructed with these columns:

+**

+** id The unique id (INTEGER PRIMARY KEY)

+**

+** rank The rank of word.

+**

+** langid The language id for this entry.

+**

+** word The original UTF8 text of the vocabulary word

+**

+** k1 The word transliterated into lower-case ASCII.

+** There is a standard table of mappings from non-ASCII

+** characters into ASCII. Examples: "æ" -> "ae",

+** "þ" -> "th", "ß" -> "ss", "á" -> "a", ... The

+** accessory function spellfix1_translit(X) will do

+** the non-ASCII to ASCII mapping. The built-in lower(X)

+** function will convert to lower-case. Thus:

+** k1 = lower(spellfix1_translit(word)).

+**

+** k2 This field holds a phonetic code derived from k1. Letters

+** that have similar sounds are mapped into the same symbol.

+** For example, all vowels and vowel clusters become the

+** single symbol "A". And the letters "p", "b", "f", and

+** "v" all become "B". All nasal sounds are represented

+** as "N". And so forth. The mapping is based on

+** ideas found in Soundex, Metaphone, and other

+** long-standing phonetic matching systems. This key can

+** be generated by the function spellfix1_charclass(X).

+** Hence: k2 = spellfix1_charclass(k1)

+**

+** There is also a function for computing the Wagner edit distance or the

+** Levenshtein distance between a pattern and a word. This function

+** is exposed as spellfix1_editdist(X,Y). The edit distance function

+** returns the "cost" of converting X into Y. Some transformations

+** cost more than others. Changing one vowel into a different vowel,

+** for example, is relatively cheap, as is doubling a consonant, or

+** omitting the second character of a double consonant. Other transformations

+** are more expensive. The idea is that the edit distance function returns

+** a low cost for words that are similar and a higher cost for words

+** that are further apart. In this implementation, the maximum cost

+** of any single-character edit (delete, insert, or substitute) is 100,

+** with lower costs for some edits (such as transforming vowels).

+**

+** The "score" for a comparison is the edit distance between the pattern

+** and the word, adjusted down by the base-2 logarithm of the word rank.

+** For example, a match with distance 100 but rank 1000 would have a

+** score of 122 (= 100 - log2(1000) + 32) whereas a match with distance

+** 100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).

+** (NB: The constant 32 is added to each score to keep it from going

+** negative in case the edit distance is zero.) In this way, frequently

+** used words get a slightly lower cost which tends to move them toward

+** the top of the list of alternative spellings.

+**

+** A straightforward implementation of a spelling corrector would be

+** to compare the search term against every word in the vocabulary

+** and select the 20 with the lowest scores. However, there will

+** typically be hundreds of thousands or millions of words in the

+** vocabulary, and so this approach is not fast enough.

+**

+** Suppose the term that is being spell-corrected is X. To limit

+** the search space, X is converted to a k2-like key using the

+** equivalent of:

+**

+** key = spellfix1_charclass(lower(spellfix1_translit(X)))

+**

+** This key is then limited to "scope" characters. The default scope

+** value is 4, but an alternative scope can be specified using the

+** "scope=N" term in the WHERE clause. After the key has been truncated,

+** the edit distance is run against every term in the vocabulary that

+** has a k2 value that begins with the abbreviated key.

+**

+** For example, suppose the input word is "Paskagula". The phonetic

+** key is "BACACALA" which is then truncated to 4 characters "BACA".

+** The edit distance is then run on the 4980 entries (out of

+** 272,597 entries total) of the vocabulary whose k2 values begin with

+** BACA, yielding "Pascagoula" as the best match.

+**

+** Only terms of the vocabulary with a matching langid are searched.

+** Hence, the same table can contain entries from multiple languages

+** and only the requested language will be used. The default langid

+** is 0.

+**

+** DEALING WITH UNUSUAL AND DIFFICULT SPELLINGS

+**

+** The algorithm above works quite well for most cases, but there are

+** exceptions. These exceptions can be dealt with by making additional

+** entries in the virtual table using the "soundslike" column.

+**

+** For example, many words of Greek origin begin with letters "ps" where

+** the "p" is silent. Ex: psalm, pseudonym, psoriasis, psyche. In

+** another example, many Scottish surnames can be spelled with an

+** initial "Mac" or "Mc". Thus, "MacKay" and "McKay" are both pronounced

+** the same.

+**

+** Accommodation can be made for words that are not spelled as they

+** sound by making additional entries into the virtual table for the

+** same word, but adding an alternative spelling in the "soundslike"

+** column. For example, the canonical entry for "psalm" would be this:

+**

+** INSERT INTO demo(word) VALUES('psalm');

+**

+** To enhance the ability to correct the spelling of "salm" into

+** "psalm", make an additional entry like this:

+**

+** INSERT INTO demo(word,soundslike) VALUES('psalm','salm');

+**

+** It is ok to make multiple entries for the same word as long as

+** each entry has a different soundslike value. Note that if no

+** soundslike value is specified, the soundslike defaults to the word

+** itself.

+**

+** Listed below are some cases where it might make sense to add additional

+** soundslike entries. The specific entries will depend on the application

+** and the target language.

+**

+** * Silent "p" in words beginning with "ps": psalm, psyche

+**

+** * Silent "p" in words beginning with "pn": pneumonia, pneumatic

+**

+** * Silent "p" in words beginning with "pt": pterodactyl, ptolemaic

+**

+** * Silent "d" in words beginning with "dj": djinn, Djikarta

+**

+** * Silent "k" in words beginning with "kn": knight, Knuthson

+**

+** * Silent "g" in words beginning with "gn": gnarly, gnome, gnat

+**

+** * "Mac" versus "Mc" beginning Scottish surnames

+**

+** * "Tch" sounds in Slavic words: Tchaikovsky vs. Chaykovsky

+**

+** * The letter "j" pronounced like "h" in Spanish: LaJolla

+**

+** * Words beginning with "wr" versus "r": write vs. rite

+**

+** * Miscellaneous problem words such as "debt", "tsetse",

+** "Nguyen", "Van Nuys".

#if SQLITE_CORE

# include "sqliteInt.h"
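
The header comment in the hunk above describes the score as the edit distance adjusted down by the base-2 logarithm of the rank, plus the constant 32. A minimal sketch of one integer reading of that formula follows. The names `sketch_bit_length` and `sketch_score` are hypothetical, and approximating log2(rank) by the number of significant bits in rank is an assumption made for illustration: it happens to reproduce both worked values quoted in the comment (122 and 131), but the text shown here does not confirm how the module itself rounds.

```c
#include <assert.h>

/* Hypothetical helper, not taken from the patch: the number of bits
** needed to represent a positive rank (1 -> 1, 1000 -> 10). */
static int sketch_bit_length(int rank) {
    int bits = 0;
    while (rank > 0) {
        bits++;
        rank >>= 1;
    }
    return bits;
}

/* Assumed integer form of "distance - log2(rank) + 32".
** A lower score is better; a higher rank lowers the score slightly. */
static int sketch_score(int distance, int rank) {
    assert(distance >= 0);
    return distance + 32 - sketch_bit_length(rank);
}
```

Under this reading, a word that is 100 edit-cost units away but has rank 1000 scores nine points better than an equally distant word with rank 1, which is the "slight preference" for common words that the comment describes.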

@@ -23,22 +285,21 @@

# include "sqlite3ext.h"

SQLITE_EXTENSION_INIT1

#endif /* !SQLITE_CORE */

-#include <ctype.h>

** Character classes for ASCII characters:

** 0 '' Silent letters: H W

** 1 'A' Any vowel: A E I O U (Y)

-** 2 'B' A bilabial stop or fricative: B F P V W

+** 2 'B' A bilabial stop or fricative: B F P V

** 3 'C' Other fricatives or back stops: C G J K Q S X Z

** 4 'D' Alveolar stops: D T

** 5 'H' Letter H at the beginning of a word

-** 6 'L' Glide: L

-** 7 'R' Semivowel: R

-** 8 'M' Nasals: M N

+** 6 'L' Glides: L R

+** 7 'M' Nasals: M N

+** 8 'W' Letter W at the beginning of a word

** 9 'Y' Letter Y at the beginning of a word.

-** 10 '9' Digits: 0 1 2 3 4 5 6 7 8 9

+** 10 '9' A digit: 0 1 2 3 4 5 6 7 8 9

** 11 ' ' White space

** 12 '?' Other.
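
The class list above, together with the "Paskagula" example in the header comment, suggests how a search key could be derived. The following is a simplified sketch, not the module's code: `sketch_class` and `sketch_key` are hypothetical names, the letter groups follow the new (`+`) side of this hunk, and two behaviours are assumptions inferred from the example rather than stated in the patch, namely that silent letters are dropped and that runs of the same class collapse to one symbol. The special initial-letter classes for H, W, and Y are ignored, and input is assumed to be already transliterated to ASCII.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Map one ASCII character to a class symbol; 0 means "silent, skip". */
static char sketch_class(char ch) {
    int c = tolower((unsigned char)ch);
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return 'A';                       /* any vowel */
    case 'b': case 'f': case 'p': case 'v':
        return 'B';                       /* bilabial stop or fricative */
    case 'c': case 'g': case 'j': case 'k':
    case 'q': case 's': case 'x': case 'z':
        return 'C';                       /* other fricatives, back stops */
    case 'd': case 't':
        return 'D';                       /* alveolar stops */
    case 'l': case 'r':
        return 'L';                       /* glides */
    case 'm': case 'n':
        return 'M';                       /* nasals */
    case 'h': case 'w':
        return 0;                         /* silent when not word-initial */
    default:
        if (isdigit(c)) return '9';
        if (isspace(c)) return ' ';
        return '?';
    }
}

/* Write at most `scope` class symbols for `word` into `out`,
** skipping silent letters and collapsing adjacent duplicates. */
static void sketch_key(const char *word, size_t scope,
                       char *out, size_t outsz) {
    size_t n = 0;
    assert(outsz > 0);
    for (; *word != '\0' && n < scope && n + 1 < outsz; word++) {
        char c = sketch_class(*word);
        if (c == 0) continue;                    /* silent letter */
        if (n > 0 && out[n - 1] == c) continue;  /* same class as previous */
        out[n++] = c;
    }
    out[n] = '\0';
}
```

With a large scope, `sketch_key("Paskagula", ...)` yields `BACACALA`, and with a scope of 4 it yields `BACA`, matching the key and the truncated key given in the header comment.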

@@ -49,8 +310,8 @@

#define CCLASS_D 4

#define CCLASS_H 5

#define CCLASS_L 6

-#define CCLASS_R 7

-#define CCLASS_M 8

+#define CCLASS_M 7

+#define CCLASS_W 8

#define CCLASS_Y 9

#define CCLASS_DIGIT 10

#define CCLASS_SPACE 11

@@ -61,177 +322,78 @@

** characters.

static const unsigned char midClass[] = {

- /* */ CCLASS_OTHER, /* */ CCLASS_OTHER, /* */ CCLASS_OTHER,

- /* */ CCLASS_SPACE, /* */ CCLASS_OTHER, /* */ CCLASS_OTHER,

- /* */ CCLASS_SPACE, /* */ CCLASS_SPACE, /* */ CCLASS_OTHER,

- /* */ CCLASS_OTHER, /* */ CCLASS_OTHER, /* */ CCLASS_OTHER,

- /* */ CCLASS_OTHER, /* */ CCLASS_OTHER, /* */ CCLASS_SPACE,

- /* ! */ CCLASS_OTHER, /* " */ CCLASS_OTHER, /* # */ CCLASS_OTHER,

- /* $ */ CCLASS_OTHER, /* % */ CCLASS_OTHER, /* & */ CCLASS_OTHER,

- /* ' */ CCLASS_SILENT, /* ( */ CCLASS_OTHER, /* ) */ CCLASS_OTHER,

- /* * */ CCLASS_OTHER, /* + */ CCLASS_OTHER, /* , */ CCLASS_OTHER,

- /* - */ CCLASS_OTHER, /* . */ CCLASS_OTHER, /* / */ CCLASS_OTHER,

- /* 0 */ CCLASS_DIGIT, /* 1 */ CCLASS_DIGIT, /* 2 */ CCLASS_DIGIT,

- /* 3 */ CCLASS_DIGIT, /* 4 */ CCLASS_DIGIT, /* 5 */ CCLASS_DIGIT,

- /* 6 */ CCLASS_DIGIT, /* 7 */ CCLASS_DIGIT, /* 8 */ CCLASS_DIGIT,

- /* 9 */ CCLASS_DIGIT, /* : */ CCLASS_OTHER, /* ; */ CCLASS_OTHER,

- /* < */ CCLASS_OTHER, /* = */ CCLASS_OTHER, /* > */ CCLASS_OTHER,

- /* ? */ CCLASS_OTHER, /* @ */ CCLASS_OTHER, /* A */ CCLASS_VOWEL,

- /* B */ CCLASS_B, /* C */ CCLASS_C, /* D */ CCLASS_D,

- /* E */ CCLASS_VOWEL, /* F */ CCLASS_B, /* G */ CCLASS_C,

- /* H */ CCLASS_SILENT, /* I */ CCLASS_VOWEL, /* J */ CCLASS_C,

- /* K */ CCLASS_C, /* L */ CCLASS_L, /* M */ CCLASS_M,

- /* N */ CCLASS_M, /* O */ CCLASS_VOWEL, /* P */ CCLASS_B,

- /* Q */ CCLASS_C, /* R */ CCLASS_R, /* S */ CCLASS_C,

- /* T */ CCLASS_D, /* U */ CCLASS_VOWEL, /* V */ CCLASS_B,

- /* W */ CCLASS_B, /* X */ CCLASS_C, /* Y */ CCLASS_VOWEL,

- /* Z */ CCLASS_C, /* [ */ CCLASS_OTHER, /* \ */ CCLASS_OTHER,

- /* ] */ CCLASS_OTHER, /* ^ */ CCLASS_OTHER, /* _ */ CCLASS_OTHER,

- /* ` */ CCLASS_OTHER, /* a */ CCLASS_VOWEL, /* b */ CCLASS_B,

- /* c */ CCLASS_C, /* d */ CCLASS_D, /* e */ CCLASS_VOWEL,

- /* f */ CCLASS_B, /* g */ CCLASS_C, /* h */ CCLASS_SILENT,

- /* i */ CCLASS_VOWEL, /* j */ CCLASS_C, /* k */ CCLASS_C,

- /* l */ CCLASS_L, /* m */ CCLASS_M, /* n */ CCLASS_M,

- /* o */ CCLASS_VOWEL, /* p */ CCLASS_B, /* q */ CCLASS_C,

- /* r */ CCLASS_R, /* s */ CCLASS_C, /* t */ CCLASS_D,

- /* u */ CCLASS_VOWEL, /* v */ CCLASS_B, /* w */ CCLASS_B,

- /* x */ CCLASS_C, /* y */ CCLASS_VOWEL, /* z */ CCLASS_C,

- /* { */ CCLASS_OTHER, /* | */ CCLASS_OTHER, /* } */ CCLASS_OTHER,

- /* ~ */ CCLASS_OTHER, /* */ CCLASS_OTHER,

+ /* x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xa xb xc xd xe xf */

+ /* 0x */ 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 12, 11, 12, 12, 12,

+ /* 1x */ 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,

+ /* 2x */ 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,

+ /* 3x */ 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12,

+ /* 4x */ 12, 1, 2, 3, 4, 1, 2, 3, 0, 1, 3, 3, 6, 7, 7, 1,

+ /* 5x */ 2, 3, 6, 3, 4, 1, 2, 0, 3, 1, 3, 12, 12, 12, 12, 12,

+ /* 6x */ 12, 1, 2, 3, 4, 1, 2, 3, 0, 1, 3, 3, 6, 7, 7, 1,

+ /* 7x */ 2, 3, 6, 3, 4, 1, 2, 0, 3, 1, 3, 12, 12, 12, 12, 12,

};

** This table gives the character class for ASCII characters that form the

** initial character of a word. The only difference from midClass is with

** the letters H, W, and Y.

static const unsigned char initClass[] = {

- /* */ CCLASS_OTHER, /* */ CCLASS_OTHER, /* */ CCLASS_OTHER,

- /* */ CCLASS_SPACE, /* */ CCLASS_OTHER, /* */ CCLASS_OTHER,

- /* */ CCLASS_SPACE, /* */ CCLASS_SPACE, /* */ CCLASS_OTHER,

- /* */ CCLASS_OTHER, /* */ CCLASS_OTHER, /* */ CCLASS_OTHER,

- /* */ CCLASS_OTHER, /* */ CCLASS_OTHER, /* */ CCLASS_SPACE,

- /* ! */ CCLASS_OTHER, /* " */ CCLASS_OTHER, /* # */ CCLASS_OTHER,

- /* $ */ CCLASS_OTHER, /* % */ CCLASS_OTHER, /* & */ CCLASS_OTHER,

- /* ' */ CCLASS_OTHER, /* ( */ CCLASS_OTHER, /* ) */ CCLASS_OTHER,