From 0a193e032ba1ecf3f003e027e833dc9d274cb740 Mon Sep 17 00:00:00 2001 From: Kaleb Keithley Date: Fri, 14 Nov 2003 16:49:22 +0000 Subject: Initial revision --- lisp/re/README | 121 +++ lisp/re/re.c | 2648 +++++++++++++++++++++++++++++++++++++++++++++++++++++ lisp/re/re.h | 123 +++ lisp/re/rec.c | 1015 ++++++++++++++++++++ lisp/re/reo.c | 685 ++++++++++++++ lisp/re/rep.h | 369 ++++++++ lisp/re/tests.c | 199 ++++ lisp/re/tests.txt | 461 ++++++++++ 8 files changed, 5621 insertions(+) create mode 100644 lisp/re/README create mode 100644 lisp/re/re.c create mode 100644 lisp/re/re.h create mode 100644 lisp/re/rec.c create mode 100644 lisp/re/reo.c create mode 100644 lisp/re/rep.h create mode 100644 lisp/re/tests.c create mode 100644 lisp/re/tests.txt (limited to 'lisp/re') diff --git a/lisp/re/README b/lisp/re/README new file mode 100644 index 0000000..848e1e9 --- /dev/null +++ b/lisp/re/README @@ -0,0 +1,121 @@ +$XFree86: xc/programs/xedit/lisp/re/README,v 1.4 2002/11/15 07:01:32 paulo Exp $ + +LAST UPDATED: $Date$ + + This is a small regex library for fast matching tokens in text. It was built +to be used by xedit and it's syntax highlight code. It is not compliant with +IEEE Std 1003.2, but is expected to be used where very fast matching is +required, and exotic patterns will not be used. + + To understand what kind of patterns this library is expected to be used with, +see the file xc/programs/xedit/lisp/modules/progmodes/c.lsp and some +samples in the file tests.txt, with comments for patterns that will not work, +or may give incorrect results. + + The library is not built upon the standard regex library by Henry Spencer, +but is completely written from scratch, but it's syntax is heavily based on +that library, and the only reason for it to exist is that unfortunately +the standard version does not fit the requirements needed by xedit. +Anyways, I would like to thanks Henry for his regex library, it is a really +very useful tool. + + Small description of understood tokens: + + M A T C H I N G +------------------------------------------------------------------------ +. Any character (won't match newline if compiled with RE_NEWLINE) +\w Any word letter (shortcut to [a-zA-Z0-9_] +\W Not a word letter (shortcut to [^a-zA-Z0-9_] +\d Decimal number +\D Not a decimal number +\s A space +\S Not a space +\l A lower case letter +\u An upper case letter +\c A control character, currently the range 1-32 (minus tab) +\C Not a control character +\o Octal number +\O Not an octal number +\x Hexadecimal number +\X Not an hexadecimal number +\< Beginning of a word (matches an empty string) +\> End of a word (matches an empty string) +^ Beginning of a line (matches an empty string) +$ End of a line (matches an empty string) +[...] Matches one of the characters inside the brackets + ranges are specified separating two characters with "-". + If the first character is "^", matches only if the + character is not in this range. To add a "]" make it + the first character, and to add a "-" make it the last. +\1 to \9 Backreference, matches the text that was matched by a group, + that is, text that was matched by the pattern inside + "(" and ")". + + + O P E R A T O R S +------------------------------------------------------------------------ +() Any pattern inside works as a backreference, and is also + used to group patterns. +| Alternation, allows choosing different possibilities, like + character ranges, but allows patterns of different lengths. + + + R E P E T I T I O N +------------------------------------------------------------------------ +* may occur any number of times, including zero ++ must occur at least once +? is optional +{} must occur exactly times +{,} must occur at least times +{,} must not occur more than times +{,} must occur at least times, but no more than + + + Note that "." is a special character, and when used with a repetition +operator it changes completely its meaning. For example, ".*" matches +anything up to the end of the input string (unless the pattern was compiled +with RE_NEWLINE, in that case it will match anything, but a newline). + + + Limitations: + +o Only minimal matches supported. The engine has only one level "backtracking", + so, it also only does minimal matches to allow backreferences working + properly, and to avoid failing to match depending on the input. + +o Only one level "grouping", for example, with the pattern: + (a(b)c) + If "abc" is anywhere in the input, it will be in "\1", but there will + not exist a "\2" for "b". + +o Some "special repetitions" were not implemented, these are: + .{} + .{,} + .{,} + .{,} + +o Some patterns will never match, for example: + \w*\d + Since "\w*" already includes all possible matches of "\d", "\d" will + only be tested when "\w*" failed. There are no plans to make such + patterns work. + + + Some of these limitations may be worked on future versions of the library, +but this is not what the library is expected to do, and, adding support for +correct handling of these would probably make the library slower, what is +not the reason of it to exist in the first time. + + If you need "true" regex than this library is not for you, but if all +you need is support for very quickly finding simple patterns, than this +library can be a very powerful tool, on some patterns it can run more +than 200 times faster than "true" regex implementations! And this is +the reason it was written. + + + + Send comments and code to me (paulo@XFree86.Org) or to the XFree86 +mailing/patch lists. + +-- +Paulo diff --git a/lisp/re/re.c b/lisp/re/re.c new file mode 100644 index 0000000..d848a4b --- /dev/null +++ b/lisp/re/re.c @@ -0,0 +1,2648 @@ +/* + * Copyright (c) 2002 by The XFree86 Project, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE XFREE86 PROJECT BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF + * OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Except as contained in this notice, the name of the XFree86 Project shall + * not be used in advertising or otherwise to promote the sale, use or other + * dealings in this Software without prior written authorization from the + * XFree86 Project. + * + * Author: Paulo César Pereira de Andrade + */ + +/* $XFree86: xc/programs/xedit/lisp/re/re.c,v 1.9 2002/12/11 04:44:28 paulo Exp $ */ + +#include +#include "rep.h" +#define DEBUG +/* + * Types + */ + +/* Information used when generating the final form of the compiled re. + */ +struct _re_inf { + rec_alt *alt; + unsigned char *cod; + long len; + long spc; + + /* Start offset of special repetition instruction */ + long sr[MAX_DEPTH]; + + /* Jump offset of special repetition instruction */ + long sj[MAX_DEPTH]; + + /* Just a flag, to know if this nesting is for a special repetition */ + char sp[MAX_DEPTH]; + + int bas; /* Alternatives/repetitions depth */ + int par; /* Open parenthesis counter */ + int ref; /* Backreference counter */ + + rec_pat *apat; /* Alternatives duplicate patterns + * if a special repetition is found, + * this is done to somewhat simplify + * the bytecode engine and still allow + * some complex (and time consuming) + * patterns. */ + + int flags; + int ecode; +}; + +/* This structure is not associated with re_cod as it's data only matters + * to the current match search. + */ +struct _re_eng { + unsigned char *bas; /* Base string pointer */ + unsigned char *str; /* String to search for pattern */ + unsigned char *end; /* Where to stop searching */ + unsigned char *cod; /* Pointer in the re_cod structure */ + long off; /* Number of used entries in so/eo etc */ + + /* Match offset/nesting information */ + long so[MAX_DEPTH]; /* (s)tart of (m)atch */ + long eo[MAX_DEPTH]; /* (e)nd of (m)atch */ + long sv[MAX_DEPTH]; /* (s)a(v)e match end offset */ + long re[MAX_DEPTH]; /* (re)petition count */ + long ss[MAX_DEPTH]; /* (s)ave (s)tart of match */ + unsigned char *rcod[MAX_DEPTH]; /* restart position in regex code */ + unsigned char *rstr[MAX_DEPTH]; /* restart position in string */ + + /* Group/backreference information */ + long goff; + long gso[9]; + long geo[9]; +}; + +/* + * Prototypes + */ +static void reinit(void); +static int rec_check(re_inf*, int); +static int rec_code(re_inf*, ReCode); +static int rec_byte(re_inf*, int); +static int rec_byte_byte(re_inf*, int, int); +static int rec_code_byte(re_inf*, ReCode, int); +static int rec_length(re_inf*, int); +static int rec_code_byte_byte(re_inf*, ReCode, int, int); +static int rec_build_alt(re_inf*, rec_alt*); +static int rec_build_pat(re_inf*, rec_pat*); +static int rec_build_rng(re_inf*, rec_rng*); +static int rec_build_grp(re_inf*, rec_grp*); +static int rec_build_stl(re_inf*, rec_stl*); +static int rec_build_rep(re_inf*, rec_rep*); +static int rec_inc_spc(re_inf*); +static int rec_dec_spc(re_inf*); +static int rec_add_spc(re_inf*, int); +static int rec_off_spc(re_inf*); +static int rec_alt_spc(re_inf*, int); +static int rec_rep_spc(re_inf*, int); +#ifdef DEBUG +static void redump(re_cod*); +#endif + +/* + * Initialization + */ +unsigned char re__alnum[256]; +unsigned char re__odigit[256]; +unsigned char re__ddigit[256]; +unsigned char re__xdigit[256]; +unsigned char re__control[256]; + +/* + * Implementation + */ +int +recomp(re_cod *preg, const char *pattern, int flags) +{ + int i, ecode; + re_inf inf; + + reinit(); + + preg->cod = NULL; + inf.alt = irec_comp(pattern, + flags & RE_PEND ? preg->re_endp : + pattern + strlen(pattern), + flags, &ecode); + if (ecode != 0) + return (ecode); + + inf.cod = NULL; + inf.len = inf.spc = 0; + inf.bas = 0; + inf.par = 0; + inf.ref = 0; + inf.apat = NULL; + inf.flags = flags; + inf.ecode = 0; + for (i = 0; i < MAX_DEPTH; i++) + inf.sp[i] = 0; + + /* First byte is runtime modifier flags */ + if (rec_byte(&inf, flags & (RE_NEWLINE | RE_NOSUB)) == 0 && + rec_byte(&inf, 0xff) == 0 && + rec_build_alt(&inf, inf.alt) == 0 && + rec_rep_spc(&inf, 0) == 0 && + rec_code(&inf, Re_Done) == 0) { + /* Number of possible references, loops will not leave this + * value correct, but it is cheap to read it from the second + * byte, instead of adding several extra checks in the bytecode. */ + if (inf.ref) + inf.cod[1] = inf.ref - 1; + preg->cod = inf.cod; + /* Public structure member */ + preg->re_nsub = inf.ref; + } + + irec_free_alt(inf.alt); + if (inf.ecode) + free(inf.cod); +#ifdef DEBUG + else if (flags & RE_DUMP) + redump(preg); +#endif + + return (inf.ecode); +} + +int +reexec(const re_cod *preg, const char *string, + int nmatch, re_mat pmat[], int flags) +{ + unsigned char *ptr, *str, newline, nosub; + int len, si, ci, bas, i, j, k, l, m; + re_eng eng; + + if (preg == NULL || preg->cod == NULL || nmatch < 0 || + ((flags & RE_STARTEND) && + (pmat == NULL || pmat[0].rm_eo < pmat[0].rm_so))) + return (RE_INVARG); + + eng.str = (unsigned char*)string; + if (flags & RE_STARTEND) { + eng.end = eng.str + pmat[0].rm_eo; + eng.str += pmat[0].rm_so; + } + else + eng.end = eng.str + strlen(string); + eng.bas = eng.str; + nosub = preg->cod[0] & RE_NOSUB; + newline = preg->cod[0] & RE_NEWLINE; + eng.cod = preg->cod + 2; + + if (!nosub && preg->cod[1] != 0xff) { + for (i = 0; i <= preg->cod[1]; i++) { + eng.gso[i] = 0; + eng.geo[i] = -1; + } + } + + /* Setup to search for start of match from the first character */ + eng.so[0] = 0; + eng.eo[0] = eng.sv[0] = -1; + eng.rcod[0] = eng.cod; + eng.rstr[0] = eng.str + 1; + eng.off = 0; + eng.goff = -1; + for (ci = si = 1;;) { +reset: + switch (*eng.cod) { + /**************************************************** + * One byte codes * + ****************************************************/ + case Re_Any: + if (eng.str == eng.end || (newline && eng.str[0] == '\n')) + goto fail; + goto match; + case Re_AnyEatAnyTimes: + if (newline) { + for (ptr = eng.str; ptr < eng.end; ptr++) { + if (*ptr == '\n') + break; + } + si = ptr - eng.str; + } + else + si = eng.end - eng.str; + goto match; + case Re_AnyEatMaybe: + si = eng.end > eng.str; + if (newline && si && eng.str[0] == '\n') + si = 0; + goto match; + case Re_AnyEatAtLeast: + if (newline) { + for (ptr = eng.str; ptr < eng.end; ptr++) { + if (*ptr == '\n') + break; + } + si = ptr - eng.str; + } + else + si = eng.end - eng.str; + if (si == 0) { + si = 1; + goto fail; + } + goto match; + case Re_Odigit: + if (eng.str >= eng.end) + goto fail; + if (re__odigit[eng.str[0]]) + goto match; + goto fail; + case Re_OdigitNot: + if (eng.str >= eng.end || re__odigit[eng.str[0]]) + goto fail; + goto match; + case Re_Digit: + if (eng.str >= eng.end) + goto fail; + if (re__ddigit[eng.str[0]]) + goto match; + goto fail; + case Re_DigitNot: + if (eng.str >= eng.end || re__ddigit[eng.str[0]]) + goto fail; + goto match; + case Re_Xdigit: + if (eng.str >= eng.end) + goto fail; + if (re__xdigit[eng.str[0]]) + goto match; + goto fail; + case Re_XdigitNot: + if (eng.str >= eng.end || re__xdigit[eng.str[0]]) + goto fail; + goto match; + case Re_Space: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] == ' ' || eng.str[0] == '\t') + goto match; + goto fail; + case Re_SpaceNot: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] != ' ' && eng.str[0] != '\t') + goto match; + goto fail; + case Re_Tab: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] == '\t') + goto match; + goto fail; + case Re_Newline: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] == '\n') + goto match; + goto fail; + case Re_Lower: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] >= 'a' && eng.str[0] <= 'z') + goto match; + goto fail; + case Re_Upper: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] >= 'A' && eng.str[0] <= 'Z') + goto match; + goto fail; + case Re_Alnum: + if (eng.str >= eng.end) + goto fail; + if (re__alnum[eng.str[0]]) + goto match; + goto fail; + case Re_AlnumNot: + if (eng.str >= eng.end) + goto fail; + if (re__alnum[eng.str[0]]) + goto fail; + goto match; + case Re_Control: + if (eng.str >= eng.end) + goto fail; + if (re__control[eng.str[0]]) + goto match; + goto fail; + case Re_ControlNot: + if (eng.str >= eng.end || re__control[eng.str[0]]) + goto fail; + goto match; + + /**************************************************** + * One byte codes, match special emtpy strings * + ****************************************************/ + case Re_Bol: + if (eng.str == eng.bas) { + if ((flags & RE_NOTBOL)) { + /* String does not start at the beginning of a line */ + if (newline) + goto fail; + goto wont; + } + si = 0; + goto match; + } + if (newline && eng.str[-1] == '\n') { + si = 0; + goto match; + } + goto fail; + case Re_Eol: + if (eng.str == eng.end) { + if (flags & RE_NOTEOL) + /* String does not finish at the end of a line */ + goto wont; + si = 0; + goto match; + } + if (newline && eng.str[0] == '\n') { + si = 0; + goto match; + } + goto fail; + case Re_Bow: + if (eng.str >= eng.end || + (eng.str > eng.bas && + (re__alnum[eng.str[-1]]))) + goto fail; + if (re__alnum[eng.str[0]]) { + si = 0; + goto match; + } + goto fail; + case Re_Eow: + if (eng.str == eng.bas || + (eng.str < eng.end && + re__alnum[eng.str[0]])) + goto fail; + if (re__alnum[eng.str[-1]]) { + si = 0; + goto match; + } + goto fail; + + /**************************************************** + * One byte code, one byte argument * + ****************************************************/ + case Re_Literal: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] == eng.cod[1]) { + ci = 2; + goto match; + } + goto fail; + case Re_LiteralNot: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] != eng.cod[1]) { + ci = 2; + goto match; + } + goto fail; + case Re_SearchLiteral: + for (str = eng.str; str < eng.end; str++) { + if (*str == eng.cod[1]) { + ci = 2; + eng.str = str; + goto match; + } + } + /* This bytecode only happens in the toplevel */ + eng.so[0] = str - eng.bas; + eng.str = str; + goto fail; + + /**************************************************** + * One byte code, two bytes argument * + ****************************************************/ + case Re_CaseLiteral: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] == eng.cod[1] || eng.str[0] == eng.cod[2]) { + ci = 3; + goto match; + } + goto fail; + case Re_CaseLiteralNot: + if (eng.str >= eng.end) + goto fail; + if (eng.str[0] != eng.cod[1] && eng.str[0] != eng.cod[2]) { + ci = 3; + goto match; + } + goto fail; + case Re_SearchCaseLiteral: + for (str = eng.str; str < eng.end; str++) { + if (*str == eng.cod[1] || *str == eng.cod[2]) { + ci = 3; + eng.str = str; + goto match; + } + } + eng.so[0] = str - eng.bas; + eng.str = str; + goto fail; + + /**************************************************** + * One byte codes, two arguments, n bytes * + ****************************************************/ + case Re_String: + len = eng.cod[1]; + if (len & 0x80) { + i = 3; + len = (len & 0x7f) + (eng.cod[2] << 7); + } + else + i = 2; + if (eng.end - eng.str < len) + goto fail; + ptr = eng.cod + i; + str = eng.str; + for (k = len; k > 0; k--) { + if (*ptr++ != *str++) + goto fail; + } + ci = i + len; + si = len; + goto match; + case Re_SearchString: + len = eng.cod[1]; + if (len & 0x80) { + i = 3; + len = (len & 0x7f) + (eng.cod[2] << 7); + } + else + i = 2; + for (str = eng.str; eng.end - str >= len; str = eng.str++) { + for (ptr = eng.cod + i, str = eng.str, k = len; k > 0; k--) + if (*ptr++ != *str++) + break; + if (k == 0) { + /* Substring found */ + ci = i + len; + si = str - eng.str; + goto match; + } + } + eng.so[0] = eng.end - eng.bas; + eng.str = eng.end; + goto fail; + + case Re_CaseString: + len = eng.cod[1]; + if (len & 0x80) { + i = 3; + len = (len & 0x7f) + (eng.cod[2] << 7); + } + else + i = 2; + + len >>= 1; + /* Check if there are at least len/2 bytes left, string + * is represented as two bytes, lower and upper case */ + if (eng.end - eng.str < len) + goto fail; + ptr = eng.cod + i; + str = eng.str; + for (k = len; k > 0; str++, ptr += 2, k--) { + if (*str != ptr[0] && *str != ptr[1]) + goto fail; + } + ci = i + (len << 1); + si = len; + goto match; + case Re_SearchCaseString: + len = eng.cod[1]; + if (len & 0x80) { + i = 3; + len = (len & 0x7f) + (eng.cod[2] << 7); + } + else + i = 2; + len >>= 1; + for (str = eng.str; eng.end - str >= len; str = eng.str++) { + for (ptr = eng.cod + i, str = eng.str, k = len; + k > 0; k--, ptr += 2, str++) + if (ptr[0] != str[0] && ptr[1] != str[0]) + break; + if (k == 0) { + /* Substring found */ + ci = i + (len << 1); + si = str - eng.str; + goto match; + } + } + eng.so[0] = eng.end - eng.bas; + eng.str = eng.end; + goto fail; + + case Re_StringList: + /* Number of strings */ + k = eng.cod[1]; + + /* Where to jump after match */ + bas = eng.cod[2] | (eng.cod[3] << 8); + + str = eng.str; + ptr = eng.cod + k + 4; + l = eng.end - eng.str; + for (j = 0; j < k; j++) { + len = eng.cod[j + 4]; + if (len <= l) { + for (i = 0; i < len; i++) + if (ptr[i] != str[i]) + goto next_stl; + goto stl_match; + } +next_stl: + ptr += len; + } + goto fail; +stl_match: + ci = bas; + si = len; + goto match; + + case Re_CaseStringList: + /* Number of strings */ + k = eng.cod[1]; + + /* Where to jump after match */ + bas = eng.cod[2] | (eng.cod[3] << 8); + + str = eng.str; + ptr = eng.cod + k + 4; + l = eng.end - eng.str; + for (j = 0; j < k; j++) { + len = eng.cod[j + 4]; + if ((len >> 1) <= l) { + for (i = m = 0; i < len; m++, i += 2) + if (ptr[i] != str[m] && ptr[i + 1] != str[m]) + goto next_cstl; + goto cstl_match; + } +next_cstl: + ptr += len; + } + goto fail; +cstl_match: + ci = bas; + si = len >> 1; + goto match; + + + case Re_LargeStringList: + /* Where to jump after match */ + bas = eng.cod[1] | (eng.cod[2] << 8); + + str = eng.str; + + /* First entry in index map */ + ptr = eng.cod + 3; + i = (int)str[0] << 1; + j = ptr[i] | (ptr[i + 1] << 8); + if (j == 0xffff) + /* No entry with this byte */ + goto fail; + + /* Bytes left in input */ + l = eng.end - eng.str; + + /* First entry matching initial byte */ + ptr += 512 + j; + + for (len = ptr[0]; + str[0] == ptr[1]; + ptr += len + 1, len = ptr[0]) { + if (len <= l) { + for (i = 1; i < len; i++) { + if (ptr[i + 1] != str[i]) + goto next_lstl; + } + ci = bas; + si = len; + goto match; + } +next_lstl:; + } + goto fail; + + case Re_LargeCaseStringList: + /* Where to jump after match */ + bas = eng.cod[1] | (eng.cod[2] << 8); + + str = eng.str; + + /* First entry in index map */ + ptr = eng.cod + 3; + i = (int)str[0] << 1; + j = ptr[i] | (ptr[i + 1] << 8); + if (j == 0xffff) + /* No entry with this byte */ + goto fail; + + /* Bytes left in input */ + l = eng.end - eng.str; + + /* First entry matching initial byte */ + ptr += 512 + j; + + for (len = ptr[0]; + str[0] == ptr[1] || str[0] == ptr[2]; + ptr += len + 1, len = ptr[0]) { + if ((k = (len >> 1)) <= l) { + for (i = 2, j = 1; i < len; i += 2, j++) { + if (ptr[i + 1] != str[j] && ptr[i + 2] != str[j]) + goto next_lcstl; + } + ci = bas; + si = k; + goto match; + } +next_lcstl:; + } + goto fail; + + + /**************************************************** + * Character range matching * + ****************************************************/ + case Re_Range: + if (eng.str < eng.end && eng.cod[eng.str[0] + 1]) { + ci = 257; + goto match; + } + goto fail; + case Re_RangeNot: + if (eng.str >= eng.end || eng.cod[eng.str[0] + 1]) + goto fail; + ci = 257; + goto match; + + /**************************************************** + * Group handling * + ****************************************************/ + case Re_Open: + if (++eng.goff >= 9) + return (RE_ASSERT); + eng.gso[eng.goff] = eng.str - eng.bas; + ++eng.cod; + continue; + case Re_Close: + eng.geo[eng.goff] = eng.str - eng.bas; + ++eng.cod; + continue; + case Re_Update: + bas = eng.cod[1]; + eng.geo[eng.goff] = eng.str - eng.bas; + eng.cod += 2; /* + Update + bas */ + continue; + + /**************************************************** + * Backreference * + ****************************************************/ + case Re_Backref: + i = eng.cod[1]; + j = eng.gso[i]; + k = eng.geo[i]; + len = k - j; + if (k < j || eng.end - eng.str < len) + goto fail; + ptr = eng.bas + j; + str = eng.str; + for (l = len; l > 0; l--) { + if (*ptr++ != *str++) + goto fail; + } + ci = 2; + si = len; + goto match; + + /**************************************************** + * Alternatives handling * + ****************************************************/ + case Re_Alt: + bas = eng.off; + if (++eng.off >= MAX_DEPTH) + return (RE_ASSERT); + + /* Get offset of next alternative */ + i = eng.cod[1] | (eng.cod[2] << 8); + + /* Setup for next alternative if the current fails */ + eng.rcod[eng.off] = eng.cod + i + 1; /* + Alt */ + + /* If fail, test the next alternative in the same string */ + eng.rstr[eng.off] = eng.str; + + /* Setup match offsets */ + if (eng.so[bas] <= eng.eo[bas]) + eng.so[eng.off] = eng.eo[bas]; + else + eng.so[eng.off] = eng.so[bas]; + eng.sv[eng.off] = eng.eo[eng.off] = eng.so[eng.off] - 1; + + /* Save start of possible previous matches */ + eng.ss[eng.off] = eng.so[bas]; + + /* Skip code */ + eng.cod += 3; + + /* Go try the first alternative */ + continue; + + case Re_AltNext: + bas = eng.off - 1; + /* Check if matched and if it is a better match */ + if (eng.sv[eng.off] - eng.so[eng.off] < + eng.eo[eng.off] - eng.so[eng.off]) + eng.sv[eng.off] = eng.eo[eng.off]; + + /* Get offset of next alternative */ + i = eng.cod[1] | (eng.cod[2] << 8); + + /* Setup for next alternative if the current fails */ + eng.rcod[eng.off] = eng.cod + i + 1; /* + AltNext */ + + /* Setup match offset */ + eng.eo[eng.off] = eng.so[eng.off] - 1; + + /* Reset string for next alternative */ + eng.str = eng.rstr[eng.off]; + + /* Skip code */ + eng.cod += 3; + + /* Go try the next alternative */ + continue; + + case Re_AltDone: + bas = eng.off - 1; + /* Check if matched and if it is a better match */ + if (eng.sv[eng.off] - eng.so[eng.off] < + eng.eo[eng.off] - eng.so[eng.off]) + eng.sv[eng.off] = eng.eo[eng.off]; + + /* If there is a match */ + if (eng.sv[eng.off] >= eng.so[eng.off]) { + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + + /* Pop stack and skip code */ + --eng.off; + ++eng.cod; + + /* Go try next regular expression pattern */ + continue; + } + + /* Failed, reset string and pop stack */ + eng.str = eng.rstr[eng.off]; + --eng.off; + goto fail; + + + /**************************************************** + * Repetition * + ****************************************************/ + + /* Note that the repetition counter is not + * updated for *, + and ? + * it is easy to updated, but since it is not + * really useful, code to do it was removed + * to save a few cpu cicles. */ + +#define REPETITION_SETUP() \ + if (++eng.off >= MAX_DEPTH) \ + return (RE_ASSERT); \ + \ + /* Return here for recovery if match fail */ \ + eng.rcod[eng.off] = eng.cod; \ + \ + /* Setup match offsets */ \ + if (eng.so[bas] <= eng.eo[bas]) \ + eng.so[eng.off] = eng.eo[bas]; \ + else \ + eng.so[eng.off] = eng.so[bas]; \ + eng.ss[eng.off] = eng.so[bas]; \ + eng.sv[eng.off] = eng.eo[eng.off] = eng.so[eng.off] - 1;\ + \ + /* Skip repetition instruction */ \ + eng.cod += 4; + + + case Re_AnyTimes: + bas = eng.cod[1]; + if (eng.off == bas) { + /* First iteration */ + REPETITION_SETUP(); + } + else { + if (eng.eo[eng.off] >= eng.so[eng.off] && + eng.eo[eng.off] > eng.sv[eng.off]) { + /* Update offset of match */ + eng.sv[eng.off] = eng.eo[eng.off]; + + /* Skip repetition instruction */ + eng.cod += 4; + } + else { + /* Match failed but it is ok */ + len = eng.cod[2] | (eng.cod[3] << 8); + eng.so[bas] = eng.ss[eng.off]; + if (eng.sv[eng.off] >= eng.so[eng.off]) + /* Something matched earlier, update */ + eng.eo[bas] = eng.sv[eng.off]; + else if (eng.eo[bas] < eng.so[bas]) + /* Empty match */ + eng.eo[bas] = eng.so[bas]; + + /* Try next pattern at correct offset */ + eng.str = eng.bas + eng.eo[bas]; + + /* Pop stack and skip code */ + --eng.off; + eng.cod += len; + } + } + continue; + + case Re_Maybe: + bas = eng.cod[1]; + if (eng.off == bas) { + /* First iteration */ + REPETITION_SETUP(); + } + else { + /* Matched or first iteration is done */ + len = eng.cod[2] | (eng.cod[3] << 8); + eng.so[bas] = eng.ss[eng.off]; + if (eng.eo[eng.off] > eng.so[eng.off]) { + /* Something matched earlier, update */ + eng.eo[bas] = eng.eo[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + /* Don't need to update counter */ + } + else { + /* Empty match */ + if (eng.eo[bas] < eng.so[bas]) + eng.eo[bas] = eng.so[bas]; + + /* Try next pattern at correct offset */ + eng.str = eng.bas + eng.eo[bas]; + } + + /* Pop stack */ + --eng.off; + + /* Skip code */ + eng.cod += len; + } + continue; + + case Re_AtLeast: + bas = eng.cod[1]; + if (eng.off == bas) { + /* First iteration */ + REPETITION_SETUP(); + } + else { + if (eng.eo[eng.off] >= eng.so[eng.off] && + eng.eo[eng.off] > eng.sv[eng.off]) { + /* Update offset of match */ + eng.sv[eng.off] = eng.eo[eng.off]; + + /* Skip repetition instruction */ + eng.cod += 4; + } + else { + /* Last try failed */ + len = eng.cod[2] | (eng.cod[3] << 8); + if (eng.sv[eng.off] >= eng.so[eng.off]) { + /* Something matched earlier, update */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + } + else { + /* Do it here, so that the fail label does + * not need to do too expensive work for + * simple patterns. */ + eng.so[bas] = eng.str - eng.bas; + + /* Zero matches, pop stack and restart */ + --eng.off; + goto fail; + } + + /* Pop stack and skip code */ + --eng.off; + eng.cod += len; + } + } + continue; + + + /**************************************************** + * Repetition with arguments * + ****************************************************/ + case Re_Exact: +#define COMPLEX_REPETITION_SETUP_0() \ + i = eng.cod[1]; \ + bas = eng.cod[2]; + +#define COMPLEX_REPETITION_SETUP() \ + /* First iteration */ \ + if (++eng.off >= MAX_DEPTH) \ + return (RE_ASSERT); \ + \ + /* Remeber number or repetitions */ \ + eng.re[eng.off] = 0; \ + \ + /* Return here for recovery if match fail */ \ + eng.rcod[eng.off] = eng.cod; \ + \ + /* Setup match offsets */ \ + if (eng.so[bas] <= eng.eo[bas]) \ + eng.so[eng.off] = eng.eo[bas]; \ + else \ + eng.so[eng.off] = eng.so[bas]; \ + eng.sv[eng.off] = eng.eo[eng.off] = eng.so[eng.off] - 1;\ + eng.ss[eng.off] = eng.so[bas]; \ + \ + /* Skip repetition instruction */ \ + eng.cod += 5; + + COMPLEX_REPETITION_SETUP_0(); + if (eng.off == bas) { + /* First iteration */ + COMPLEX_REPETITION_SETUP(); + } + else { + if (eng.eo[eng.off] >= eng.so[eng.off] && + eng.eo[eng.off] > eng.sv[eng.off]) { + /* Update offset of match */ + eng.sv[eng.off] = eng.eo[eng.off]; + + /* Update repetition counter */ + if (++eng.re[eng.off] == i) { + /* Matched the required times */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + + /* Update code */ + k = eng.cod[3] | (eng.cod[4] << 8); + eng.cod += k; + + /* Pop stack and go for next pattern */ + --eng.off; + continue; + } + + /* Skip repetition instruction */ + eng.cod += 5; + } + else { + /* Do it here, so that the fail label does + * not need to do too expensive work for + * simple patterns. */ + eng.so[bas] = eng.str - eng.bas; + + /* Pop stack and restart */ + --eng.off; + goto fail; + } + } + continue; + + case Re_Min: + COMPLEX_REPETITION_SETUP_0(); + if (eng.off == bas) { + /* First iteration */ + COMPLEX_REPETITION_SETUP(); + } + else { + if (eng.eo[eng.off] >= eng.so[eng.off] && + eng.eo[eng.off] > eng.sv[eng.off]) { + /* Update offset of match */ + eng.sv[eng.off] = eng.eo[eng.off]; + + /* Update repetition counter */ + ++eng.re[eng.off]; + + /* Skip repetition instruction and try again */ + eng.cod += 5; + } + else { + /* Match failed! */ + if (eng.re[eng.off] < i) { + /* Do it here, so that the fail label does + * not need to do too expensive work for + * simple patterns. */ + eng.so[bas] = eng.str - eng.bas; + + /* Didn't match required number of times */ + --eng.off; + goto fail; + } + else { + /* Matched minimum number of times */ + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + k = eng.cod[3] | (eng.cod[4] << 8); + + /* Update code and pop stack */ + eng.cod += k; + --eng.off; + } + } + } + continue; + + case Re_Max: + COMPLEX_REPETITION_SETUP_0(); + if (eng.off == bas) { + /* First iteration */ + COMPLEX_REPETITION_SETUP(); + } + else { + if (eng.eo[eng.off] >= eng.so[eng.off] && + eng.eo[eng.off] > eng.sv[eng.off]) { + /* Update offset of match */ + eng.sv[eng.off] = eng.eo[eng.off]; + + /* Update repetition counter */ + if (++eng.re[eng.off] == i) { + /* Matched the maximum times */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + + k = eng.cod[3] | (eng.cod[4] << 8); + + /* Update code and pop stack */ + eng.cod += k; + --eng.off; + continue; + } + + /* Skip repetition instruction and try again */ + eng.cod += 5; + } + else { + /* No matches, but zero matches are ok */ + k = eng.cod[3] | (eng.cod[4] << 8); + if (eng.sv[eng.off] >= eng.so[eng.off]) { + /* Something matched earlier, update */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + } + else { + /* Empty match */ + if (eng.eo[bas] < eng.so[bas]) + eng.eo[bas] = eng.so[bas]; + + /* Try next pattern at correct offset */ + eng.str = eng.bas + eng.eo[bas]; + } + + /* Pop stack and update code */ + --eng.off; + eng.cod += k; + } + } + continue; + + case Re_MinMax: + bas = eng.cod[3]; + if (eng.off == bas) { + /* First iteration */ + COMPLEX_REPETITION_SETUP(); + } + else { + if (eng.eo[eng.off] >= eng.so[eng.off] && + eng.eo[eng.off] > eng.sv[eng.off]) { + /* Update offset of match */ + eng.sv[eng.off] = eng.eo[eng.off]; + + /* Update repetition counter */ + if (++eng.re[eng.off] == eng.cod[2]) { + /* Matched the maximum times */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + k = eng.cod[4] | (eng.cod[5] << 8); + + /* Update code and pop stack */ + eng.cod += k; + --eng.off; + continue; + } + + /* Skip repetition instruction and try again */ + eng.cod += 6; + } + else { + /* Match failed! */ + if (eng.re[eng.off] < eng.cod[1]) { + /* Do it here, so that the fail label does + * not need to do too expensive work for + * simple patterns. */ + eng.so[bas] = eng.str - eng.bas; + + /* Didn't match required number of times */ + --eng.off; + goto fail; + } + else { + /* Matched minimum number of times */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.sv[eng.off]; + eng.str = eng.bas + eng.eo[bas]; + k = eng.cod[4] | (eng.cod[5] << 8); + + /* Update code and pop stack */ + eng.cod += k; + --eng.off; + } + } + } + continue; + + + /**************************************************** + * Special repetition handling * + ****************************************************/ + case Re_AnyAnyTimes: + /* code(1) + bas(1) + gbas(1) + jump(2) */ + bas = eng.cod[1]; + if (eng.off == bas) { + /* First iteration */ + if (++eng.off >= MAX_DEPTH) + return (RE_ASSERT); + + /* Return here for recovery if match fail */ + eng.rcod[eng.off] = eng.cod; + + /* If fail, test the next pattern at the same point */ + eng.rstr[eng.off] = eng.str; + + /* Setup match offsets */ + eng.so[eng.off] = eng.str - eng.bas; + eng.eo[eng.off] = eng.so[eng.off] - 1; + + if (newline) + /* Use the repetition counter to store start of + * skipped string, to later check if skipping a + * newline. */ + eng.re[eng.off] = eng.so[eng.off]; + + /* Save start of possible previous matches */ + eng.ss[eng.off] = eng.so[bas]; + + /* Skip repetition instruction */ + eng.cod += 5; + } + else { + /* -1 as an unsigned char */ + if (eng.cod[2] != 0xff) + eng.goff = eng.cod[2]; + else + eng.goff = -1; + + if (newline) { + ptr = eng.bas + eng.re[eng.off]; + str = eng.bas + eng.so[eng.off]; + for (; ptr < str; ptr++) + if (*ptr == '\n') { + eng.cod = eng.rcod[0]; + eng.so[0] = ptr - eng.bas + 1; + eng.eo[0] = eng.so[0] - 1; + eng.rstr[0] = eng.str = ptr + 1; + eng.off = 0; + goto reset; + } + /* If looping, don't do too many noops */ + eng.re[eng.off] = ptr - eng.bas; + } + + if (eng.eo[eng.off] >= eng.so[eng.off]) { + /* Note that this is only true if all possibly + * nested special repetitions also matched. */ + + if (eng.goff >= 0) { + if (eng.cod[5] == Re_Update) + eng.gso[eng.goff] = eng.eo[bas] + + (eng.so[bas] > eng.eo[bas]); + else if (eng.geo[eng.goff] < eng.so[eng.off]) + eng.geo[eng.goff] = eng.so[eng.off]; + } + + /* Jump relative offset */ + len = eng.cod[3] | (eng.cod[4] << 8); + + /* Restore offset from where started trying */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.eo[eng.off]; + + /* Pop stack and skip code */ + --eng.off; + eng.cod += len; + } + else { + /* Only give up if the entire string was scanned */ + if (eng.str < eng.end) { + /* Update restart point for next pattern */ + eng.str = ++eng.rstr[eng.off]; + + /* Reset start of nested match */ + eng.so[eng.off] = eng.str - eng.bas; + + /* Skip repetition instruction */ + eng.cod += 5; + } + else { + /* Entire string scanned and failed */ + + /* Jump relative offset */ + len = eng.cod[3] | (eng.cod[4] << 8); + + /* Restore offset from where started trying */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.ss[eng.off] - 1; + + /* Pop stack and skip code */ + --eng.off; + eng.cod += len; + } + } + } + continue; + + /* This is significantly different than matching .* + * because it may need to restart several times since it is + * possible to find too many false positives, for example: + * a.*b => once one "a" is found, scan all + * the remaining string searching for a "b" + * a.?b => the string may have too many "a"s, but the + * first occurrences of "a" may not be followed + * by any-character and a "b" or a single "b". + */ + case Re_AnyMaybe: + bas = eng.cod[1]; + if (eng.off == bas) { + /* First iteration */ + if (++eng.off >= MAX_DEPTH) + return (RE_ASSERT); + + /* Return here for recovery if match fail */ + eng.rcod[eng.off] = eng.cod; + + /* First try without eating a byte */ + eng.rstr[eng.off] = eng.str; + + /* Remember this is the first try if match fail */ + eng.re[eng.off] = 0; + + /* Setup match offsets */ + eng.so[eng.off] = eng.str - eng.bas; + eng.eo[eng.off] = eng.so[eng.off] - 1; + + /* Save start of possible previous matches */ + eng.ss[eng.off] = eng.so[bas]; + + /* Skip repetition instruction */ + eng.cod += 5; + } + else { + /* -1 as an unsigned char */ + if (eng.cod[2] != 0xff) + eng.goff = eng.cod[2]; + else + eng.goff = -1; + + if (eng.eo[eng.off] >= eng.so[eng.off]) { + /* Something matched */ + + if (eng.goff >= 0) { + if (eng.cod[5] == Re_Update) + eng.gso[eng.goff] = eng.eo[bas] + + (eng.so[bas] > eng.eo[bas]); + else if (eng.geo[eng.goff] < eng.so[eng.off]) + eng.geo[eng.goff] = eng.so[eng.off]; + } + + /* Jump relative offset */ + len = eng.cod[3] | (eng.cod[4] << 8); + + /* Update offset of match */ + eng.eo[bas] = eng.eo[eng.off]; + + /* Pop stack and skip code */ + --eng.off; + eng.cod += len; + } + else if (eng.re[eng.off] == 0 && + (!newline || eng.rstr[eng.off][1] != '\n')) { + /* Try this time skiping a byte */ + ++eng.re[eng.off]; + + /* Reset string, skip code and go try one time more */ + eng.str = ++eng.rstr[eng.off]; + eng.cod += 5; + } + else { + /* Failed to match */ + + /* Update offsets */ + eng.eo[bas] = eng.ss[eng.off]; + eng.so[bas] = eng.eo[bas] + 1; + + eng.str = eng.rstr[eng.off] + (eng.re[eng.off] == 0); + + /* Pop stack and return to toplevel code */ + --eng.off; + if (eng.str >= eng.end) + goto wont; + eng.cod = eng.rcod[bas]; + } + } + continue; + + /* .+ almost identical to .* but requires eating at least one byte */ + case Re_AnyAtLeast: + bas = eng.cod[1]; + if (eng.off == bas) { + /* First iteration */ + if (++eng.off >= MAX_DEPTH) + return (RE_ASSERT); + + /* Return here for recovery if match fail */ + eng.rcod[eng.off] = eng.cod; + + /* Skip one byte for the restart string */ + if (newline && eng.str[0] == '\n') { + /* Cannot skip newline */ + eng.cod = eng.rcod[0]; + eng.rstr[0] = ++eng.str; + eng.so[0] = eng.str - eng.bas; + eng.eo[0] = eng.so[0] - 1; + eng.off = 0; + goto reset; + } + eng.rstr[eng.off] = ++eng.str; + + /* Setup match offsets */ + eng.so[eng.off] = eng.str - eng.bas; + eng.eo[eng.off] = eng.so[eng.off] - 1; + + if (newline) + /* Use the repetition counter to store start of + * skipped string, to later check if skipping a + * newline. */ + eng.re[eng.off] = eng.so[eng.off]; + + /* Save start of possible previous matches */ + eng.ss[eng.off] = eng.so[bas]; + + /* Skip repetition instruction */ + eng.cod += 5; + } + else { + /* -1 as an unsigned char */ + if (eng.cod[2] != 0xff) + eng.goff = eng.cod[2]; + else + eng.goff = -1; + + if (newline) { + ptr = eng.bas + eng.re[eng.off]; + str = eng.bas + eng.so[eng.off]; + for (; ptr < str; ptr++) + if (*ptr == '\n') { + eng.cod = eng.rcod[0]; + eng.so[0] = ptr - eng.bas + 1; + eng.eo[0] = eng.so[0] - 1; + eng.rstr[0] = eng.str = ptr + 1; + eng.off = 0; + goto reset; + } + /* If looping, don't do too many noops */ + eng.re[eng.off] = ptr - eng.bas; + } + + if (eng.eo[eng.off] >= eng.so[eng.off]) { + /* Note that this is only true if all possibly + * nested special repetitions also matched. */ + + if (eng.goff >= 0) { + if (eng.cod[5] == Re_Update) + eng.gso[eng.goff] = eng.eo[bas] + + (eng.so[bas] > eng.eo[bas]); + else if (eng.geo[eng.goff] < eng.so[eng.off]) + eng.geo[eng.goff] = eng.so[eng.off]; + } + + /* Jump relative offset */ + len = eng.cod[3] | (eng.cod[4] << 8); + + /* Restore offset from where started trying */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.eo[eng.off]; + + /* Pop stack and skip code */ + --eng.off; + eng.cod += len; + } + else { + /* Only give up if the entire string was scanned */ + if (eng.str < eng.end) { + /* Update restart point for next pattern */ + eng.str = ++eng.rstr[eng.off]; + + /* Reset start of nested match */ + eng.so[eng.off] = eng.str - eng.bas; + + /* Skip repetition instruction */ + eng.cod += 5; + } + else { + /* Entire string scanned and failed */ + + /* Jump relative offset */ + len = eng.cod[3] | (eng.cod[4] << 8); + + /* Restore offset from where started trying */ + eng.so[bas] = eng.ss[eng.off]; + eng.eo[bas] = eng.ss[eng.off] - 1; + + /* Pop stack and skip code */ + --eng.off; + eng.cod += len; + } + } + } + continue; + + + /**************************************************** + * Repetition matched! * + ****************************************************/ + case Re_RepJump: + /* eng.cod[1] is toplevel offset of repetition */ + if (eng.off > eng.cod[1]) + /* If still needs to try matches */ + eng.cod -= eng.cod[2]; + else + eng.cod += 3; /* + RepJump + bas + len-size */ + continue; + + case Re_RepLongJump: + /* eng.cod[1] is toplevel offset of repetition */ + if (eng.off > eng.cod[1]) + /* If still needs to try matches */ + eng.cod -= eng.cod[2] | (eng.cod[3] << 8); + else + eng.cod += 4; /* + RepLongJump + bas + len-size */ + continue; + + /**************************************************** + * Finished * + ****************************************************/ + case Re_DoneIf: + if (eng.eo[eng.off] >= eng.so[eng.off]) { + eng.so[0] = eng.ss[eng.off]; + eng.eo[0] = eng.eo[eng.off]; + goto done; + } + ++eng.cod; + continue; + case Re_MaybeDone: + if (eng.eo[eng.off] >= eng.so[eng.off]) { + eng.so[0] = eng.ss[eng.off]; + eng.eo[0] = eng.eo[eng.off]; + goto done; + } + ++eng.cod; + continue; + case Re_Done: + goto done; + + default: + /* Fatal internal error */ + return (RE_ASSERT); + } + + +wont: + /* Surely won't match */ + if (eng.off == 0) { + eng.eo[0] = eng.so[0] - 1; + break; + } + + +fail: + if (eng.off == 0) { + /* If the entire string scanned */ + if (++eng.str > eng.end) { + eng.eo[0] = eng.so[0] - 1; + break; + } + eng.goff = -1; + /* Update start of possible match after restart */ + if (eng.eo[0] >= eng.so[0]) { + /* If first fail */ + eng.str = eng.rstr[0]; + ++eng.rstr[0]; + eng.so[0] = eng.str - eng.bas; + eng.eo[0] = eng.so[eng.off] - 1; + } + else + /* Just trying at next byte */ + ++eng.so[0]; + } + else + /* Remember this match failed */ + eng.eo[eng.off] = eng.so[eng.off] - 1; + + /* Restart code */ + eng.cod = eng.rcod[eng.off]; + continue; + + +match: + /* If first match */ + if (eng.eo[eng.off] < eng.so[eng.off]) { + if (eng.off == 0) + eng.rstr[0] = eng.str + 1; + eng.so[eng.off] = eng.eo[eng.off] = eng.str - eng.bas; + } + eng.eo[eng.off] += si; + eng.cod += ci; + eng.str += si; + ci = si = 1; + continue; + +done: + break; + } + + if (nmatch) { + if (flags & RE_STARTEND) + len = pmat[0].rm_so; + else + len = 0; + if (!nosub) { + if (preg->cod[1] != 0xff) + eng.goff = preg->cod[1]; + pmat[0].rm_so = eng.so[0]; + pmat[0].rm_eo = eng.eo[0]; + for (i = 1; i < nmatch; i++) { + if (i - 1 <= eng.goff) { + pmat[i].rm_so = eng.gso[i - 1]; + pmat[i].rm_eo = eng.geo[i - 1]; + } + else { + pmat[i].rm_so = 0; + pmat[i].rm_eo = -1; + } + } + if (len) { + /* Update offsets, since the match was done in a substring */ + j = eng.goff + 2 > nmatch ? nmatch : eng.goff + 2; + for (i = 0; i < j; i++) { + pmat[i].rm_so += len; + pmat[i].rm_eo += len; + } + } + } + else { + /* Already know these values, allow compiling the regex with + * RE_NOSUB to use parenthesis only for grouping, but avoiding + * the runtime overhead of keeping track of the subexpression + * offsets. */ + pmat[0].rm_so = eng.so[0] + len; + pmat[0].rm_eo = eng.eo[0] + len; + } + } + + return (eng.so[0] <= eng.eo[0] ? 0 : RE_NOMATCH); +} + +int +reerror(int ecode, const re_cod *preg, char *ebuffer, int ebuffer_size) +{ + static char *errors[] = { + "No error", + "Failed to match", /* NOMATCH */ + + /* Errors not generated */ + "Invalid regular expression", /* BADPAT */ + "Invalid collating element", /* ECOLLATE */ + "Invalid character class", /* ECTYPE */ + + "`\' applied to unescapable character", /* EESCAPE */ + "Invalid backreference number", /* ESUBREG */ + "Brackets `[ ]' not balanced", /* EBRACK */ + "Parentheses `( )' not balanced", /* EPAREN */ + "Braces `{ }' not balanced", /* EBRACE */ + "Invalid repetition count(s) in `{ }'", /* BADBR */ + "Invalid character range in `[ ]'", /* ERANGE */ + "Out of memory", /* ESPACE */ + "`?', `*', or `+' operand invalid", /* BADRPT */ + "Empty (sub)expression", /* EMPTY */ + "Assertion error - you found a bug", /* ASSERT */ + "Invalid argument" /* INVARG */ + }; + char *str; + + if (ecode >= 0 && ecode < sizeof(errors) / sizeof(errors[0])) + str = errors[ecode]; + else + str = "Unknown error"; + + return (snprintf(ebuffer, ebuffer_size, "%s", str)); +} + +void +refree(re_cod *cod) +{ + free(cod->cod); + cod->cod = NULL; +} + +static void +reinit(void) +{ + int i; + static int first = 1; + + if (!first) + return; + first = 0; + + re__alnum['_'] = 1; + + for (i = '0'; i <= '7'; i++) + re__alnum[i] = re__odigit[i] = re__ddigit[i] = re__xdigit[i] = 1; + + for (; i <= '9'; i++) + re__alnum[i] = re__ddigit[i] = re__xdigit[i] = 1; + + for (i = 'a'; i <= 'f'; i++) + re__alnum[i] = re__xdigit[i] = 1; + for (; i <= 'z'; i++) + re__alnum[i] = 1; + + for (i = 'A'; i <= 'F'; i++) + re__alnum[i] = re__xdigit[i] = 1; + for (; i <= 'Z'; i++) + re__alnum[i] = 1; + + for (i = 1; i < 32; i++) + re__control[i] = 1; + re__control[127] = 1; + /* Don't show tabs as control characters */ + re__control['\t'] = 0; +} + +static int +rec_check(re_inf *inf, int count) +{ + if (inf->len + count >= inf->spc) { + int spc; + unsigned char *cod; + + if ((spc = (count % 64)) != 0) + spc = 64 - spc; + spc += count + inf->spc; + if ((cod = realloc(inf->cod, spc)) == NULL) + return (inf->ecode = RE_ESPACE); + inf->cod = cod; + inf->spc = spc; + } + + return (inf->ecode); +} + +static int +rec_code(re_inf *inf, ReCode code) +{ + if (rec_check(inf, 1) == 0) + inf->cod[inf->len++] = code; + + return (inf->ecode); +} + +static int +rec_byte(re_inf *inf, int value) +{ + if (rec_check(inf, 1) == 0) + inf->cod[inf->len++] = value; + + return (inf->ecode); +} + +static int +rec_code_byte(re_inf *inf, ReCode code, int value) +{ + if (rec_check(inf, 2) == 0) { + inf->cod[inf->len++] = code; + inf->cod[inf->len++] = value; + } + + return (inf->ecode); +} + +static int +rec_length(re_inf *inf, int length) +{ + int lo, hi, two; + + if (length >= 16384) + return (inf->ecode = RE_ESPACE); + + lo = length & 0xff; + hi = length & 0xff00; + two = ((length > 0x7f) != 0) + 1; + if (two == 2) { + hi <<= 1; + hi |= (lo & 0x80) != 0; + lo |= 0x80; + } + + if (rec_check(inf, two) == 0) { + inf->cod[inf->len++] = lo; + if (two == 2) + inf->cod[inf->len++] = hi >> 8; + } + + return (inf->ecode); +} + +static int +rec_byte_byte(re_inf *inf, int value0, int value1) +{ + if (rec_check(inf, 2) == 0) { + inf->cod[inf->len++] = value0; + inf->cod[inf->len++] = value1; + } + + return (inf->ecode); +} + +static int +rec_code_byte_byte(re_inf *inf, ReCode code, int value0, int value1) +{ + if (rec_check(inf, 3) == 0) { + inf->cod[inf->len++] = code; + inf->cod[inf->len++] = value0; + inf->cod[inf->len++] = value1; + } + + return (inf->ecode); +} + +static int +rec_build_alt(re_inf *inf, rec_alt *alt) +{ + int offset, value, bas = inf->bas + 1; + + if (alt) { + if (alt->next) { + if (rec_inc_spc(inf)) + return (inf->ecode); + + /* A real a list of alternatives */ + rec_code(inf, Re_Alt); + + offset = inf->len; /* Remember current offset */ + rec_byte_byte(inf, 0, 0); /* Reserve two bytes for retry address */ + while (alt && inf->ecode == 0) { + if (rec_build_pat(inf, alt->pat)) + break; + alt = alt->next; + if (alt && inf->ecode == 0) { + /* Handle (hyper)complex repetitions */ + if (inf->bas != bas) { + /* Duplicate patterns up to end of expression */ + rec_build_pat(inf, inf->apat); + /* Restore engine state for next alternative(s) */ + rec_alt_spc(inf, bas - 1); + } + + /* If the jump would be so long */ + if ((value = inf->len - offset) >= 16384) { + inf->ecode = RE_ESPACE; + break; + } + inf->cod[offset] = value & 0xff; + inf->cod[offset + 1] = (value & 0xff00) >> 8; + + rec_code(inf, Re_AltNext); + offset = inf->len; + rec_byte_byte(inf, 0, 0); + } + } + if (inf->ecode == 0) { + /* Handle (hyper)complex repetitions */ + if (inf->bas != bas) { + /* Duplicate patterns up to end of expression */ + rec_build_pat(inf, inf->apat); + /* Restore engine state for next alternative(s) */ + rec_alt_spc(inf, bas - 1); + } + + /* If the jump would be so long */ + if ((value = inf->len - offset) >= 16384) + return (inf->ecode = RE_ESPACE); + inf->cod[offset] = value & 0xff; + inf->cod[offset + 1] = (value & 0xff00) >> 8; + /* Last jump is here */ + rec_code(inf, Re_AltDone); + } + rec_dec_spc(inf); + } + else + /* Single alternative */ + rec_build_pat(inf, alt->pat); + } + + return (inf->ecode); +} + +static int +rec_build_pat(re_inf *inf, rec_pat *pat) +{ + rec_pat *apat; + int length, offset = 0, distance, jump = 0, bas = 0; + + while (pat && inf->ecode == 0) { + if (pat->rep) { + bas = inf->bas; + if (pat->type == Rep_Group && !inf->par && rec_code(inf, Re_Open)) + return (inf->ecode); + if (rec_inc_spc(inf)) + return (inf->ecode); + offset = inf->len; + if (rec_build_rep(inf, pat->rep)) + break; + /* Reserve space to jump after repetition done */ + jump = inf->len; + rec_byte_byte(inf, 0, 0); + } + switch (pat->type) { + case Rep_AnyAnyTimes: + case Rep_AnyMaybe: + case Rep_AnyAtLeast: + if (rec_add_spc(inf, pat->type == Rep_AnyMaybe)) + return (inf->ecode); + if (rec_code(inf, (ReCode)pat->type) == 0 && + rec_byte(inf, inf->bas - 1) == 0 && + rec_byte(inf, inf->ref - 1) == 0) + rec_off_spc(inf); + break; + case Rep_Literal: + case Rep_LiteralNot: + case Rep_SearchLiteral: + rec_code_byte(inf, (ReCode)pat->type, pat->data.chr); + break; + case Rep_CaseLiteral: + case Rep_CaseLiteralNot: + case Rep_SearchCaseLiteral: + rec_code_byte_byte(inf, (ReCode)pat->type, + pat->data.cse.lower, pat->data.cse.upper); + break; + case Rep_Range: + case Rep_RangeNot: + if (rec_code(inf, (ReCode)pat->type) == 0) + rec_build_rng(inf, pat->data.rng); + break; + case Rep_String: + case Rep_SearchString: + case Rep_CaseString: + case Rep_SearchCaseString: + rec_code(inf, (ReCode)pat->type); + length = strlen((char*)pat->data.str); + if (rec_length(inf, length) == 0 && rec_check(inf, length) == 0) { + memcpy(inf->cod + inf->len, pat->data.str, length); + inf->len += length; + } + break; + case Rep_Any: + case Rep_AnyEatAnyTimes: + case Rep_AnyEatMaybe: + case Rep_AnyEatAtLeast: + case Rep_Odigit: + case Rep_OdigitNot: + case Rep_Digit: + case Rep_DigitNot: + case Rep_Xdigit: + case Rep_XdigitNot: + case Rep_Space: + case Rep_SpaceNot: + case Rep_Tab: + case Rep_Newline: + case Rep_Lower: + case Rep_Upper: + case Rep_Alnum: + case Rep_AlnumNot: + case Rep_Control: + case Rep_ControlNot: + case Rep_Bol: + case Rep_Eol: + case Rep_Bow: + case Rep_Eow: + rec_code(inf, (ReCode)pat->type); + break; + case Rep_Backref: + rec_code_byte(inf, Re_Backref, pat->data.chr); + break; + case Rep_Group: + if (pat->rep == NULL && !inf->par && rec_code(inf, Re_Open)) + break; + apat = inf->apat; + inf->apat = pat->next; + rec_build_grp(inf, pat->data.grp); + inf->apat = apat; + break; + case Rep_StringList: + rec_build_stl(inf, pat->data.stl); + break; + } + if (pat->rep) { +#if 0 + if (rec_dec_spc(inf)) + return (inf->ecode); +#else + if (rec_rep_spc(inf, bas)) + return (inf->ecode); +#endif + distance = inf->len - offset; + if (distance > 255) { + if (rec_code(inf, Re_RepLongJump) || + rec_byte(inf, inf->bas) || + rec_byte(inf, distance & 0xff) || + rec_byte(inf, (distance & 0xff00) >> 8)) + break; + } + else if (rec_code(inf, Re_RepJump) || + rec_byte(inf, inf->bas) || + rec_byte(inf, distance)) + break; + distance = inf->len - offset; + inf->cod[jump] = distance & 0xff; + inf->cod[jump + 1] = (distance & 0xff00) >> 8; + } + pat = pat->next; + } + + return (inf->ecode); +} + +static int +rec_build_rng(re_inf *inf, rec_rng *rng) +{ + if (rec_check(inf, sizeof(rng->range)) == 0) { + memcpy(inf->cod + inf->len, rng->range, sizeof(rng->range)); + inf->len += sizeof(rng->range); + } + + return (inf->ecode); +} + +static int +rec_build_grp(re_inf *inf, rec_grp *grp) +{ + int par = inf->par; + + if (!(inf->flags & RE_NOSUB)) { + ++inf->par; + if (par == 0) + ++inf->ref; + if (rec_build_alt(inf, grp->alt) == 0) { + if (par == 0) { + if (grp->comp) + rec_code_byte(inf, Re_Update, inf->ref - 1); + else + rec_code(inf, Re_Close); + } + } + --inf->par; + } + else + rec_build_alt(inf, grp->alt); + + return (inf->ecode); +} + +static int +rec_build_stl(re_inf *inf, rec_stl *stl) +{ + int i, len, rlen; + ReCode code; + + /* Calculate jump distance information */ + rlen = stl->tlen + stl->nstrs + 4; + /* + code + nstrs + place-offset + data-length */ + + if (stl->nstrs >= LARGE_STL_COUNT) { + rlen += 511; /* Don't write number of strings */ + code = stl->type == Rep_StringList ? + Re_LargeStringList : Re_LargeCaseStringList; + } + else + code = (ReCode)stl->type; + + if (rlen >= 16386) + return (inf->ecode = RE_ESPACE); + if (rec_check(inf, rlen) || + rec_code(inf, code)) + return (inf->ecode); + + /* Space is allocated, just write the data */ + if (stl->nstrs < LARGE_STL_COUNT) + inf->cod[inf->len++] = stl->nstrs; + + inf->cod[inf->len++] = rlen & 0xff; + inf->cod[inf->len++] = (rlen & 0xff00) >> 8; + + if (stl->nstrs < LARGE_STL_COUNT) { + for (i = 0; i < stl->nstrs; i++) + inf->cod[inf->len++] = stl->lens[i]; + for (i = 0; i < stl->nstrs; i++) { + len = stl->lens[i]; + if (len > 2) { + memcpy(inf->cod + inf->len, stl->strs[i], len); + inf->len += len; + } + else { + if (len == 1) + inf->cod[inf->len++] = (long)stl->strs[i]; + else { + inf->cod[inf->len++] = (long)stl->strs[i] & 0xff; + inf->cod[inf->len++] = ((long)stl->strs[i] & 0xff00) >> 8; + } + } + } + } + else { + /* The string length goes before the string itself */ + int j, chl, chu; + + /* Fill everything with an invalid jump address */ + memset(inf->cod + inf->len, 0xff, 512); + for (i = len = 0, j = -1; i < stl->nstrs; i++) { + chl = stl->lens[i] > 2 ? stl->strs[i][0] : (long)stl->strs[i] & 0xff; + if (chl != j) { + inf->cod[inf->len + (chl << 1)] = len & 0xff; + inf->cod[inf->len + (chl << 1) + 1] = (len & 0xff00) >> 8; + if (code == Re_LargeCaseStringList) { + chu = stl->lens[i] > 2 ? + stl->strs[i][1] : ((long)(stl->strs[i]) & 0xff00) >> 8; + inf->cod[inf->len + (chu << 1)] = len & 0xff; + inf->cod[inf->len + (chu << 1) + 1] = (len & 0xff00) >> 8; + } + j = chl; + } + len += stl->lens[i] + 1; + } + inf->len += 512; + + for (i = 0; i < stl->nstrs; i++) { + len = stl->lens[i]; + inf->cod[inf->len++] = len; + if (len > 2) { + memcpy(inf->cod + inf->len, stl->strs[i], len); + inf->len += len; + } + else { + if (len == 1) + inf->cod[inf->len++] = (long)stl->strs[i]; + else { + inf->cod[inf->len++] = (long)stl->strs[i] & 0xff; + inf->cod[inf->len++] = ((long)stl->strs[i] & 0xff00) >> 8; + } + } + } + } + + return (inf->ecode); +} + +static int +rec_build_rep(re_inf *inf, rec_rep *rep) +{ + if (rep) { + switch (rep->type) { + case Rer_AnyTimes: + case Rer_AtLeast: + case Rer_Maybe: + rec_code(inf, (ReCode)rep->type); + break; + case Rer_Exact: + if (rec_code(inf, Re_Exact) == 0) + rec_byte(inf, rep->mine); + break; + case Rer_Min: + if (rec_code(inf, Re_Min) == 0) + rec_byte(inf, rep->mine); + break; + case Rer_Max: + if (rec_code(inf, Re_Max) == 0) + rec_byte(inf, rep->maxc); + break; + case Rer_MinMax: + if (rec_code(inf, Re_MinMax) == 0 && + rec_byte(inf, rep->mine) == 0) + rec_byte(inf, rep->maxc); + break; + } + /* It is incremented in rec_build_pat */ + rec_byte(inf, inf->bas - 1); + } + + return (inf->ecode); +} + +static int +rec_inc_spc(re_inf *inf) +{ + if (++inf->bas >= MAX_DEPTH) + return (inf->ecode = RE_ESPACE); + + return (inf->ecode); +} + +static int +rec_dec_spc(re_inf *inf) +{ + if (--inf->bas < 0) + return (inf->ecode = RE_ASSERT); + + return (inf->ecode); +} + +static int +rec_add_spc(re_inf *inf, int maybe) +{ + if (++inf->bas >= MAX_DEPTH) + return (inf->ecode = RE_ESPACE); + inf->sp[inf->bas] = maybe + 1; + + return (inf->ecode); +} + +/* Could be joined with rec_rep_spc, code almost identical */ +static int +rec_alt_spc(re_inf *inf, int top) +{ + int distance, i, bas = inf->bas; + + while ((inf->bas > top) && inf->sp[inf->bas]) { + /* Jump to this repetition for cleanup */ + distance = inf->len - inf->sr[inf->bas]; + + /* This will generate a jump to a jump decision opcode */ + inf->sj[inf->bas] = inf->len; + + if (distance > 255) { + if (rec_code(inf, Re_RepLongJump) || + rec_byte(inf, inf->bas - 1) || + rec_byte(inf, distance & 0xff) || + rec_byte(inf, (distance & 0xff00) >> 8)) + break; + } + else if (rec_code(inf, Re_RepJump) || + rec_byte(inf, inf->bas - 1) || + rec_byte(inf, distance)) + break; + + /* Top of stack value before repetition, or end condition value */ + --inf->bas; + } + + i = inf->bas + 1; + + if (inf->ecode == 0 && i <= bas && inf->sp[i]) { + /* Only the repetition at the bottom jump to code after testing + * all possibilities */ + distance = inf->len - inf->sr[i]; + inf->cod[inf->sr[i] + 3] = distance & 0xff; + inf->cod[inf->sr[i] + 4] = (distance & 0xff00) >> 8; + + /* The bottom jump is here */ + if (rec_code(inf, inf->sp[i] == 1 ? Re_DoneIf : Re_MaybeDone)) + return (inf->ecode); + + /* Generate jumps to the previous special repetition */ + for (++i; i <= bas; i++) { + if (inf->sp[i]) { + distance = inf->sj[i] - inf->sr[i]; + inf->cod[inf->sr[i] + 3] = distance & 0xff; + inf->cod[inf->sr[i] + 4] = (distance & 0xff00) >> 8; + } + } + } + + return (inf->ecode); +} + +static int +rec_rep_spc(re_inf *inf, int top) +{ + int distance, i, bas = inf->bas; + + while (inf->bas > top) { + if (inf->sp[inf->bas]) { + /* Jump to this repetition for cleanup */ + distance = inf->len - inf->sr[inf->bas]; + + /* This will generate a jump to a jump decision opcode */ + inf->sj[inf->bas] = inf->len; + + if (distance > 255) { + if (rec_code(inf, Re_RepLongJump) || + rec_byte(inf, inf->bas - 1) || + rec_byte(inf, distance & 0xff) || + rec_byte(inf, (distance & 0xff00) >> 8)) + break; + } + else if (rec_code(inf, Re_RepJump) || + rec_byte(inf, inf->bas - 1) || + rec_byte(inf, distance)) + break; + } + + /* Top of stack value before repetition, or end condition value */ + --inf->bas; + } + + /* Find first special repetition offset. XXX This should be a noop */ + for (i = 0; i < bas; i++) + if (inf->sp[i]) + break; + + if (inf->ecode == 0 && i <= bas && inf->sp[i]) { + /* Only the repetition at the bottom jump to code after testing + * all possibilities */ + distance = inf->len - inf->sr[i]; + inf->cod[inf->sr[i] + 3] = distance & 0xff; + inf->cod[inf->sr[i] + 4] = (distance & 0xff00) >> 8; + + /* Generate jumps to the previous special repetition */ + for (++i; i <= bas; i++) { + if (inf->sp[i]) { + distance = inf->sj[i] - inf->sr[i]; + inf->cod[inf->sr[i] + 3] = distance & 0xff; + inf->cod[inf->sr[i] + 4] = (distance & 0xff00) >> 8; + } + } + } + + return (inf->ecode); +} + +static int +rec_off_spc(re_inf *inf) +{ + /* The jump address before the three bytes instruction */ + inf->sr[inf->bas] = inf->len - 3; + /* Don't know yet where to go after done with the special + * repetition, just reserve two bytes for the jump address. */ + return (rec_byte_byte(inf, 0, 0)); +} + +#ifdef DEBUG +static void +redump(re_cod *code) +{ + int i, j, k; + unsigned char *cod = code->cod, *stl; + + if (cod[0] & RE_NOSUB) + printf("Nosub\n"); + if (cod[0] & RE_NEWLINE) + printf("Newline\n"); + ++cod; + if (cod[0] != 0xff) + printf("%d backrefs\n", cod[0] + 1); + ++cod; + for (;;) { + switch (*cod++) { + case Re_Open: + printf("Open"); + break; + case Re_Close: + printf("Close"); + break; + case Re_Update: + printf("Update (%d)", (int)*cod++); + break; + case Re_Alt: + printf("Alt"); + i = cod[0] | cod[1]; + cod += 2; + printf(" %d", i); + break; + case Re_AltNext: + printf("Alt-next"); + i = cod[0] | cod[1]; + cod += 2; + printf(" %d", i); + break; + case Re_AltDone: + printf("Alt-done"); + break; + case Re_AnyTimes: + printf("-> Anytimes %d", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_AnyEatAnyTimes: + printf("Any-eat-anytimes"); + break; + case Re_AnyAnyTimes: + printf("-> Any-anytimes %d", (int)*cod++); + printf(" (%d)", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_AnyEatMaybe: + printf("Any-eat-maybe"); + break; + case Re_AnyMaybe: + printf("-> Any-maybe %d", (int)*cod++); + printf(" (%d)", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_AnyAtLeast: + printf("-> Any-atleast %d", (int)*cod++); + printf(" (%d)", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_AnyEatAtLeast: + printf("Any-eat-atleast"); + break; + case Re_Maybe: + printf("-> Maybe %d", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_AtLeast: + printf("-> Atleast %d", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_Exact: + printf("-> Exact "); + i = *cod++; + printf("%d", i); + printf(" %d", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_Min: + printf("-> Min "); + i = *cod++; + printf("%d", i); + printf(" %d", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_Max: + printf("-> Max "); + i = *cod++; + printf("%d", i); + printf(" %d", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_MinMax: + printf("-> Min-max "); + i = *cod++; + printf("%d ", i); + i = *cod++; + printf("%d", i); + printf(" %d", (int)*cod++); + i = cod[0] | (cod[1] << 8); + cod += 2; + printf(" /%d", i); + break; + case Re_RepJump: + printf("<- Rep-jump %d ", (int)*cod++); + i = *cod++; + printf("%d", i); + break; + case Re_RepLongJump: + printf("<- Rep-long-jump %d ", (int)*cod++); + i = cod[0] | (cod[1] << 8); + printf("%d", i); + break; + case Re_Any: + printf("Any"); + break; + case Re_Odigit: + printf("Odigit"); + break; + case Re_OdigitNot: + printf("Odigit-not"); + break; + case Re_Digit: + printf("Digit"); + break; + case Re_DigitNot: + printf("Digit-not"); + break; + case Re_Xdigit: + printf("Xdigit"); + break; + case Re_XdigitNot: + printf("Xdigit-not"); + break; + case Re_Space: + printf("Space"); + break; + case Re_SpaceNot: + printf("Space-not"); + break; + case Re_Tab: + printf("Tab"); + break; + case Re_Newline: + printf("Newline"); + break; + case Re_Lower: + printf("Lower"); + break; + case Re_Upper: + printf("Upper"); + break; + case Re_Alnum: + printf("Alnum"); + break; + case Re_AlnumNot: + printf("Alnum-not"); + break; + case Re_Control: + printf("Control"); + break; + case Re_ControlNot: + printf("Control-not"); + break; + case Re_Bol: + printf("Bol"); + break; + case Re_Eol: + printf("Eol"); + break; + case Re_Bow: + printf("Bow"); + break; + case Re_Eow: + printf("Eow"); + break; + case Re_Range: + printf("Range "); + goto range; + case Re_RangeNot: + printf("Range-not "); +range: + for (i = 0; i < 256; i += 32) { + for (j = k = 0; j < 32; j++) + k |= (*cod++ & 1) << (31 - j); + printf("%x ", k); + } + break; + case Re_Literal: + printf("Literal %c", *cod++); + break; + case Re_LiteralNot: + printf("Literal-not %c", *cod++); + break; + case Re_SearchLiteral: + printf("Search-literal %c", *cod++); + break; + case Re_CaseLiteral: + printf("Case-literal %c", *cod++); + putchar(*cod++); + break; + case Re_CaseLiteralNot: + printf("Case-literal-not %c", *cod++); + putchar(*cod++); + break; + case Re_SearchCaseLiteral: + printf("Search-case-literal %c", *cod++); + putchar(*cod++); + break; + case Re_String: + printf("String "); + goto string; + case Re_SearchString: + printf("Search-string "); + goto string; + case Re_CaseString: + printf("Case-string "); + goto string; + case Re_SearchCaseString: + printf("Search-case-string "); +string: + i = *cod++; + if (i & 0x80) + i = (i & 0x7f) | (*cod++ << 7); + for (j = 0; j < i; j++) + putchar(*cod++); + break; + case Re_StringList: + printf("String-list"); + goto string_list; + case Re_CaseStringList: + printf("Case-string-list"); +string_list: + j = *cod++; + cod += 2; + stl = cod + j; + for (i = 0; i < j; i++) { + k = *cod++; + putchar(i ? ',' : ' '); + fwrite(stl, k, 1, stdout); + stl += k; + } + cod = stl; + break; + case Re_LargeStringList: + printf("Large-string-list"); +large_string_list: + i = cod[0] | (cod[1] << 8); + stl = cod + i - 1; + for (i = 0, cod += 514; cod < stl; i++) { + k = *cod++; + putchar(i ? ',' : ' '); + fwrite(cod, k, 1, stdout); + cod += k; + } + cod = stl; + break; + case Re_LargeCaseStringList: + printf("Large-case-string-list"); + goto large_string_list; + case Re_Backref: + printf("Backref %d", (int)*cod++); + break; + case Re_DoneIf: + printf("Done-if"); + break; + case Re_MaybeDone: + printf("Maybe-done"); + break; + case Re_Done: + printf("Done\n"); + return; + } + putchar('\n'); + } +} +#endif diff --git a/lisp/re/re.h b/lisp/re/re.h new file mode 100644 index 0000000..332366e --- /dev/null +++ b/lisp/re/re.h @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2002 by The XFree86 Project, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE XFREE86 PROJECT BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF + * OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Except as contained in this notice, the name of the XFree86 Project shall + * not be used in advertising or otherwise to promote the sale, use or other + * dealings in this Software without prior written authorization from the + * XFree86 Project. + * + * Author: Paulo César Pereira de Andrade + */ + +/* $XFree86: xc/programs/xedit/lisp/re/re.h,v 1.2 2002/09/23 01:25:41 paulo Exp $ */ + +#include +#include +#include + +#ifndef _re_h +#define _re_h + +/* + * Defines + */ + + /* Compile flags options */ +#define REG_BASIC 0000 /* Not used */ +#define REG_EXTENDED 0001 /* Not used, only extended supported */ + +#define RE_ICASE 0002 +#define RE_NOSUB 0004 +#define RE_NEWLINE 0010 +#define RE_NOSPEC 0020 +#define RE_PEND 0040 +#define RE_DUMP 0200 + + + + /* Execute flag options */ +#define RE_NOTBOL 1 +#define RE_NOTEOL 2 +#define RE_STARTEND 4 +#define RE_TRACE 00400 /* Not used/supported */ +#define RE_LARGE 01000 /* Not used/supported */ +#define RE_BACKR 02000 /* Not used/supported */ + + /* Value returned by reexec when match fails */ +#define RE_NOMATCH 1 + /* Compile error values */ +#define RE_BADPAT 2 +#define RE_ECOLLATE 3 +#define RE_ECTYPE 4 +#define RE_EESCAPE 5 +#define RE_ESUBREG 6 +#define RE_EBRACK 7 +#define RE_EPAREN 8 +#define RE_EBRACE 9 +#define RE_EBADBR 10 +#define RE_ERANGE 11 +#define RE_ESPACE 12 +#define RE_BADRPT 13 +#define RE_EMPTY 14 +#define RE_ASSERT 15 +#define RE_INVARG 16 +#define RE_ATOI 255 /* Not used/supported */ +#define RE_ITOA 0400 /* Not used/supported */ + + +/* + * Types + */ + +/* (re)gular expression (mat)ch result */ +typedef struct _re_mat { + long rm_so; + long rm_eo; +} re_mat; + +/* (re)gular expression (cod)e */ +typedef struct _re_cod { + unsigned char *cod; + int re_nsub; /* Public member */ + const char *re_endp; /* Support for RE_PEND */ +} re_cod; + + +/* + * Prototypes + */ + /* compile the given pattern string + * returns 0 on success, error code otherwise */ +int recomp(re_cod *preg, const char *pattern, int flags); + + /* execute the compiled pattern on the string. + * returns 0 if matched, RE_NOMATCH if failed, error code otherwise */ +int reexec(const re_cod *preg, const char *string, + int nmat, re_mat pmat[], int flags); + + /* formats an error message for the given code in ebuffer */ +int reerror(int ecode, const re_cod *preg, char *ebuffer, int ebuffer_size); + + /* frees the given parameter */ +void refree(re_cod *preg); + + +#endif /* _re_h */ diff --git a/lisp/re/rec.c b/lisp/re/rec.c new file mode 100644 index 0000000..20f9fd9 --- /dev/null +++ b/lisp/re/rec.c @@ -0,0 +1,1015 @@ +/* + * Copyright (c) 2002 by The XFree86 Project, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE XFREE86 PROJECT BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF + * OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Except as contained in this notice, the name of the XFree86 Project shall + * not be used in advertising or otherwise to promote the sale, use or other + * dealings in this Software without prior written authorization from the + * XFree86 Project. + * + * Author: Paulo César Pereira de Andrade + */ + +/* $XFree86: xc/programs/xedit/lisp/re/rec.c,v 1.4 2003/01/16 06:25:52 paulo Exp $ */ + +#include "rep.h" + +/* + * Types + */ + +/* Information used while compiling the intermediate format of the re. */ +typedef struct _irec_info { + unsigned char *ptr; /* Pointer in the given regex pattern */ + unsigned char *end; /* End of regex pattern */ + int flags; /* Compile flags */ + rec_alt *alt; /* Toplevel first/single alternative */ + + rec_alt *palt; /* Current alternative being compiled */ + rec_grp *pgrp; /* Current group, if any */ + rec_pat *ppat; /* Current pattern, if any */ + + /* Number of open parenthesis, for error checking */ + int nparens; + + int ngrps; /* Number of groups, for backreference */ + + int ecode; +} irec_info; + + +/* + * Prototypes + */ + + /* (i)ntermediate (r)egular (e)xpression (c)ompile + * Generates an intermediate stage compiled regex from + * the specified pattern argument. Basically builds an + * intermediate data structure to analyse and do syntax + * error checking. + */ +static void irec_simple_pattern(irec_info*, rec_pat_t); +static void irec_literal_pattern(irec_info*, int); +static void irec_case_literal_pattern(irec_info*, int); +static void irec_open_group(irec_info*); +static void irec_close_group(irec_info*); +static void irec_range(irec_info*); +static void irec_range_single(irec_info*, int); +static void irec_range_complex(irec_info*, int, int); +static void irec_escape(irec_info*); +static void irec_simple_repetition(irec_info*, rec_rep_t); +static void irec_complex_repetition(irec_info*); +static void irec_add_repetition(irec_info*, rec_rep*); +static void irec_free(irec_info*); +static void irec_free_grp(rec_grp*); +static void irec_free_pats(rec_pat*); + + +/* + * Implementation + */ +rec_alt * +irec_comp(const char *pattern, const char *endp, int flags, int *ecode) +{ + unsigned char *ptr; + rec_alt *alt; + irec_info inf; + + if (pattern == NULL || endp < pattern) { + *ecode = RE_INVARG; + return (NULL); + } + + if (endp == pattern) { + *ecode = RE_EMPTY; + return (NULL); + } + + alt = calloc(1, sizeof(rec_alt)); + if (alt == NULL) { + *ecode = RE_ESPACE; + return (NULL); + } + + inf.ptr = (unsigned char*)pattern; + inf.end = (unsigned char*)endp; + inf.flags = flags; + inf.alt = inf.palt = alt; + inf.pgrp = NULL; + inf.ppat = NULL; + inf.nparens = inf.ngrps = 0; + inf.ecode = 0; + + if (flags & RE_NOSPEC) { + /* Just searching for a character or substring */ + for (; inf.ecode == 0 && inf.ptr < inf.end; inf.ptr++) { + if (!(flags & RE_ICASE) || + (!isupper(*inf.ptr) && !islower(*inf.ptr))) + irec_literal_pattern(&inf, *inf.ptr); + else + irec_case_literal_pattern(&inf, *inf.ptr); + } + } + /* inf.ptr = inf.end is nul if flags & RE_NOSPEC */ + for (; inf.ecode == 0 && inf.ptr < inf.end;) { + switch (*inf.ptr++) { + case '*': + irec_simple_repetition(&inf, Rer_AnyTimes); + break; + case '+': + irec_simple_repetition(&inf, Rer_AtLeast); + break; + case '?': + irec_simple_repetition(&inf, Rer_Maybe); + break; + case '.': + irec_simple_pattern(&inf, Rep_Any); + break; + case '^': + if (flags & RE_NEWLINE) + /* It is up to the user decide if this can match */ + irec_simple_pattern(&inf, Rep_Bol); + else { + for (ptr = inf.ptr - 1; + ptr > (unsigned char*)pattern && *ptr == '('; ptr--) + ; + /* If at the start of a pattern */ + if (ptr == (unsigned char*)pattern || *ptr == '|') + irec_simple_pattern(&inf, Rep_Bol); + else + /* In the middle of a pattern, treat as literal */ + irec_literal_pattern(&inf, '^'); + } + break; + case '$': + if (flags & RE_NEWLINE) + irec_simple_pattern(&inf, Rep_Eol); + else { + /* Look ahead to check if is the last char of a group */ + for (ptr = inf.ptr; ptr < inf.end && *ptr == ')'; ptr++) + ; + if (*ptr == '\0' || *ptr == '|') + /* Last character of pattern, an EOL match */ + irec_simple_pattern(&inf, Rep_Eol); + else + /* Normal character */ + irec_literal_pattern(&inf, '$'); + } + break; + case '(': + irec_open_group(&inf); + break; + case ')': + /* Look ahead to check if need to close the group now */ + ptr = inf.ptr; + if (*ptr != '*' && *ptr != '+' && *ptr != '?' && *ptr != '{') + /* If a repetition does not follow */ + irec_close_group(&inf); + else if (inf.pgrp == NULL) + /* A repetition follows, but current group is implicit */ + inf.ecode = RE_EPAREN; + else + /* Can do this as next character is known */ + inf.ppat = NULL; + break; + case '[': + irec_range(&inf); + break; + case ']': + irec_literal_pattern(&inf, ']'); + break; + case '{': + irec_complex_repetition(&inf); + break; + case '}': + irec_literal_pattern(&inf, '}'); + break; + case '|': + /* If first character in the pattern */ + if (inf.ptr - 1 == (unsigned char*)pattern || + /* If last character in the pattern */ + inf.ptr >= inf.end || + /* If empty pattern */ + inf.ptr[0] == '|' || + inf.ptr[0] == ')') + inf.ecode = RE_EMPTY; + else { + rec_alt *alt = calloc(1, sizeof(rec_alt)); + + if (alt) { + alt->prev = inf.palt; + inf.palt->next = alt; + inf.palt = alt; + inf.ppat = NULL; + } + else + inf.ecode = RE_ESPACE; + } + break; + case '\\': + irec_escape(&inf); + break; + default: + if (!(flags & RE_ICASE) || + (!isupper(inf.ptr[-1]) && !islower(inf.ptr[-1]))) + irec_literal_pattern(&inf, inf.ptr[-1]); + else + irec_case_literal_pattern(&inf, inf.ptr[-1]); + break; + } + } + + /* Check if not all groups closed */ + if (inf.ecode == 0 && inf.nparens) + inf.ecode = RE_EPAREN; + + if (inf.ecode == 0) + inf.ecode = orec_comp(inf.alt, flags); + + /* If an error generated */ + if (inf.ecode) { + irec_free(&inf); + alt = NULL; + } + + *ecode = inf.ecode; + + return (alt); +} + +void +irec_free_alt(rec_alt *alt) +{ + rec_alt *next; + + while (alt) { + next = alt->next; + irec_free_pats(alt->pat); + free(alt); + alt = next; + } +} + + + +static void +irec_simple_pattern(irec_info *inf, rec_pat_t type) +{ + rec_pat *pat; + + /* Always add a new pattern to list */ + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + + pat->type = type; + if ((pat->prev = inf->ppat) != NULL) + inf->ppat->next = pat; + else + inf->palt->pat = pat; + inf->ppat = pat; +} + +static void +irec_literal_pattern(irec_info *inf, int value) +{ + int length; + rec_pat *pat; + unsigned char chr, *str; + + /* If there is a current pattern */ + if (inf->ppat && inf->ppat->rep == NULL) { + switch (inf->ppat->type) { + case Rep_Literal: + /* Start literal string */ + chr = inf->ppat->data.chr; + if ((str = malloc(16)) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + inf->ppat->type = Rep_String; + inf->ppat->data.str = str; + str[0] = chr; + str[1] = value; + str[2] = '\0'; + return; + + case Rep_String: + /* Augments literal string */ + length = strlen((char*)inf->ppat->data.str); + if ((length % 16) >= 14) { + if ((str = realloc(inf->ppat->data.str, + length + 18)) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + inf->ppat->data.str = str; + } + inf->ppat->data.str[length] = value; + inf->ppat->data.str[length + 1] = '\0'; + return; + + default: + /* Anything else is added as a new pattern list element */ + break; + } + } + + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + + pat->type = Rep_Literal; + pat->data.chr = value; + if ((pat->prev = inf->ppat) != NULL) + inf->ppat->next = pat; + else + inf->palt->pat = pat; + inf->ppat = pat; +} + +static void +irec_case_literal_pattern(irec_info *inf, int value) +{ + int length; + rec_pat *pat; + unsigned char plower, pupper, lower, upper, *str; + + lower = tolower(value); + upper = toupper(value); + + /* If there is a current pattern */ + if (inf->ppat && inf->ppat->rep == NULL) { + switch (inf->ppat->type) { + case Rep_CaseLiteral: + /* Start case literal string */ + plower = inf->ppat->data.cse.lower; + pupper = inf->ppat->data.cse.upper; + if ((str = malloc(32)) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + inf->ppat->type = Rep_CaseString; + inf->ppat->data.str = str; + str[0] = plower; + str[1] = pupper; + str[2] = lower; + str[3] = upper; + str[4] = '\0'; + return; + + case Rep_CaseString: + /* Augments case literal string */ + length = strlen((char*)inf->ppat->data.str); + if (((length) % 32) >= 28) { + if ((str = realloc(inf->ppat->data.str, + length + 36)) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + inf->ppat->data.str = str; + } + inf->ppat->data.str[length] = lower; + inf->ppat->data.str[length + 1] = upper; + inf->ppat->data.str[length + 2] = '\0'; + return; + + default: + /* Anything else is added as a new pattern list element */ + break; + } + } + + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + + pat->type = Rep_CaseLiteral; + pat->data.cse.lower = lower; + pat->data.cse.upper = upper; + pat->prev = inf->ppat; + if ((pat->prev = inf->ppat) != NULL) + inf->ppat->next = pat; + else + inf->palt->pat = pat; + inf->ppat = pat; +} + +static void +irec_open_group(irec_info *inf) +{ + rec_pat *pat; + rec_alt *alt; + rec_grp *grp; + + if ((grp = calloc(1, sizeof(rec_grp))) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + free(grp); + inf->ecode = RE_ESPACE; + return; + } + + if ((alt = calloc(1, sizeof(rec_alt))) == NULL) { + free(grp); + free(pat); + inf->ecode = RE_ESPACE; + return; + } + + pat->type = Rep_Group; + pat->data.grp = grp; + grp->parent = pat; + grp->palt = inf->palt; + grp->pgrp = inf->pgrp; + grp->alt = alt; + grp->comp = 0; + if ((pat->prev = inf->ppat) != NULL) + inf->ppat->next = pat; + else + inf->palt->pat = pat; + inf->palt = alt; + inf->ppat = NULL; + + /* Only toplevel parenthesis supported */ + if (++inf->nparens == 1) + ++inf->ngrps; + + inf->pgrp = grp; +} + +static void +irec_close_group(irec_info *inf) +{ + if (inf->pgrp == NULL) { + inf->ecode = RE_EPAREN; + return; + } + + inf->palt = inf->pgrp->palt; + inf->ppat = inf->pgrp->parent; + inf->pgrp = inf->pgrp->pgrp; + + --inf->nparens; +} + +static void +irec_range(irec_info *inf) +{ + int count; + rec_pat *pat; + rec_rng *rng; + int not = inf->ptr[0] == '^'; + + if (not) + ++inf->ptr; + + pat = calloc(1, sizeof(rec_pat)); + if (pat == NULL) { + inf->ecode = RE_ESPACE; + return; + } + + rng = calloc(1, sizeof(rec_rng)); + if (pat == NULL) { + free(pat); + inf->ecode = RE_ESPACE; + return; + } + + pat->data.rng = rng; + pat->type = not ? Rep_RangeNot : Rep_Range; + if ((pat->prev = inf->ppat) != NULL) + inf->ppat->next = pat; + else + inf->palt->pat = pat; + inf->ppat = pat; + + /* First pass, add everything seen */ + for (count = 0; inf->ecode == 0; count++) { + /* If bracket not closed */ + if (inf->ptr == inf->end) { + inf->ecode = RE_EBRACK; + return; + } + /* If not the first character */ + else if (inf->ptr[0] == ']' && count) + break; + else { + /* If not a range of characters */ + if (inf->ptr[1] != '-' || inf->ptr[2] == ']') { + irec_range_single(inf, inf->ptr[0]); + ++inf->ptr; + } + else { + if ((inf->flags & RE_NEWLINE) && + inf->ptr[0] < '\n' && inf->ptr[2] > '\n') { + /* Unless it is forced to be a delimiter, don't allow + * a newline in a character range */ + if (inf->ptr[0] == '\n' - 1) + irec_range_single(inf, inf->ptr[0]); + else + irec_range_complex(inf, inf->ptr[0], '\n' - 1); + if (inf->ptr[2] == '\n' + 1) + irec_range_single(inf, inf->ptr[2]); + else + irec_range_complex(inf, '\n' + 1, inf->ptr[2]); + } + else + irec_range_complex(inf, inf->ptr[0], inf->ptr[2]); + inf->ptr += 3; + } + } + } + + /* Skip ] */ + ++inf->ptr; +} + +static void +irec_range_single(irec_info *inf, int value) +{ + if (value >= 0 && value <= 255) + inf->ppat->data.rng->range[value] = 1; + + if (inf->flags & RE_ICASE) { + if (islower(value)) { + value = toupper(value); + if (value >= 0 && value <= 255) + inf->ppat->data.rng->range[value] = 1; + } + else if (isupper(value)) { + value = tolower(value); + if (value >= 0 && value <= 255) + inf->ppat->data.rng->range[value] = 1; + } + } +} + +static void +irec_range_complex(irec_info *inf, int chrf, int chrt) +{ + if (chrf > chrt) { + inf->ecode = RE_ERANGE; + return; + } + + for (; chrf <= chrt; chrf++) + irec_range_single(inf, chrf); +} + +static void +irec_escape(irec_info *inf) +{ + rec_pat *pat; + unsigned char chr = inf->ptr[0]; + + if (chr == 0) { + inf->ecode = RE_EESCAPE; + return; + } + ++inf->ptr; + switch (chr) { + case 'o': + irec_simple_pattern(inf, Rep_Odigit); + break; + case 'O': + irec_simple_pattern(inf, Rep_OdigitNot); + break; + case 'd': + irec_simple_pattern(inf, Rep_Digit); + break; + case 'D': + irec_simple_pattern(inf, Rep_DigitNot); + break; + case 'x': + irec_simple_pattern(inf, Rep_Xdigit); + break; + case 'X': + irec_simple_pattern(inf, Rep_XdigitNot); + break; + case 's': + irec_simple_pattern(inf, Rep_Space); + break; + case 'S': + irec_simple_pattern(inf, Rep_SpaceNot); + break; + case 't': + irec_simple_pattern(inf, Rep_Tab); + break; + case 'n': + irec_simple_pattern(inf, Rep_Newline); + break; + case 'l': + irec_simple_pattern(inf, Rep_Lower); + break; + case 'u': + irec_simple_pattern(inf, Rep_Upper); + break; + case 'w': + irec_simple_pattern(inf, Rep_Alnum); + break; + case 'W': + irec_simple_pattern(inf, Rep_AlnumNot); + break; + case 'c': + irec_simple_pattern(inf, Rep_Control); + break; + case 'C': + irec_simple_pattern(inf, Rep_ControlNot); + break; + case '<': + irec_simple_pattern(inf, Rep_Bow); + break; + case '>': + irec_simple_pattern(inf, Rep_Eow); + break; + case '1': case '2': case '3': + case '4': case '5': case '6': + case '7': case '8': case '9': + if ((inf->flags & RE_NOSUB) || (chr -= '1') >= inf->ngrps) { + inf->ecode = RE_ESUBREG; + return; + } + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + pat->type = Rep_Backref; + pat->data.chr = chr; + pat->prev = inf->ppat; + if (inf->ppat) + inf->ppat->next = pat; + else + inf->palt->pat = pat; + inf->ppat = pat; + break; + + /* True literals */ + case '0': + irec_literal_pattern(inf, '\0'); + break; + case 'a': + irec_literal_pattern(inf, '\a'); + break; + case 'b': + irec_literal_pattern(inf, '\b'); + break; + case 'f': + irec_literal_pattern(inf, '\f'); + break; + case 'r': + irec_literal_pattern(inf, '\r'); + break; + case 'v': + irec_literal_pattern(inf, '\v'); + break; + + default: + /* Don't check if case insensitive regular expression */ + irec_literal_pattern(inf, chr); + break; + } +} + +static void +irec_simple_repetition(irec_info *inf, rec_rep_t type) +{ + rec_rep *rep; + + /* If nowhere to add repetition */ + if ((inf->pgrp == NULL && inf->ppat == NULL) || + /* If repetition already added to last/current pattern */ + (inf->pgrp == NULL && inf->ppat->rep != NULL) || + /* If repetition already added to last/current group */ + (inf->ppat == NULL && inf->pgrp->parent->rep != NULL)) { + inf->ecode = RE_BADRPT; + return; + } + + if ((rep = calloc(1, sizeof(rec_rep))) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + + rep->type = type; + irec_add_repetition(inf, rep); +} + +static void +irec_complex_repetition(irec_info *inf) +{ + int exact; + rec_rep *rep; + long mine, maxc; + unsigned char *end; + + /* If nowhere to add repetition */ + if ((inf->pgrp == NULL && inf->ppat == NULL) || + /* If repetition already added to last/current pattern */ + (inf->pgrp == NULL && inf->ppat->rep != NULL) || + /* If repetition already added to last/current group */ + (inf->ppat == NULL && inf->pgrp->parent->rep != NULL)) { + inf->ecode = RE_EBADBR; + return; + } + + exact = 0; + mine = maxc = -1; + if (inf->ptr[0] == ',') + /* Specify max number of ocurrences only */ + goto domax; + else if (!isdigit(inf->ptr[0])) + goto badbr; + + mine = strtol((char*)inf->ptr, (char**)&end, 10); + inf->ptr = end; + if (inf->ptr[0] == '}') { + exact = 1; + ++inf->ptr; + goto redone; + } + else if (inf->ptr[0] != ',') + goto badbr; + +domax: + /* Add one to skip comma */ + ++inf->ptr; + if (inf->ptr[0] == '}') { + ++inf->ptr; + goto redone; + } + else if (!isdigit(inf->ptr[0])) + goto badbr; + maxc = strtol((char*)inf->ptr, (char**)&end, 10); + inf->ptr = end; + if (inf->ptr[0] != '}') + goto badbr; + ++inf->ptr; + +redone: + if (mine == maxc) { + maxc = -1; + exact = 1; + } + + /* Check range and if min-max parameters are valid */ + if (mine >= 255 || maxc >= 255 || + (mine >= 0 && maxc >= 0 && mine > maxc)) + goto badbr; + + /* Check for noop */ + if (exact && mine == 1) + return; + + if ((rep = calloc(1, sizeof(rec_rep))) == NULL) { + inf->ecode = RE_ESPACE; + return; + } + + /* Convert {0,1} to ? */ + if (mine == 0 && maxc == 1) + rep->type = Rer_Maybe; + else if (exact) { + rep->type = Rer_Exact; + rep->mine = mine; + } + /* Convert {0,} to * */ + else if (mine == 0 && maxc == -1) + rep->type = Rer_AnyTimes; + /* Convert {1,} to + */ + else if (mine == 1 && maxc == -1) + rep->type = Rer_AtLeast; + else if (maxc == -1) { + rep->type = Rer_Min; + rep->mine = mine; + } + else if (mine < 1) { + rep->type = Rer_Max; + rep->maxc = maxc; + } + else { + rep->type = Rer_MinMax; + rep->mine = mine; + rep->maxc = maxc; + } + + irec_add_repetition(inf, rep); + + return; + +badbr: + inf->ecode = RE_EBADBR; +} + +/* The rep argument is allocated and has no reference yet, + * if something fails it must be freed before returning. + */ +static void +irec_add_repetition(irec_info *inf, rec_rep *rep) +{ + int length; + rec_pat *pat; + rec_grp *grp; + rec_rep_t rept; + unsigned char value, upper; + + rept = rep->type; + + if (inf->ppat == NULL) { + rec_pat *any; + rec_grp *grp = inf->pgrp; + + if (rept == Rer_AnyTimes || rept == Rer_Maybe || rept == Re_AtLeast) { + /* Convert (.)* to (.*), ((.))* not handled and may not match */ + any = NULL; + + if (grp->alt && grp->alt->pat) { + for (any = grp->alt->pat; any->next; any = any->next) + ; + switch (any->type) { + case Rep_Any: + break; + case Rep_AnyAnyTimes: + case Rep_AnyMaybe: + case Rep_AnyAtLeast: + free(rep); + inf->ecode = RE_BADRPT; + return; + default: + any = NULL; + break; + } + } + if (any) { + free(rep); + rep = NULL; + any->type = (rept == Rer_AnyTimes) ? Rep_AnyAnyTimes : + (rept == Rer_AtLeast) ? Rep_AnyAtLeast : + Rep_AnyMaybe; + while (grp) { + ++grp->comp; + grp = grp->pgrp; + } + } + } + inf->pgrp->parent->rep = rep; + irec_close_group(inf); + return; + } + + switch (inf->ppat->type) { + case Rep_Bol: + case Rep_Eol: + case Rep_Bow: + case Rep_Eow: + case Rep_AnyAnyTimes: + case Rep_AnyMaybe: + case Rep_AnyAtLeast: + /* Markers that cannot repeat */ + free(rep); + inf->ecode = RE_BADRPT; + return; + + case Rep_Any: + grp = inf->pgrp; + free(rep); + if (rept == Rer_AnyTimes || + rept == Rer_Maybe || + rept == Rer_AtLeast) { + inf->ppat->type = (rept == Rer_AnyTimes) ? + Rep_AnyAnyTimes : + (rept == Rer_Maybe) ? + Rep_AnyMaybe : + Rep_AnyAtLeast; + while (grp) { + ++grp->comp; + grp = grp->pgrp; + } + } + else + /* XXX Not (yet) implemented */ + inf->ecode = RE_BADRPT; + rep = NULL; + break; + + case Rep_String: + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + free(rep); + inf->ecode = RE_ESPACE; + return; + } + + length = strlen((char*)inf->ppat->data.str); + pat->type = Rep_Literal; + pat->prev = inf->ppat; + pat->data.chr = inf->ppat->data.str[length - 1]; + if (length == 2) { + /* Must convert to two Rep_Literals */ + value = inf->ppat->data.str[0]; + free(inf->ppat->data.str); + inf->ppat->data.chr = value; + inf->ppat->type = Rep_Literal; + } + else + /* Must remove last character from string */ + inf->ppat->data.str[length - 1] = '\0'; + inf->ppat->next = pat; + inf->ppat = pat; + break; + + case Rep_CaseString: + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + free(rep); + inf->ecode = RE_ESPACE; + return; + } + + length = strlen((char*)inf->ppat->data.str); + pat->type = Rep_CaseLiteral; + pat->prev = inf->ppat; + pat->data.cse.lower = inf->ppat->data.str[length - 2]; + pat->data.cse.upper = inf->ppat->data.str[length - 1]; + if (length == 4) { + /* Must convert to two Rep_CaseLiterals */ + value = inf->ppat->data.str[0]; + upper = inf->ppat->data.str[1]; + free(inf->ppat->data.str); + inf->ppat->data.cse.lower = value; + inf->ppat->data.cse.upper = upper; + inf->ppat->next = pat; + inf->ppat->type = Rep_CaseLiteral; + } + else + /* Must remove last character pair from string */ + inf->ppat->data.str[length - 2] = '\0'; + inf->ppat->next = pat; + inf->ppat = pat; + break; + + default: + /* Anything else does not need special handling */ + break; + } + + inf->ppat->rep = rep; +} + +static void +irec_free(irec_info *inf) +{ + irec_free_alt(inf->alt); +} + +static void +irec_free_grp(rec_grp *grp) +{ + if (grp->alt) + irec_free_alt(grp->alt); + free(grp); +} + +static void +irec_free_pats(rec_pat *pat) +{ + rec_pat *next; + rec_pat_t rect; + + while (pat) { + next = pat->next; + if (pat->rep) + free(pat->rep); + rect = pat->type; + if (rect == Rep_Range || rect == Rep_RangeNot) + free(pat->data.rng); + else if (rect == Rep_Group) + irec_free_grp(pat->data.grp); + else if (rect == Rep_StringList) + orec_free_stl(pat->data.stl); + free(pat); + pat = next; + } +} diff --git a/lisp/re/reo.c b/lisp/re/reo.c new file mode 100644 index 0000000..59cbf3b --- /dev/null +++ b/lisp/re/reo.c @@ -0,0 +1,685 @@ +/* + * Copyright (c) 2002 by The XFree86 Project, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE XFREE86 PROJECT BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF + * OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Except as contained in this notice, the name of the XFree86 Project shall + * not be used in advertising or otherwise to promote the sale, use or other + * dealings in this Software without prior written authorization from the + * XFree86 Project. + * + * Author: Paulo César Pereira de Andrade + */ + +/* $XFree86: xc/programs/xedit/lisp/re/reo.c,v 1.9 2002/11/15 07:01:33 paulo Exp $ */ + +#include "rep.h" + +/* + * This file is a placeholder to add code to analyse and optimize the + * intermediate data structure generated in rep.c. + * Character ranges are optimized while being generated. + */ + +/* + * Types + */ +typedef struct _orec_inf { + rec_alt *alt; /* Main alternatives list */ + rec_grp *grp; /* Current group pointer */ + int flags; + int ecode; +} orec_inf; + +/* + * Prototypes + */ +static int orec_alt(orec_inf*, rec_alt*); +static int orec_pat(orec_inf*, rec_pat*); +static int orec_grp(orec_inf*, rec_grp*); +static int orec_pat_bad_rpt(orec_inf*, rec_pat*); +static int orec_pat_bad_forward_rpt(orec_inf*, rec_pat*); +static int orec_pat_rng(orec_inf*, rec_pat*); +static int orec_pat_cse(orec_inf*, rec_pat*); +static int orec_pat_cse_can(orec_inf*, rec_pat*); +static int orec_str_list(orec_inf*, rec_alt*, int, int); + +/* + * Initialization + */ +extern unsigned char re__alnum[256]; +extern unsigned char re__odigit[256]; +extern unsigned char re__ddigit[256]; +extern unsigned char re__xdigit[256]; +extern unsigned char re__control[256]; + +/* + * Implementation + */ +int +orec_comp(rec_alt *alt, int flags) +{ + orec_inf inf; + + inf.alt = alt; + inf.grp = NULL; + inf.flags = flags; + inf.ecode = 0; + + orec_alt(&inf, alt); + + return (inf.ecode); +} + +void +orec_free_stl(rec_stl *stl) +{ + int i; + + for (i = 0; i < stl->nstrs; i++) { + if (stl->lens[i] > 2) + free(stl->strs[i]); + } + + free(stl->lens); + free(stl->strs); + free(stl); +} + + +static int +orec_alt(orec_inf *inf, rec_alt *alt) +{ + if (alt) { + rec_alt *ptr = alt; + int ret, count = 0, str = 1, cstr = 1, lits = 0, clits = 0; + + /* Check if can build a string list */ + if (ptr->next) { + /* If more than one alternative */ + while (ptr && (str || cstr)) { + if (ptr->pat == NULL || ptr->pat->rep != NULL) { + cstr = str = 0; + break; + } + if ((inf->flags & RE_ICASE)) { + if (!(ret = orec_pat_cse_can(inf, ptr->pat))) { + cstr = str = 0; + break; + } + if (ret == 1) + ++lits; + else if (ret == 2) + ++clits; + } + else if (ptr->pat->next == NULL) { + if (ptr->pat->type != Rep_String) { + if (ptr->pat->type != Rep_Literal) { + str = 0; + if (ptr->pat->type != Rep_CaseString) { + if (ptr->pat->type != Rep_CaseLiteral) + cstr = 0; + else + ++clits; + } + else if (strlen((char*)ptr->pat->data.str) >= 255) + str = cstr = 0; + } + else + ++lits; + } + else if (strlen((char*)ptr->pat->data.str) >= 255) + str = cstr = 0; + } + else { + str = cstr = 0; + break; + } + if (++count >= 255) + str = cstr = 0; + ptr = ptr->next; + } + + if (str || cstr) { + if (inf->flags & RE_ICASE) { + for (ptr = alt; ptr; ptr = ptr->next) { + if (orec_pat_cse(inf, ptr->pat)) + return (inf->ecode); + } + str = 0; + } + return (orec_str_list(inf, alt, str, count)); + } + } + else if (alt == inf->alt && alt->pat && alt->pat->rep == NULL) { + /* If the toplevel single alternative */ + switch (alt->pat->type) { + /* One of these will always be true for RE_NOSPEC, + * but can also be optimized for simple patterns */ + case Rep_Literal: + alt->pat->type = Rep_SearchLiteral; + break; + case Rep_CaseLiteral: + alt->pat->type = Rep_SearchCaseLiteral; + break; + case Rep_String: + alt->pat->type = Rep_SearchString; + break; + case Rep_CaseString: + alt->pat->type = Rep_SearchCaseString; + break; + default: + break; + } + } + + while (alt) { + orec_pat(inf, alt->pat); + alt = alt->next; + } + } + + return (inf->ecode); +} + +static int +orec_pat(orec_inf *inf, rec_pat *pat) +{ + rec_pat *next; + + while (pat) { + switch (pat->type) { + case Rep_AnyAnyTimes: + if (pat->next == NULL) { + rec_grp *grp = inf->grp; + + next = NULL; + while (grp) { + next = grp->parent->next; + /* Cannot check if is .*$ as the input + * may be a substring */ + if (next) + break; + grp = grp->pgrp; + } + if (next == NULL) { + /* .* */ + pat->type = Rep_AnyEatAnyTimes; + grp = inf->grp; + while (grp) { + --grp->comp; + next = grp->parent->next; + if (next) + break; + grp = grp->pgrp; + } + } + else if (orec_pat_bad_rpt(inf, next)) + return (inf->ecode); + } + else if (orec_pat_bad_rpt(inf, pat->next)) + return (inf->ecode); + break; + case Rep_AnyMaybe: + if (pat->next == NULL) { + rec_grp *grp = inf->grp; + + next = NULL; + while (grp) { + next = grp->parent->next; + if (next) + break; + grp = grp->pgrp; + } + if (next == NULL) { + /* .? */ + pat->type = Rep_AnyEatMaybe; + grp = inf->grp; + while (grp) { + --grp->comp; + next = grp->parent->next; + if (next) + break; + grp = grp->pgrp; + } + } + else if (orec_pat_bad_rpt(inf, next)) + return (inf->ecode); + } + else if (orec_pat_bad_rpt(inf, pat->next)) + return (inf->ecode); + break; + case Rep_AnyAtLeast: + if (pat->next == NULL) { + rec_grp *grp = inf->grp; + + next = NULL; + while (grp) { + next = grp->parent->next; + if (next) + break; + grp = grp->pgrp; + } + if (next == NULL) { + /* .+ */ + pat->type = Rep_AnyEatAtLeast; + grp = inf->grp; + while (grp) { + --grp->comp; + next = grp->parent->next; + if (next) + break; + grp = grp->pgrp; + } + } + else if (orec_pat_bad_rpt(inf, next)) + return (inf->ecode); + } + else if (orec_pat_bad_rpt(inf, pat->next)) + return (inf->ecode); + break; + case Rep_Range: + case Rep_RangeNot: + orec_pat_rng(inf, pat); + break; + case Rep_Group: + orec_grp(inf, pat->data.grp); + break; + default: + break; + } + pat = pat->next; + } + + return (inf->ecode); +} + +static int +orec_pat_bad_rpt(orec_inf *inf, rec_pat *pat) +{ + switch (pat->type) { + /* Not really an error, but aren't supported by the library. + * Includes: .*.*, .+? .**, (.*)(*), etc. + */ + + /* Not a repetition, but mathes anything... */ + case Rep_Any: + + /* Zero length matches */ + case Rep_Eol: + if (!(inf->flags & RE_NEWLINE)) + break; + case Rep_Bol: + case Rep_Bow: + case Rep_Eow: + + /* Repetitions */ + case Rep_AnyAnyTimes: + case Rep_AnyMaybe: + case Rep_AnyAtLeast: + inf->ecode = RE_BADRPT; + break; + + /* Check if the first group element is a complex pattern */ + case Rep_Group: + if (pat->rep == NULL) { + if (pat->data.grp->alt) { + for (pat = pat->data.grp->alt->pat; pat; pat = pat->next) { + if (orec_pat_bad_rpt(inf, pat)) + break; + } + } + break; + } + /*FALLTHROUGH*/ + default: + if (pat->rep) + inf->ecode = RE_BADRPT; + break; + } + + if (!inf->ecode && pat && pat->next) + orec_pat_bad_forward_rpt(inf, pat->next); + + return (inf->ecode); +} + +static int +orec_pat_bad_forward_rpt(orec_inf *inf, rec_pat *pat) +{ + if (pat->rep) { + switch (pat->rep->type) { + case Rer_MinMax: + if (pat->rep->mine > 0) + break; + case Rer_AnyTimes: + case Rer_Maybe: + case Rer_Max: + inf->ecode = RE_BADRPT; + default: + break; + } + } + else if (pat->type == Rep_Group && + pat->data.grp->alt && + pat->data.grp->alt->pat) + orec_pat_bad_forward_rpt(inf, pat->data.grp->alt->pat); + + return (inf->ecode); +} + +static int +orec_grp(orec_inf *inf, rec_grp *grp) +{ + rec_grp *prev = inf->grp; + + inf->grp = grp; + orec_alt(inf, grp->alt); + /* Could also just say: inf->grp = grp->gparent */ + inf->grp = prev; + + return (inf->ecode); +} + +static int +orec_pat_rng(orec_inf *inf, rec_pat *pat) +{ + int i, j[2], count; + rec_pat_t type = pat->type; + unsigned char *range = pat->data.rng->range; + + for (i = count = j[0] = j[1] = 0; i < 256; i++) { + if (range[i]) { + if (count == 2) { + ++count; + break; + } + j[count++] = i; + } + } + + if (count == 1 || + (count == 2 && + ((islower(j[0]) && toupper(j[0]) == j[1]) || + (isupper(j[0]) && tolower(j[0]) == j[1])))) { + free(pat->data.rng); + if (count == 1) { + pat->data.chr = j[0]; + pat->type = type == Rep_Range ? Rep_Literal : Rep_LiteralNot; + } + else { + pat->data.cse.upper = j[0]; + pat->data.cse.lower = j[1]; + pat->type = type == Rep_Range ? Rep_CaseLiteral : Rep_CaseLiteralNot; + } + } + else { + if (memcmp(re__alnum, range, 256) == 0) + type = type == Rep_Range ? Rep_Alnum : Rep_AlnumNot; + else if (memcmp(re__odigit, range, 256) == 0) + type = type == Rep_Range ? Rep_Odigit : Rep_OdigitNot; + else if (memcmp(re__ddigit, range, 256) == 0) + type = type == Rep_Range ? Rep_Digit : Rep_DigitNot; + else if (memcmp(re__xdigit, range, 256) == 0) + type = type == Rep_Range ? Rep_Xdigit : Rep_XdigitNot; + else if (memcmp(re__control, range, 256) == 0) + type = type == Rep_Range ? Rep_Control : Rep_ControlNot; + + if (type != pat->type) { + free(pat->data.rng); + pat->type = type; + } + } + + return (inf->ecode); +} + +/* Join patterns if required, will only fail on memory allocation failure: + */ +static int +orec_pat_cse(orec_inf *inf, rec_pat *pat) +{ + rec_pat_t type; + int i, len, length; + rec_pat *ptr, *next; + unsigned char *str, *tofree; + + if (pat->next == NULL && pat->type == Rep_CaseString) + return (inf->ecode); + + type = Rep_CaseString; + + /* First calculate how many bytes will be required */ + for (ptr = pat, length = 1; ptr; ptr = ptr->next) { + switch (ptr->type) { + case Rep_Literal: + length += 2; + break; + case Rep_String: + length += strlen((char*)ptr->data.str) << 1; + break; + case Rep_CaseLiteral: + length += 2; + break; + case Rep_CaseString: + length += strlen((char*)ptr->data.str); + break; + default: + break; + } + } + + if ((str = malloc(length)) == NULL) + return (inf->ecode = RE_ESPACE); + + for (ptr = pat, length = 0; ptr; ptr = next) { + tofree = NULL; + next = ptr->next; + switch (ptr->type) { + case Rep_Literal: + str[length++] = ptr->data.chr; + str[length++] = ptr->data.chr; + break; + case Rep_String: + tofree = ptr->data.str; + len = strlen((char*)tofree); + for (i = 0; i < len; i++) { + str[length++] = tofree[i]; + str[length++] = tofree[i]; + } + break; + case Rep_CaseLiteral: + str[length++] = ptr->data.cse.lower; + str[length++] = ptr->data.cse.upper; + break; + case Rep_CaseString: + tofree = ptr->data.str; + len = strlen((char*)tofree); + memcpy(str + length, tofree, len); + length += len; + break; + default: + break; + } + if (tofree) + free(tofree); + if (ptr != pat) + free(ptr); + } + str[length] = '\0'; + + pat->type = type; + pat->data.str = str; + pat->next = NULL; + + return (inf->ecode); +} + +/* Return 0 if the patterns in the list cannot be merged, 1 if will + * be a simple string, 2 if a case string. + * This is useful when building an alternative list that is composed + * only of strings, but the regex is case insensitive, in wich case + * the first pass may have splited some patterns, but if it is a member + * of an alternatives list, the cost of using a string list is smaller */ +static int +orec_pat_cse_can(orec_inf *inf, rec_pat *pat) +{ + int ret; + + if (pat == NULL) + return (0); + + for (ret = 1; pat; pat = pat->next) { + if (pat->rep) + return (0); + switch (pat->type) { + case Rep_Literal: + case Rep_String: + break; + case Rep_CaseLiteral: + case Rep_CaseString: + ret = 2; + break; + default: + return (0); + } + } + + return (ret); +} + + +/* XXX If everything is a (case) byte, the pattern should be + * [abcde] instead of a|b|c|d|e (or [aAbBcCdDeE] instead of aA|bB|cC|dD|eE) + * as a string list works fine, but as a character range + * should be faster, and maybe could be converted here. But not + * very important, if performance is required, it should have already + * been done in the pattern. + */ +static int +orec_str_list(orec_inf *inf, rec_alt *alt, int str, int count) +{ + rec_stl *stl; + rec_pat *pat; + rec_alt *ptr, *next; + int i, j, tlen, len, is; + + if ((stl = calloc(1, sizeof(rec_stl))) == NULL) + return (inf->ecode = RE_ESPACE); + + if ((stl->lens = malloc(sizeof(unsigned char) * count)) == NULL) { + free(stl); + return (inf->ecode = RE_ESPACE); + } + + if ((stl->strs = malloc(sizeof(char*) * count)) == NULL) { + free(stl->lens); + free(stl); + return (inf->ecode = RE_ESPACE); + } + + if ((pat = calloc(1, sizeof(rec_pat))) == NULL) { + free(stl->strs); + free(stl->lens); + free(stl); + return (inf->ecode = RE_ESPACE); + } + + pat->data.stl = stl; + pat->type = Rep_StringList; + stl->type = str ? Resl_StringList : Resl_CaseStringList; + for (i = tlen = 0, ptr = alt; i < count; i++) { + next = ptr->next; + switch (ptr->pat->type) { + case Rep_Literal: + is = len = 1; + break; + case Rep_CaseLiteral: + is = len = 2; + break; + default: + is = 0; + len = strlen((char*)ptr->pat->data.str); + break; + } + tlen += len; + stl->lens[i] = len; + if (!is) { + if (len > 2) + stl->strs[i] = ptr->pat->data.str; + else { + if (len == 1) + stl->strs[i] = (void*)(long)(ptr->pat->data.str[0]); + else + stl->strs[i] = (void*)(long) + (ptr->pat->data.str[0] | + ((int)ptr->pat->data.str[1] << 8)); + free(ptr->pat->data.str); + } + } + else { + if (is == 1) + stl->strs[i] = (void*)(long)ptr->pat->data.chr; + else + stl->strs[i] = (void*)(long) + (ptr->pat->data.cse.lower | + (ptr->pat->data.cse.upper << 8)); + } + free(ptr->pat); + if (i) + free(ptr); + ptr = next; + } + stl->tlen = tlen; + stl->nstrs = count; + + alt->pat = pat; + alt->next = NULL; + + { + int li, lj; + unsigned char ci, cj, *str; + + /* Don't need a stable sort, there shouldn't be duplicated strings, + * but don't check for it either. Only need to make sure that all + * strings that start with the same byte are together */ + for (i = 0; i < count; i++) { + li = stl->lens[i]; + ci = li > 2 ? stl->strs[i][0] : (long)stl->strs[i] & 0xff; + for (j = i + 1; j < count; j++) { + lj = stl->lens[j]; + cj = lj > 2 ? stl->strs[j][0] : (long)stl->strs[j] & 0xff; + if ((count >= LARGE_STL_COUNT && cj < ci) || + (cj == ci && lj > li)) { + /* If both strings start with the same byte, + * put the longer first */ + str = stl->strs[j]; + stl->strs[j] = stl->strs[i]; + stl->strs[i] = str; + stl->lens[j] = li; + stl->lens[i] = lj; + li ^= lj; lj ^= li; li ^= lj; + ci ^= cj; cj ^= ci; ci ^= cj; + } + } + } + } + + return (inf->ecode); +} diff --git a/lisp/re/rep.h b/lisp/re/rep.h new file mode 100644 index 0000000..5e4d5d5 --- /dev/null +++ b/lisp/re/rep.h @@ -0,0 +1,369 @@ +/* + * Copyright (c) 2002 by The XFree86 Project, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE XFREE86 PROJECT BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF + * OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Except as contained in this notice, the name of the XFree86 Project shall + * not be used in advertising or otherwise to promote the sale, use or other + * dealings in this Software without prior written authorization from the + * XFree86 Project. + * + * Author: Paulo César Pereira de Andrade + */ + +/* $XFree86: xc/programs/xedit/lisp/re/rep.h,v 1.3 2002/11/25 02:35:32 paulo Exp $ */ + +#include "re.h" + +#ifndef _rep_h +#define _rep_h + +/* + * Local defines + */ + +#ifdef MIN +#undef MIN +#endif +#define MIN(a, b) ((a) < (b) ? (a) : (b)) + +#ifdef MAX +#undef MAX +#endif +#define MAX(a, b) ((a) > (b) ? (a) : (b)) + +/* This value can not be larger than 255, a depth value is the nesting of + * repetition operations and alternatives. The number of nested parenthesis + * does not matter, but a repetition on the pattern inside the parenthesis + * does. Note also that you cannot have more than 9 parenthesis pairs in + * an expression. + * Depth is always at least 1. So for MAX_DEPTH 8, it is only allowed + * 7 complex repetitions. A complex repetition is a dot followed by an + * repetition operator. It is called a complex repetition because dot + * matches anything but the empty string, so the engine needs to test + * all possible combinations until the end of the string is found. + * Repetitions like .* use one depth until the end of the string is found, + * for example a.*b.*c.*d has depth 4, while a*b*c*d has depth 2. + */ +#define MAX_DEPTH 8 + +/* Minimum number of strings to generate a "large" string list, that is, + * sort the strings and allocate 512 extra bytes to map the first string + * with a given initial byte. */ +#define LARGE_STL_COUNT 16 + +/* + * Local types + */ +/* Intermediate compilation types declaration */ + /* (r)egular (e)xpression (c)ompile (c)a(se) */ +typedef struct _rec_cse rec_cse; + + /* (r)egular (e)xpression (c)ompile (r)a(ng)e */ +typedef struct _rec_rng rec_rng; + + /* (r)egular (e)xpression (c)ompile (pat)tern */ +typedef struct _rec_pat rec_pat; + + /* (r)egular (e)xpression (c)ompile (rep)etition */ +typedef struct _rec_rep rec_rep; + + /* (r)egular (e)xpression (c)ompile (gr)ou(p) */ +typedef struct _rec_grp rec_grp; + + /* (r)egular (e)xpression (c)ompile (alt)ernatives */ +typedef struct _rec_alt rec_alt; + + +/* Optimization types */ + /* (r)egular (e)xpression (c)ompile (st)ring (l)ist */ +typedef struct _rec_stl rec_stl; + +/* Final compilation and execution types */ + /* (re)gular expression (inf)ormation */ +typedef struct _re_inf re_inf; + + /* (re)gular expression (eng)ine */ +typedef struct _re_eng re_eng; + + +/* Codes used by the engine */ +typedef enum { + /* Grouping */ + Re_Open, /* ( */ + Re_Close, /* ) */ + Re_Update, /* Like Re_Close, but is inside a loop */ + + /* Alternatives */ + Re_Alt, /* Start alternative list, + next offset */ + Re_AltNext, /* Next alternative, + next offset */ + Re_AltDone, /* Finish alternative list */ + + /* Repetition */ + Re_AnyTimes, /* * */ + Re_Maybe, /* ? */ + Re_AtLeast, /* +, at least one */ + + /* Repetition like */ + Re_AnyAnyTimes, /* .* */ + Re_AnyMaybe, /* .? */ + Re_AnyAtLeast, /* .+ */ + + Re_AnyEatAnyTimes, /* Expression ends with .* */ + Re_AnyEatMaybe, /* Expression ends with .? */ + Re_AnyEatAtLeast, /* Expression ends with .+ */ + + /* Repetition with arguments */ + Re_Exact, /* {e} */ + Re_Min, /* {n,} */ + Re_Max, /* {,m} */ + Re_MinMax, /* {n,m} */ + + /* Repetition helper instruction */ + Re_RepJump, /* Special code, go back to repetition */ + Re_RepLongJump, /* Jump needs two bytes */ + /* After the repetition data, all repetitions have an offset + * to the code after the repetition */ + + /* Matching */ + Re_Any, /* . */ + Re_Odigit, /* \o */ + Re_OdigitNot, /* \O */ + Re_Digit, /* \d */ + Re_DigitNot, /* \D */ + Re_Xdigit, /* \x */ + Re_XdigitNot, /* \x */ + Re_Space, /* \s */ + Re_SpaceNot, /* \S */ + Re_Tab, /* \t */ + Re_Newline, /* \n */ + Re_Lower, /* \l */ + Re_Upper, /* \u */ + Re_Alnum, /* \w */ + Re_AlnumNot, /* \W */ + Re_Control, /* \c */ + Re_ControlNot, /* \C */ + Re_Bol, /* ^ */ + Re_Eol, /* $ */ + Re_Bow, /* \< */ + Re_Eow, /* \> */ + + /* Range matching information */ + Re_Range, /* + 256 bytes */ + Re_RangeNot, /* + 256 bytes */ + + /* Matching with arguments */ + Re_Literal, /* + character */ + Re_CaseLiteral, /* + lower + upper */ + Re_LiteralNot, /* + character */ + Re_CaseLiteralNot, /* + lower + upper */ + Re_String, /* + length + string */ + Re_CaseString, /* + length + string in format lower-upper */ + + /* These are useful to start matching, or when RE_NOSPEC is used. */ + Re_SearchLiteral, + Re_SearchCaseLiteral, + Re_SearchString, + Re_SearchCaseString, + + Re_StringList, /* + total-length + lengths + strings */ + Re_CaseStringList, /* + total-length + lengths + strings */ + + Re_LargeStringList, /* + total-length + lengths + map + strings */ + Re_LargeCaseStringList, /* + total-length + lengths + map + strings */ + + /* Backreference */ + Re_Backref, /* + reference number */ + + /* The last codes */ + Re_DoneIf, /* Done if at end of input */ + Re_MaybeDone, /* Done */ + Re_Done /* If this code found, finished execution */ +} ReCode; + + +/* (r)egular (e)xpresssion (pat)rern (t)ype */ +typedef enum _rec_pat_t { + Rep_Literal = Re_Literal, + Rep_CaseLiteral = Re_CaseLiteral, + Rep_LiteralNot = Re_LiteralNot, + Rep_CaseLiteralNot = Re_CaseLiteralNot, + Rep_Range = Re_Range, + Rep_RangeNot = Re_RangeNot, + Rep_String = Re_String, + Rep_CaseString = Re_CaseString, + Rep_SearchLiteral = Re_SearchLiteral, + Rep_SearchCaseLiteral = Re_SearchCaseLiteral, + Rep_SearchString = Re_SearchString, + Rep_SearchCaseString = Re_SearchCaseString, + Rep_Any = Re_Any, + Rep_AnyAnyTimes = Re_AnyAnyTimes, + Rep_AnyEatAnyTimes = Re_AnyEatAnyTimes, + Rep_AnyMaybe = Re_AnyMaybe, + Rep_AnyEatMaybe = Re_AnyEatMaybe, + Rep_AnyAtLeast = Re_AnyAtLeast, + Rep_AnyEatAtLeast = Re_AnyEatAtLeast, + Rep_Odigit = Re_Odigit, + Rep_OdigitNot = Re_OdigitNot, + Rep_Digit = Re_Digit, + Rep_DigitNot = Re_DigitNot, + Rep_Xdigit = Re_Xdigit, + Rep_XdigitNot = Re_XdigitNot, + Rep_Space = Re_Space, + Rep_SpaceNot = Re_SpaceNot, + Rep_Tab = Re_Tab, + Rep_Newline = Re_Newline, + Rep_Lower = Re_Lower, + Rep_Upper = Re_Upper, + Rep_Alnum = Re_Alnum, + Rep_AlnumNot = Re_AlnumNot, + Rep_Control = Re_Control, + Rep_ControlNot = Re_ControlNot, + Rep_Bol = Re_Bol, + Rep_Eol = Re_Eol, + Rep_Bow = Re_Bow, + Rep_Eow = Re_Eow, + Rep_Backref = Re_Backref, + Rep_StringList = Re_StringList, + Rep_Group = Re_Open +} rec_pat_t; + + +/* (r)egular (e)xpression (rep)etition (t)ype */ +typedef enum _rec_rep_t { + Rer_AnyTimes = Re_AnyTimes, + Rer_AtLeast = Re_AtLeast, + Rer_Maybe = Re_Maybe, + Rer_Exact = Re_Exact, + Rer_Min = Re_Min, + Rer_Max = Re_Max, + Rer_MinMax = Re_MinMax +} rec_rep_t; + + +/* Decide at re compilation time what is lowercase and what is uppercase */ +struct _rec_cse { + unsigned char lower; + unsigned char upper; +}; + + +/* A rec_rng is used only during compilation, just a character map */ +struct _rec_rng { + unsigned char range[256]; +}; + + +/* A rec_pat is used only during compilation, and can be viewed as + * a regular expression element like a match to any character, a match + * to the beginning or end of the line, etc. + * It is implemented as a linked list, and does not have nesting. + * The data field can contain: + * chr: the value of a single character to match. + * cse: the upper and lower case value of a character to match. + * rng: a character map to match or not match. + * str: a simple string or a string where every two bytes + * represents the character to match, in lower/upper + * case sequence. + * The rep field is not used for strings, strings are broken in the + * last character in this case. That is, strings are just a concatenation + * of several character matches. + */ +struct _rec_pat { + rec_pat_t type; + rec_pat *next, *prev; /* Linked list information */ + union { + unsigned char chr; + rec_cse cse; + rec_rng *rng; + rec_grp *grp; + unsigned char *str; + rec_stl *stl; + } data; + rec_rep *rep; /* Pattern repetition information */ +}; + + +/* A rec_rep is used only during compilation, and can be viewed as: + * + * ? or * or + or {} or {,} or {,} or {,} + * + * where is "exact", is "minimum" and is "maximum". + * In the compiled step it can also be just a NULL pointer, that + * is actually equivalent to {1}. + */ +struct _rec_rep { + rec_rep_t type; + short mine; /* minimum or exact number of matches */ + short maxc; /* maximum number of matches */ +}; + + +/* A rec_alt is used only during compilation, and can be viewed as: + * + * | + * + * where is any regular expression. The expressions are nested + * using the grp field of the rec_pat structure. + */ +struct _rec_alt { + rec_alt *next, *prev; /* Linked list information */ + rec_pat *pat; +}; + + +/* A rec_grp is a place holder for expressions enclosed in parenthesis + * and is linked to the compilation data by an rec_pat structure. */ +struct _rec_grp { + rec_pat *parent; /* Reference to parent pattern */ + rec_alt *alt; /* The pattern information */ + rec_alt *palt; /* Parent alternative */ + rec_grp *pgrp; /* Nested groups */ + int comp; /* (comp)lex repetition pattern inside group */ +}; + + +/* Optimization compilation types definition */ + /* (r)egular (e)xpression (c)ompile (st)ring (l)ist (t)ype */ +typedef enum { + Resl_StringList = Re_StringList, + Resl_CaseStringList = Re_CaseStringList +} rec_stl_t; + +struct _rec_stl { + rec_stl_t type; + int nstrs; /* Number of strings in list */ + int tlen; /* Total length of all strings */ + unsigned char *lens; /* Vector of string lengths */ + unsigned char **strs; /* The strings */ +}; + + +/* + * Prototypes + */ + /* rep.c */ +rec_alt *irec_comp(const char*, const char*, int, int*); +void irec_free_alt(rec_alt*); + + /* reo.c */ +int orec_comp(rec_alt*, int); +void orec_free_stl(rec_stl*); + +#endif /* _rep_h */ diff --git a/lisp/re/tests.c b/lisp/re/tests.c new file mode 100644 index 0000000..bd5c55d --- /dev/null +++ b/lisp/re/tests.c @@ -0,0 +1,199 @@ +/* + * Copyright (c) 2002 by The XFree86 Project, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE XFREE86 PROJECT BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF + * OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Except as contained in this notice, the name of the XFree86 Project shall + * not be used in advertising or otherwise to promote the sale, use or other + * dealings in this Software without prior written authorization from the + * XFree86 Project. + * + * Author: Paulo César Pereira de Andrade + */ + +/* $XFree86: xc/programs/xedit/lisp/re/tests.c,v 1.1 2002/09/08 02:29:50 paulo Exp $ */ + +/* + * Compile with: cc -o tests tests.c -L. -lre + */ + +#include +#include +#include "re.h" + +int +main(int argc, char *argv[]) +{ + re_cod cod; + re_mat mat[10]; + int line, ecode, i, len, group, failed; + long eo, so; + char buf[8192]; + char str[8192]; + FILE *fp = fopen("tests.txt", "r"); + + if (fp == NULL) { + fprintf(stderr, "failed to open tests.txt\n"); + exit(1); + } + + ecode = line = group = failed = 0; + cod.cod = NULL; + while (fgets(buf, sizeof(buf), fp)) { + ++line; + if (buf[0] == '#' || buf[0] == '\n') + continue; + else if (buf[0] == '/') { + char *ptr = strrchr(buf, '/'); + + if (ptr == buf) { + fprintf(stderr, "syntax error at line %d\n", line); + break; + } + else { + int flags = 0; + + refree(&cod); + for (*ptr++ = '\0'; *ptr; ptr++) { + if (*ptr == 'i') + flags |= RE_ICASE; + else if (*ptr == 'n') + flags |= RE_NEWLINE; + } + ecode = recomp(&cod, buf + 1, flags); + failed = ecode; + } + } + else if (buf[0] == '>') { + if (cod.cod == NULL) { + fprintf(stderr, "no previous pattern at line %d\n", line); + break; + } + len = strlen(buf) - 1; + buf[len] = '\0'; + strcpy(str, buf + 1); + for (i = 0, --len; i < len - 1; i++) { + if (str[i] == '\\') { + memmove(str + i, str + i + 1, len); + --len; + switch (str[i]) { + case 'a': + str[i] = '\a'; + break; + case 'b': + str[i] = '\b'; + break; + case 'f': + str[i] = '\f'; + break; + case 'n': + str[i] = '\n'; + break; + case 'r': + str[i] = '\r'; + break; + case 't': + str[i] = '\t'; + break; + case 'v': + str[i] = '\v'; + break; + default: + break; + } + } + } + group = 0; + ecode = reexec(&cod, str, 10, &mat[0], 0); + if (ecode && ecode != RE_NOMATCH) { + reerror(failed, &cod, buf, sizeof(buf)); + fprintf(stderr, "%s, at line %d\n", buf, line); + break; + } + } + else if (buf[0] == ':') { + if (failed) { + len = strlen(buf) - 1; + buf[len] = '\0'; + if (failed == RE_EESCAPE && strcmp(buf, ":EESCAPE") == 0) + continue; + if (failed == RE_ESUBREG && strcmp(buf, ":ESUBREG") == 0) + continue; + if (failed == RE_EBRACK && strcmp(buf, ":EBRACK") == 0) + continue; + if (failed == RE_EPAREN && strcmp(buf, ":EPAREN") == 0) + continue; + if (failed == RE_EBRACE && strcmp(buf, ":EBRACE") == 0) + continue; + if (failed == RE_EBADBR && strcmp(buf, ":EBADBR") == 0) + continue; + if (failed == RE_ERANGE && strcmp(buf, ":ERANGE") == 0) + continue; + if (failed == RE_ESPACE && strcmp(buf, ":ESPACE") == 0) + continue; + if (failed == RE_BADRPT && strcmp(buf, ":BADRPT") == 0) + continue; + if (failed == RE_EMPTY && strcmp(buf, ":EMPTY") == 0) + continue; + reerror(failed, &cod, buf, sizeof(buf)); + fprintf(stderr, "Error value %d doesn't match: %s, at line %d\n", + failed, buf, line); + break; + } + else if (!ecode) { + fprintf(stderr, "found match when shoudn't, at line %d\n", line); + break; + } + } + else { + if (failed) { + reerror(failed, &cod, buf, sizeof(buf)); + fprintf(stderr, "%s, at line %d\n", line); + break; + } + if (sscanf(buf, "%ld,%ld:", &so, &eo) != 2) { + fprintf(stderr, "expecting match offsets at line %d\n", line); + break; + } + else if (ecode) { + fprintf(stderr, "didn't match, at line %d\n", line); + break; + } + else if (group >= 10) { + fprintf(stderr, "syntax error at line %d (too many groups)\n", + line); + break; + } + else if (so != mat[group].rm_so || eo != mat[group].rm_eo) { + fprintf(stderr, "match failed at line %d, got %ld,%ld: ", + line, mat[group].rm_so, mat[group].rm_eo); + if (mat[group].rm_so < mat[group].rm_eo) + fwrite(str + mat[group].rm_so, + mat[group].rm_eo - mat[group].rm_so, 1, stderr); + fputc('\n', stderr); + break; + } + ++group; + } + } + + fclose(fp); + + return (ecode); +} diff --git a/lisp/re/tests.txt b/lisp/re/tests.txt new file mode 100644 index 0000000..e3da032 --- /dev/null +++ b/lisp/re/tests.txt @@ -0,0 +1,461 @@ +# +# Copyright (c) 2002 by The XFree86 Project, Inc. +# +# Permission is hereby granted, free of charge, to any person obtaining a +# copy of this software and associated documentation files (the "Software"), +# to deal in the Software without restriction, including without limitation +# the rights to use, copy, modify, merge, publish, distribute, sublicense, +# and/or sell copies of the Software, and to permit persons to whom the +# Software is furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL +# THE XFREE86 PROJECT BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, +# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF +# OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# +# Except as contained in this notice, the name of the XFree86 Project shall +# not be used in advertising or otherwise to promote the sale, use or other +# dealings in this Software without prior written authorization from the +# XFree86 Project. +# +# Author: Paulo César Pereira de Andrade +# +# +# $XFree86: xc/programs/xedit/lisp/re/tests.txt,v 1.3 2002/11/08 08:01:00 paulo Exp $ + +# Some tests for the library: +# lines starting with # are comments +# lines starting with / are a regular expression pattern +# The pattern must end with / and may be followed by: +# i -> ignore case +# n -> create newline sensitive regex +# lines starting with > are a string input to the last pattern +# To test newline sensitive matching, add \n to the string. +# lines starting with a number are the expected result +# If more than one line, every subsequent line is the +# value of an "subresult". +# :NOMATCH means that the string input should not match + +# Simple string +/abc/ +>abc +0,3: abc +>aaaaaaaaaaaaaaabc +14,17: abc +>xxxxxxxxxxxxxxaaaaaaaaaaaaaaaaabcxx +30,33: abc + +# String list +/abc|bcd|cde/ +>abc +0,3: abc +>aabc +1,4: abc +>xxxbcdef +3,6: bcd +>abdzzzcdabcde +8,11: abc +>xxxxabdecdabdcde +13,16: cde + +# Complex string +/a?bc|ab?c|abc?/ +>abc +0,3: abc +>xxxb +:NOMATCH +>xxxbc +3,5: bc +>sssssab +5,7: ab + +# Another complex string +/a*bc|ab*c|abc*/ +>aaaaaaabc +0,9: aaaaaaabc +>xaaaaaaabc +1,10: aaaaaaabc +>xyzaaaaaaabc +3,12: aaaaaaabc +>abbc +0,4: abbc +>xxabbbbbc +2,9: abbbbbc +>abcccccccccc +0,12: abcccccccccc +>abccccccccccd +0,12: abcccccccccc +>xxxxxxxaaaaaaaaaabbbbbbbbbbbccccccccccc +16,29: abbbbbbbbbbbc +>xxxbbbbbbbbbc +11,13: bc + +# Another complex string +/a+bc|ab+c|abc+/ +>xxxbc +:NOMATCH +>xaaabc +1,6: aaabc +>zzzzaaaaabbc +8,12: abbc +>zzzzaaaabbbbbbcccc +7,15: abbbbbbc + +# Simple pattern +/a.c/ +>abc +0,3: abc +>aaac +1,4: aac +>xac +:NOMATCH +>xaac +1,4: aac +>xxabc +2,5: abc +>xxxaxc +3,6: axc + +# Another simple pattern +/a*c/ +>c +0,1: c +>xxxxxxxxc +8,9: c +>xxxxxxxcc +7,8: c +>ac +0,2: ac +>aaaac +0,5: aaaac +>xac +1,3: ac +>xxxaac +3,6: aac +>xxac +2,4: ac +>xxxxac +4,6: ac + +# Another simple pattern +/a+c/ +>xxaac +2,5: aac +>xxxaaaac +3,8: aaaac +>xaaaabac +6,8: ac +>xxxc +:NOMATCH +>xxxxaaaaccc +4,9: aaaac + +# Another simple pattern +/a{4}b/ +>xabxxaabxxxaaabxxxxaaaab +19,24: aaaab +>aaabaaaab +4,9: aaaab + +# Another simple pattern +/a{4,}b/ +>xxxaaaab +3,8: aaaab +>zaaabzzzaaaaaaaaaaaaaaaab +8,25: aaaaaaaaaaaaaaaab + +# Another simple pattern +/a{,4}b/ +>b +0,1: b +>xxxxxxxxb +8,9: b +>xaaaaaaaaab +6,11: aaaab +>xxxab +3,5: ab +>aaaaaxaaab +6,10: aaab + +# Another simple pattern +/a{2,4}b/ +>xab +:NOMATCH +>xaab +1,4: aab +>xaaab +1,5: aaab +>xxaaaab +2,7: aaaab +>xxxaaaaab +4,9: aaaab + +# Some simple grouping tests +/foo(bar|baz)fee/ +>feebarbazfoobarfee +9,18: foobarfee +12,15: bar +>foofooobazfeefoobazfee +13,22: foobazfee +/f(oo|ee)ba[rz]/ +>barfoebaz +:NOMATCH +>bazfoobar +3,9: foobar +4,6: oo +>barfeebaz +3,9: feebaz +4,6: ee +/\<(int|char)\>/ +>aint character int foo +15,18: int +15,18: int + +# Some complex repetitions +/foo.*bar/ +>barfoblaboofoobarfoobarfoobar +11,17: foobar +/foo.+bar/ +>foobar +:NOMATCH +>fobbarfooxbarfooybar +6,13: fooxbar +/foo.?bar/ +>xfoobar +1,7: foobar +>xxfooxxbar +:NOMATCH +>yyyfootbar +3,10: footbar + +# Some nested complex repetitions +/a.*b.*c/ +>abc +0,3: abc +>xxxxxxxxxabbbbbbbccaaaaabbbc +9,18: abbbbbbbc +/a.+b.*c/ +>xxxabc +:NOMATCH +>xxaxbbc +2,7: axbbc +/a.+b.?c/ +>xaabc +1,5: aabc +>xxaabbc +2,7: aabbc + +# Very complex repetitions +/(foo.*|bar)fee/ +# XXX NOTE +# This pattern does not return the correct offset for the group. +# Support for this may and may not be added. + +>barfoofee +3,9: foofee +>foobarfee +0,9: foobarfee +>xxfobarfee +4,10: barfee +>barfooooooobarfee +3,17: fooooooobarfee +>xxfobarfeefoobar +4,10: barfee +/(foo.+|bar)fee/ +>barfoofee +:NOMATCH +>barfooxfee +3,10: fooxfee +/(foo.?|bar)fee/ +>foobar +:NOMATCH +>bafoofee +2,8:foofee +>bafooofeebarfee +2,9: fooofee +>bafoofeebarfee +2,8: foofee + +# Simple backreference +/(a|b|c)\1/ +>aa +0,2: aa +0,1: a +/(a|b|c)(a|b|c)\1\2/ +>acac +0,4: acac +0,1: a +1,2: c +>xxxxacac +4,8: acac +4,5: a +5,6: c +>xxacabacbcacbbacbcaaccabcaca +24,28: caca +24,25: c +25,26: a +>xyabcccc +4,8: cccc +4,5: c +5,6: c + +# Complex backreference +/(a*b)\1/ +>xxxaaaaabaaaaab +3,15: aaaaabaaaaab +3,9: aaaaab +/(ab+c)\1/ +>xaaabbbcabbbc +3,13: abbbcabbbc +3,8: abbbc +/(ab?c)\1/ +>abcac +:NOMATCH +>abcacabcabc +5,11: abcabc +5,8: abc +>abcacac +3,7: acac +3,5: acac + +# Very complex backreference +/a(.*)b\1/ +>xxxab +3,5: ab +4,4: +>xxxxazzzbzzz +4,12: azzzbzzz +5,8: zzz + +# Case testing +/abc/i +>AbC +0,3: AbC +/[0-9][a-z]+/i +>xxx0aaZxYT9 +3,10: 0aaZxYT +/a.b/i +>aaaaaaaaaaaxB +10,13: axB +/a.*z/i +>xxxAaaaaZ +3,9: AaaaaZ +>xxaaaZaaa +2,6: aaaZ +/\<(lambda|defun|defmacro)\>/i +> (lambda +5,11: lambda +5,11: lambda +/\<(nil|t)\>/i +>it Nil +3,6: Nil +3,6: Nil +/\<(begin|end)\>/i +>beginning the ending EnD +21,24: EnD +21,24: EnD + +# Some newline tests +/a.*/n +>a\naaa +0,1:a +>xyza\naa +3,4: a +/a.+/n +>a\naaa +2,5: aaa +>xyza\naa +5,7: aa +/a.?/n +>a\naaa +0,1: a +>xyza\naa +3,4: a + +# Newline tests envolving complex patterns +/a.*b.*c/n +>xxaa\nzyacb\nabc +11,14: abc +>xxxab\nabc\nc +6,9: abc +/a.+b.*c/n +>ab\nbc\nabbc +6,10: abbc +/a.?b.*c/n +>ab\ncabbc\ncc +4,8: abbc +/^foo$/n +>bar\nfoobar\nfoo +11,14: foo + +# Not so complex test involving a newline... +/^\s*#\s*(define|include)\s+.+/n +>#define\n#include x +8,18: #include x +9,16: include + +# Check if large strings are working +/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ +>zzzxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxzzz +3,259: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx +/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~/ +>String here: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~/ +13,333: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890~ + + +# Some complex repetitions not supported +# Listed here only to make sure the library is not crashing on these +# Repetitions that match an empty match, or an empty string cannot follow +# a complex repetition. A complex repetition is: +# .* or .+ or .? +# .{...} is not supported. +/(.*)(\d*)/ +:BADRPT +/(.*).(\d*)/ +:BADRPT +/(.*)\<(\d*)/ +:BADRPT +/(.*)\s(\d*)/ +:BADRPT +/(.*)\D(\d*)/ +:BADRPT + +# This is a more clear pattern and partially works +/(.*)\D(\d+)/ +>abcW12 +0,6: abcW12 +0,3: abc +4,6: 12 +>abcW12abcW12 +0,6: abcW12 +0,3: abc +4,6: 12 +# This wasn't working in the previous version, but now with only minimal +# matches supported, it works. +>abcW12abcW12a +0,6: abcW12 +0,3: abc +4,6: 12 + +# Note the minimal match +/.*\d/ +>a1a1a1aaaaaaa +0,2: a1 +# Check match offsets +/(.*)\d/ +>a1a1a1aaaaaaa +0,2: a1 +0,1: a +/.*(\d)/ +>a1a1a1aaaaaaa +0,2: a1 +1,2: 1 + +/.*(\d+)/ +:BADRPT -- cgit v1.2.3