php-字符串搜索函数调研

将一个字符串插入到另外一个字符串中间，是php中常常出现的一个操作，具体的实现方案也有替换和直接插入两种，涉及到的相关函数也不少，下面就是我在一个实际工作当中处理字符串插入的效率的调研。

场景：

在一个 1MB左右的文本文件(content)中的 “hello” 前插入一个10个字节大小的字符串(string)。

采取的方案：（机器性能不同，用相对的比较方式对比性能）

编号	方式	性能(此处是相对值，仅表示各函数的相对速度大小)	备注
A	【替换】str_replace(‘#hello#’,$string,$content)	3	备注
B	【替换】strtr($content,array(”=>$script));	42	备注
C	【正则替换】preg_replace(‘##hello##’,$script,$content);	4.4	备注
D	【Substr_replace +strops 插入】substr_replace($content,$script,strpos($content,”),0)	1	备注

A 方案 str_replace() 函数

函数签名：

mixed str_replace ( mixed $search , mixed $replace , mixed $subject [, int &$count ] )

参数
如果 search 和 replace 为数组，那么 str_replace() 将对 subject 做二者的映射替换。如果 replace 的值的个数少于 search 的个数，多余的替换将使用空字符串来进行。如果 search 是一个数组而 replace 是一个字符串，那么 search 中每个元素的替换将始终使用这个字符串。该转换不会改变大小写。

如果 search 和 replace 都是数组，它们的值将会被依次处理。

search
查找的目标值，也就是 needle。一个数组可以指定多个目标。
replace
search 的替换值。一个数组可以被用来指定多重替换。
subject
执行替换的数组或者字符串。也就是 haystack。

如果 subject 是一个数组，替换操作将遍历整个 subject，返回值也将是一个数组。
count
如果被指定，它的值将被设置为替换发生的次数。
返回值 ¶
该函数返回替换后的数组或者字符串。

输入的类型可以是字符串或者数组，先对字符串进行分析

情况1．输入的 search 和 replace 均为字符串

替换的过程首先要查找 search ，在一个字符串中查找另外一个字符串，就变成了经典的 ‘find needls in haystack’ 问题了。
在源码中，调用情况是 str_replace->php_str_to_str_ex[字符串替换]->zend_memstr php_str_to_str_ex在 ext/standerd/string.c 的php_str_to_str_ex 顾名思义，他就是主要负责处理字符串替换，而 zend_memstr 主要是查找 needles

zend_memstr 源码：

static inline char *
zend_memnstr(char *haystack, char *needle, int needle_len, char *end)
{
    char *p = haystack;
    char ne = needle[needle_len-1];
     
    if (needle_len == 1) {
        return (char *)memchr(p, *needle, (end-p));
    }

    if (needle_len > end-haystack) {
        return NULL;
    }

    end -= needle_len;

    while (p <= end) {
      // 如果在 p 指向的前(end - p +1)个空间查找到 needle 的首字符，便比较needle的最后一个字符和 p 当前指向的位置 + needle_len 个字符;
      // 这个算法在 needle的第一个字符不会密集出现在 haystack 中时，会比较高效，memchr 会带来较大的 移动步长，相比于 KMP，Horspol 算法，简单，易懂
      if ((p = (char *)memchr(p, *needle, (end-p+1))) && ne == p[needle_len-1]) {
            if (!memcmp(needle, p, needle_len-1)) {
                return p;
            }
        }

        if (p == NULL) {
            return NULL;
        }

        p++;
    }

    return NULL;
}

在使用 zend_memstr 查找到 search 的内存地址之后，php_str_to_str_ex 就会使用memcpy ，用 to 来替换 from。

【值得注意的是，在php_str_to_str_ex 中，会一直查找 subject 中的 search ，知道把他们全部替换为 replace】也就是说，这个过程在发生一次之后不会退出，直到循环，将所有 content 里的 to 替换成 from 才会退出。

当 str_replace 输入为数组时，使用的算法和输入为 string 是一样的，只不过相当于循环的处理输入的数组，将他们单个取出，像对待字符串一样去处理。

方案B strtr() 函数

说明 ¶
string strtr ( string $str , string $from , string $to )

string strtr ( string $str , array $replace_pairs )

该函数返回 str 的一个副本，并将在 from 中指定的字符转换为 to 中相应的字符。比如， $from[$n]中每次的出现都会被替换为 $to[$n]，其中 $n 是两个参数都有效的位移(offset)。

如果 from 与 to 长度不相等，那么多余的字符部分将被忽略。 str 的长度将会和返回的值一样。

参数 ¶

str

待转换的字符串

from

字符串中与将要被转换的目的字符 to 相对应的源字符

to

字符串中与将要被转换的字符 from 相对应的目的字符

replace_pairs

参数 replace_pairs 可以用来取代 to 和 from 参数，因为它是以 array(‘from’ => ‘to’, …) 格式出现的数组。

返回值 ¶
返回转换后的字符串。

如果 replace_pairs 中包含一个空字符串（“”）键，那么将返回 FALSE。 If the str is not a scalar then it is not typecasted into a string, instead a warning is raised and NULLis returned.

strtr 有两种输入情况，源码中也是将它分成了两个子过程来处理：

输入为字符串：

PHPAPI char *php_strtr(char *str, int len, char *str_from, char *str_to, int trlen)
{
    int i;
    unsigned char xlat[256];

    if ((trlen < 1) || (len < 1)) {
        return str;
    }
    // 一个 256 大小的数组,代表 ascll 码
    for (i = 0; i < 256; xlat[i] = i, i++);
    // 用 to 来替换指定的from字符的 ascll
    for (i = 0; i < trlen; i++) {
        xlat[(unsigned char) str_from[i]] = str_to[i];
    }
    // 重新赋值
    for (i = 0; i < len; i++) {
        str[i] = xlat[(unsigned char) str[i]];
    }

    return str;
}

输入为数组：

/* {{{ php_strtr_array
 */
static void php_strtr_array(zval *return_value, char *str, int slen, HashTable *hash)
{
    zval **entry;
    char  *string_key;
    uint   string_key_len;
    zval **trans;
    zval   ctmp;
    ulong num_key;
    int minlen = 128*1024;
    int maxlen = 0, pos, len, found;
    char *key;
    HashPosition hpos;
    smart_str result = {0};
    HashTable tmp_hash;

    zend_hash_init(&tmp_hash, zend_hash_num_elements(hash), NULL, NULL, 0);
    zend_hash_internal_pointer_reset_ex(hash, &hpos);
    while (zend_hash_get_current_data_ex(hash, (void **)&entry, &hpos) == SUCCESS) {
        switch (zend_hash_get_current_key_ex(hash, &string_key, &string_key_len, &num_key, 0, &hpos)) {
            case HASH_KEY_IS_STRING:
                len = string_key_len-1;
                if (len < 1) {
                    zend_hash_destroy(&tmp_hash);
                    RETURN_FALSE;
                }
                zend_hash_add(&tmp_hash, string_key, string_key_len, entry, sizeof(zval*), NULL);
                if (len > maxlen) {
                    maxlen = len;
                }
                if (len < minlen) {
                    minlen = len;
                }
                break;

            case HASH_KEY_IS_LONG:
                Z_TYPE(ctmp) = IS_LONG;
                Z_LVAL(ctmp) = num_key;

                convert_to_string(&ctmp);
                len = Z_STRLEN(ctmp);
                zend_hash_add(&tmp_hash, Z_STRVAL(ctmp), len+1, entry, sizeof(zval*), NULL);
                zval_dtor(&ctmp);

                if (len > maxlen) {
                    maxlen = len;
                }
                if (len < minlen) {
                    minlen = len;
                }
                break;
        }
        zend_hash_move_forward_ex(hash, &hpos);
    }

    key = emalloc(maxlen+1);
    pos = 0;
    // 替换的步骤在这里
    while (pos < slen) {
        if ((pos + maxlen) > slen) {
            maxlen = slen - pos;
        }

        found = 0;
     // key 是每一次需要被替换的字符串
        memcpy(key, str+pos, maxlen);

        for (len = maxlen; len >= minlen; len--) {
            key[len] = 0;
    // trans 被赋值为新的字符串
            if (zend_hash_find(&tmp_hash, key, len+1, (void**)&trans) == SUCCESS) {
                char *tval;
                int tlen;
                zval tmp;

                if (Z_TYPE_PP(trans) != IS_STRING) {
                    tmp = **trans;
                    zval_copy_ctor(&tmp);
                    convert_to_string(&tmp);
                    tval = Z_STRVAL(tmp);
                    tlen = Z_STRLEN(tmp);
                } else {
                    tval = Z_STRVAL_PP(trans);
                    tlen = Z_STRLEN_PP(trans);
                }
      // 将 trans 附加到一个新的字符串，最终用来被返回
                smart_str_appendl(&result, tval, tlen);
                pos += len;
                found = 1;

                if (Z_TYPE_PP(trans) != IS_STRING) {
                    zval_dtor(&tmp);
                }
                break;
            }
        }

        if (! found) {
            smart_str_appendc(&result, str[pos++]);
        }
    }

    efree(key);
    zend_hash_destroy(&tmp_hash);
    smart_str_0(&result);
    RETVAL_STRINGL(result.c, result.len, 0);
}

从 strtr 的执行过程来看，strtr 也是需要遍历整个 subject 来替换 to -> from，这个也是它在我们的场景中他的效率比较慢的原因。

方案C 正则替换

正则替换涉及到正则解析引擎，这个应该也是需要遍历 subject 来替换 to -> from。

方案D substr_replace +strops 插入

substr_replace 函数签名：

mixed substr_replace ( mixed $string , mixed $replacement , mixed $start [, mixed$length ] )

substr_replace() 在字符串 string 的副本中将由 start 和可选的 length 参数限定的子字符串使用 replacement进行替换。

strpos 函数签名：

说明 ¶
mixed strpos ( string $haystack , mixed $needle [, int $offset = 0 ] )

返回 needle 在 haystack 中首次出现的数字位置。

使用 strpos 定位目标串位置，然后 substr_replace 快速替换。strpos 使用上面提到的 zend_memstr 查找 needle，然后 substr_replace 替换，替换的核心操作就是涉及到内存的分配。这个方法，只是会在目标串中查找一次needle，相比之前的方法，减少了目标串的遍历，由于我们的目标串较大，所以这个方法在相对的比较中，占据了较大的优势.

总结：

在网上搜索字符串替换方案时，会有各种五花八门的解释和分析，但是都缺少原理上的分析和具体场景。在这次调研过程中，我也发现php在一个问题的解决上，有较多的解决办法，看似相同，却又有区别，这个就要对源码较清晰的了解方法的执行过程，才能写出较好性能的程序。