Skip to content

Adapt for multibyte strings as UTF-8 #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/MatchesSolver.php
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ protected function result(
array $matrix
) {
return array_map(function (Match $result) use ($stringA) {
$result->value = substr($stringA, $result->index(), $result->length);
$result->value = mb_substr($stringA, $result->index(), $result->length);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mb_substr() is lacking the optional fourth argument. The string will be assumed to be encoded using whatever happens to be the return value of mb_internal_encoding(). This may or may not be the encoding that the string uses.

We can try to get the correct encoding using mb_detect_encoding(). But, hey, that can get tricky since the return value might be false. Oh, and mb_detect_encoding() can get things wrong if a string has invalidly-encoded characters which is why we need to also set the optional third argument to true. Doing so has the (very useful) side benefit of significantly reducing the chance of an invalid encoding attack.

Not entirely sure how you'd work this into these changes, but here's an example of the logic you'd need to use:

function detectMultibyteStringEncoding(string $string): string
{
    $encoding = mb_detect_encoding($string, mb_detect_order(), true);

    // Return the detected encoding or 'utf-8' if not detected. 
    // This is just an example, ideally 'utf-8' would be defined in a well-named constant.
    return is_string($encoding) ? $encoding : 'utf-8';
}

// ...

$result->value = mb_substr($stringA, $result->index(), $result->length, $encoding);

The same applies to all calls to mb_*() functions that can accept an optional $encoding argument, of which there are some more in these set of changes.

Aren't mb strings fun!


return $result;
}, $longestIndexes);
Expand Down
12 changes: 9 additions & 3 deletions src/Solver.php
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ protected function result(
string $stringB,
array $matrix
) {
return count($longestIndexes) === 0 ? '' : substr($stringA, $longestIndexes[0], $longestLength);
return count($longestIndexes) === 0 ? '' : mb_substr($stringA, $longestIndexes[0], $longestLength);
}

/**
Expand All @@ -55,8 +55,14 @@ public function solve(string $stringA, string $stringB)
return call_user_func_array([$this, 'solve'], $arguments);
}

$charsA = str_split($stringA);
$charsB = str_split($stringB);
$charsA = [];
$charsB = [];
for ($i=0; $i < max(mb_strlen($stringA), 1); $i++) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate that this loop is present due to a lack of mb_str_split().

Can this be logic be factored out to a method (perhaps named mb_str_split() so that the code at this point reads more clearly?

$charsA[] = mb_substr($stringA, $i, 1);
}
for ($i=0; $i < max(mb_strlen($stringB), 1); $i++) {
$charsB[] = mb_substr($stringB, $i, 1);
}

$matrix = array_fill_keys(array_keys($charsA), array_fill_keys(array_keys($charsB), 0));
$longestLength = 0;
Expand Down
23 changes: 23 additions & 0 deletions test/suite/MatchesArrayProvidersTrait.php
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,29 @@ public function twoStringsOrderedMatchesArrayProvider()
],
],
],
'UTF-8' => [
'L’été était chaud.',
'L’hiver était froid.',
[
[
'value' => ' était ',
'length' => 7,
'indexes' => [5, 7],
],
],
],
// In UTF-8: é = 0xC3A9 and © = 0xC2A9 (the last byte is the same but the Unicode characters are different)
'UTF-8 (nasty)' => [
'L’été était chaud.',
'L’hiver ©tait froid.',
[
[
'value' => 'tait ',
'length' => 5,
'indexes' => [7, 9],
],
],
],
];
}
}