
feat: support Spark-compatible string_to_map function#20120

Open
unknowntpo wants to merge 27 commits into apache:main from unknowntpo:feat-string-to-map-fn

Conversation

@unknowntpo

@unknowntpo unknowntpo commented Feb 3, 2026

Which issue does this PR close?

Rationale for this change

  • Apache Spark's str_to_map creates a map by splitting a string into key-value pairs using delimiters.
  • This function is used in Spark SQL and needed for DataFusion-Comet compatibility.
  • The LAST_WIN policy for handling duplicate keys will be implemented in a follow-up PR.
  • Reference: https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map

What changes are included in this PR?

  • Add Spark-compatible str_to_map function in datafusion-spark crate
  • Function signature: str_to_map(text, [pairDelim], [keyValueDelim]) -> Map<String, String>
    • text: The input string
    • pairDelim: Delimiter between key-value pairs (default: ,)
    • keyValueDelim: Delimiter between key and value (default: :)
  • Located in function/map/ module (returns Map type)

Examples

SELECT str_to_map('a:1,b:2,c:3');
-- {a: 1, b: 2, c: 3}

SELECT str_to_map('a=1;b=2', ';', '=');
-- {a: 1, b: 2}

SELECT str_to_map('key:value');
-- {key: value}
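For illustration, the splitting behavior in these examples can be sketched in plain Rust (std only; a simplified sketch, not the PR's actual arrow-based implementation — the `str_to_map` function and its return type here are illustrative):

```rust
/// Split `text` into (key, value) pairs using the given delimiters.
/// A pair that lacks the key-value delimiter yields a `None` value,
/// mirroring the behavior described in this PR.
fn str_to_map(text: &str, pair_delim: &str, kv_delim: &str) -> Vec<(String, Option<String>)> {
    text.split(pair_delim)
        .map(|pair| {
            // Split each pair on the first occurrence of the key-value delimiter.
            let mut kv = pair.splitn(2, kv_delim);
            let key = kv.next().unwrap_or("").to_string();
            let value = kv.next().map(str::to_string);
            (key, value)
        })
        .collect()
}
```

With the default delimiters, `str_to_map("a:1,b:2,c:3", ",", ":")` produces the three pairs shown in the first example above.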

Are these changes tested?

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Feb 3, 2026
@unknowntpo unknowntpo force-pushed the feat-string-to-map-fn branch from 7fa0217 to f97c89e Compare February 3, 2026 02:12
for row_idx in 0..num_rows {
if text_array.is_null(row_idx) {
null_buffer[row_idx] = false;
offsets.push(*offsets.last().unwrap());
Contributor

Will the last() call return None?

Contributor

no, offsets is initialized with one element 0

Author

@unknowntpo unknowntpo Feb 3, 2026

No, last() will never return None here. The offsets vector is initialized with vec![0], so it always has at least one element before the loop starts.

I've refactored this and introduced a current_offset variable to avoid confusion.
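The offsets pattern under discussion can be illustrated with a std-only sketch (the function name and shape here are assumed for illustration; the PR builds Arrow list offsets, where a null or empty row repeats the running offset):

```rust
/// Build Arrow-style list offsets: offsets[i]..offsets[i+1] is row i's
/// entry range. The vector starts with a single 0, so there is always a
/// previous offset; tracking `current_offset` makes that explicit.
fn build_offsets(entry_counts: &[usize]) -> Vec<i32> {
    let mut offsets = Vec::with_capacity(entry_counts.len() + 1);
    let mut current_offset: i32 = 0;
    offsets.push(current_offset);
    for &count in entry_counts {
        // A count of 0 (e.g. a null row) simply repeats the running offset.
        current_offset += count as i32;
        offsets.push(current_offset);
    }
    offsets
}
```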

# Test cases derived from Spark test("StringToMap"):
# https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
#
# Note: Duplicate key handling uses LAST_WIN policy (not EXCEPTION which is Spark default)

Author

I'll leave the LAST_WIN policy as a TODO, because it requires a modification of datafusion-comet.

/// <https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map>
///
/// Creates a map from a string by splitting on delimiters.
/// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
Contributor

nit (I copied this from the Spark link above): this should stay consistent with the Spark doc, IMO

Suggested change
/// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
/// str_to_map(text[, pairDelim[, keyValueDelim]]) -> Map<String, String>

Author

You're right, I'll change to str_to_map.

}

fn string_to_map_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
let text_array = &args[0];
Contributor

@dentiny dentiny Feb 3, 2026

Curious: should we check (or assert, if the signature already guards against bad usage) that the arg count cannot be >= 4? And add a unit test?

Author

        Self {
            signature: Signature::one_of(
                vec![
                    // string_to_map(text)
                    TypeSignature::String(1),
                    // string_to_map(text, pairDelim)
                    TypeSignature::String(2),
                    // string_to_map(text, pairDelim, keyValueDelim)
                    TypeSignature::String(3),
                ],
                Volatility::Immutable,
            ),
            aliases: vec![String::from("str_to_map")],
        }

The signature makes sure this cannot happen, but I've added an assertion for defense.


}

/// Extract scalar string value from array (assumes all values are the same)
fn get_scalar_string(array: &ArrayRef) -> Result<String> {
Contributor

Suggested change
fn get_scalar_string(array: &ArrayRef) -> Result<String> {
fn get_delimeter_scalar_string(array: &ArrayRef) -> Result<String> {

Do you think it matches the intention (since you clearly said it's delim parsing at L216)?

Author

Good catch, renamed to extract_delimiter_from_string_array with proper testing.

)
})?;

if string_array.len() == 0 {
Contributor

curious in which case will the len be 0? I thought we should assert the len 😲

Author

Good catch, I've added an assertion here.

// "a:" -> kv = ["a", ""] -> key="a", value=Some("")
// ":1" -> kv = ["", "1"] -> key="", value=Some("1")
let kv: Vec<&str> = pair.splitn(2, &kv_delim).collect();
let key = kv[0];
Contributor

let mut iter = pair.splitn(2, kv_delim);
let key = iter.next().unwrap_or("");
let value = iter.next(); // yields Option<&str> directly

so we don't need heap allocation for vector?

Author

fixed.
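The iterator-based split avoids the intermediate Vec, and its edge-case behavior matches the comments in the diff above. A small std-only sketch (the helper name `split_pair` is illustrative, not from the PR):

```rust
/// Split one pair on the first occurrence of `kv_delim` without
/// allocating a Vec; `value` is None when the delimiter is absent.
fn split_pair<'a>(pair: &'a str, kv_delim: &str) -> (&'a str, Option<&'a str>) {
    let mut iter = pair.splitn(2, kv_delim);
    let key = iter.next().unwrap_or("");
    let value = iter.next(); // None if `kv_delim` does not occur in `pair`
    (key, value)
}
```

For example, "a:" yields ("a", Some("")), ":1" yields ("", Some("1")), and a bare "a" yields ("a", None).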


// Split text into key-value pairs using pair_delim.
// Example: "a:1,b:2" with pair_delim="," -> ["a:1", "b:2"]
let pairs: Vec<&str> = text.split(&pair_delim).collect();
Contributor

for pair in text.split(pair_delim) {

to avoid heap allocation

Author

fixed.

}

fn name(&self) -> &str {
"string_to_map"
Contributor

Is there a reference to this alias? As far as I can tell Spark only has str_to_map

Author

You're right, I'll change to str_to_map.

};

// Process each row
let text_array = text_array
Contributor

let text_array = as_string_array(text_array)?;

Easier downcasting: https://docs.rs/datafusion/latest/datafusion/common/cast/fn.as_string_array.html

However we need to consider that other string types exist such as LargeUtf8 and Utf8View

Author

Okay, I've followed a similar pattern to parse_url.rs to match the input arguments' data types.

"Delimiter array should not be empty"
);

// In columnar execution, scalar delimiter is expanded to array to match batch size.
Contributor

We can't assume this; for example this is a valid test case that will fail:

query ?
SELECT string_to_map(col1, col2, col3)
FROM (VALUES ('a=1,b=2', ',', '='), ('x#9', ',', '#'), (NULL, ',', '=')) AS t(col1, col2, col3);
----
{a: 1, b: 2}
{x: 9}
NULL
  • Delimiters can vary per row

We should either choose to support only scalar delimiters for now (look at invoke_with_args and how we can work with ColumnarValues directly) or need to ensure we respect per-row delimiters

Author

Okay, I decided to support per-row delimiters, with test cases added.
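The per-row behavior the reviewer's SLT case exercises can be sketched std-only, with `Option` standing in for Arrow nullability (the function name and row representation here are illustrative, not the PR's actual arrow-based code):

```rust
/// Per-row parsing sketch: each row carries its own delimiters, and a
/// null text or null delimiter yields a null output row.
fn str_to_map_rows(
    texts: &[Option<&str>],
    pair_delims: &[Option<&str>],
    kv_delims: &[Option<&str>],
) -> Vec<Option<Vec<(String, Option<String>)>>> {
    texts
        .iter()
        .zip(pair_delims)
        .zip(kv_delims)
        .map(|((text, pd), kd)| match (*text, *pd, *kd) {
            (Some(text), Some(pd), Some(kd)) => Some(
                text.split(pd)
                    .map(|pair| {
                        // First occurrence of the key-value delimiter splits the pair.
                        let mut kv = pair.splitn(2, kd);
                        (
                            kv.next().unwrap_or("").to_string(),
                            kv.next().map(str::to_string),
                        )
                    })
                    .collect(),
            ),
            // Any null input (text or delimiter) nullifies the whole row.
            _ => None,
        })
        .collect()
}
```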

// Test cases derived from Spark ComplexTypeSuite:
// https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
#[test]
fn test_string_to_map_cases() {
Contributor

Is it possible to move all these test cases to SLTs?

Author

Okay, all tests are moved to SLTs.

let mut null_buffer = vec![true; num_rows];

for row_idx in 0..num_rows {
if text_array.is_null(row_idx) {
Contributor

If we decide to support per-row delimiters we'll need to consider their nullability; could consider using NullBuffer::union to build the final nullbuffer upfront once, though keep in mind we'll have up to 3 input arrays

Author

Fixed, thanks for the suggestion!

keys_builder.append_value("");
values_builder.append_null();
current_offset += 1;
offsets.push(current_offset);
Contributor

Have we considered using MapBuilder here?

Author

MapBuilder is more convenient and elegant; I've refactored my code to use it. Please take a look.

unknowntpo and others added 22 commits February 6, 2026 19:53
Adds string_to_map (alias: str_to_map) function that creates a map
from a string by splitting on delimiters.

- Supports 1-3 args: text, pair_delim (default ','), key_value_delim (default ':')
- Returns Map<Utf8, Utf8>
- NULL input returns NULL
- Empty string returns {"": NULL} (Spark behavior)
- Missing key_value_delim results in NULL value
- Duplicate keys: last wins (LAST_WIN policy)

Test cases derived from Spark v4.0.0 ComplexTypeSuite.scala.
The function returns Map type so it belongs in the map module.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Follows the source code move in the previous commit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace len==0 check with assert (delimiter array should never be empty)
- Add comment explaining scalar expansion in columnar execution
- Add unit test for delimiter extraction (single, multi-char, expanded scalar)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add multi-row test with default delimiters
- Add multi-row test with custom delimiters (comma and equals)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace `offsets.last().unwrap()` with explicit `current_offset` tracking
- Add table-driven unit tests covering s0-s6 Spark test cases + null input
- Add multi-row test demonstrating Arrow MapArray internal structure
- Import NullBuffer at module level for cleaner code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Documents current behavior and adds TODO for Spark's EXCEPTION default.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pure file rename, no content changes. Prepares for the
function name change in the next commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename struct, function name, and all references from
string_to_map to str_to_map. Remove alias.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite str_to_map to use arrow MapBuilder instead of manual
  offsets + map_from_keys_values_offsets_nulls
- Default to EXCEPTION policy for duplicate keys (Spark 3.0+ default)
- Support per-row delimiters (extract delimiter per row, not once)
- Null delimiter produces null map row

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…View

Dispatch with explicit type matching per arg count (like parse_url),
using datafusion_common::cast helpers instead of AsArray trait.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
unknowntpo and others added 4 commits February 6, 2026 19:53
Addresses review comment: delimiters can vary per row when passed
as columns rather than literals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to_map

Replace per-row is_null() checks with a precomputed combined NullBuffer
using bitmap-level AND via NullBuffer::union, as suggested in PR review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move seen_keys HashSet outside the row loop and clear() it each
iteration, reusing the backing allocation instead of allocating a
new HashSet per row.
…Spark error message

- Remove redundant Rust unit tests (all covered by SLT)
- Extract DEFAULT_PAIR_DELIM and DEFAULT_KV_DELIM constants
- Match Spark's exact DUPLICATED_MAP_KEY error message
- Add TODO for configurable mapKeyDedupPolicy (LAST_WIN) in follow-up PR
@unknowntpo unknowntpo force-pushed the feat-string-to-map-fn branch from b025293 to 82bba6c Compare February 6, 2026 12:02
Already documented at the duplicate key test case section.
@unknowntpo unknowntpo marked this pull request as ready for review February 6, 2026 12:04
@unknowntpo unknowntpo force-pushed the feat-string-to-map-fn branch from 82bba6c to 1e44dfa Compare February 6, 2026 12:05

Labels

spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants