
feat: support Spark-compatible string_to_map function#20120

Open
unknowntpo wants to merge 27 commits into apache:main from unknowntpo:feat-string-to-map-fn

Conversation

@unknowntpo

@unknowntpo unknowntpo commented Feb 3, 2026

Which issue does this PR close?

Rationale for this change

  • Apache Spark's str_to_map creates a map by splitting a string into key-value pairs using delimiters.
  • This function is used in Spark SQL and needed for DataFusion-Comet compatibility.
  • The LAST_WIN policy for handling duplicate keys will be implemented in a follow-up PR.
  • Reference: https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map

What changes are included in this PR?

  • Add Spark-compatible str_to_map function in datafusion-spark crate
  • Function signature: str_to_map(text, [pairDelim], [keyValueDelim]) -> Map<String, String>
    • text: The input string
    • pairDelim: Delimiter between key-value pairs (default: ,)
    • keyValueDelim: Delimiter between key and value (default: :)
  • Located in function/map/ module (returns Map type)

Examples

SELECT str_to_map('a:1,b:2,c:3');
-- {a: 1, b: 2, c: 3}

SELECT str_to_map('a=1;b=2', ';', '=');
-- {a: 1, b: 2}

SELECT str_to_map('key:value');
-- {key: value}
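For illustration, the splitting behavior in these examples can be sketched in plain Rust (std only; a simplified sketch, not the PR's actual arrow-based implementation — the `str_to_map` function and its return type here are illustrative):

```rust
/// Split `text` into (key, value) pairs using the given delimiters.
/// A pair that lacks the key-value delimiter yields a `None` value,
/// mirroring the behavior described in this PR.
fn str_to_map(text: &str, pair_delim: &str, kv_delim: &str) -> Vec<(String, Option<String>)> {
    text.split(pair_delim)
        .map(|pair| {
            // Split each pair on the first occurrence of the key-value delimiter.
            let mut kv = pair.splitn(2, kv_delim);
            let key = kv.next().unwrap_or("").to_string();
            let value = kv.next().map(str::to_string);
            (key, value)
        })
        .collect()
}
```

With the default delimiters, `str_to_map("a:1,b:2,c:3", ",", ":")` produces the three pairs shown in the first example above.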

Are these changes tested?

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Feb 3, 2026
@unknowntpo unknowntpo force-pushed the feat-string-to-map-fn branch from 7fa0217 to f97c89e Compare February 3, 2026 02:12
for row_idx in 0..num_rows {
if text_array.is_null(row_idx) {
null_buffer[row_idx] = false;
offsets.push(*offsets.last().unwrap());
Contributor

Will the last() call return None?

Contributor

no, offsets is initialized with one element 0

Author

@unknowntpo unknowntpo Feb 3, 2026

No, last() will never return None here. The offsets vector is initialized with vec![0], so it always has at least one element before the loop starts.

I've refactored this and introduced a current_offset variable to avoid confusion.
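The offsets pattern under discussion can be illustrated with a std-only sketch (the function name and shape here are assumed for illustration; the PR builds Arrow list offsets, where a null or empty row repeats the running offset):

```rust
/// Build Arrow-style list offsets: offsets[i]..offsets[i+1] is row i's
/// entry range. The vector starts with a single 0, so there is always a
/// previous offset; tracking `current_offset` makes that explicit.
fn build_offsets(entry_counts: &[usize]) -> Vec<i32> {
    let mut offsets = Vec::with_capacity(entry_counts.len() + 1);
    let mut current_offset: i32 = 0;
    offsets.push(current_offset);
    for &count in entry_counts {
        // A count of 0 (e.g. a null row) simply repeats the running offset.
        current_offset += count as i32;
        offsets.push(current_offset);
    }
    offsets
}
```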

# Test cases derived from Spark test("StringToMap"):
# https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
#
# Note: Duplicate key handling uses LAST_WIN policy (not EXCEPTION which is Spark default)

Author

I'll leave the LAST_WIN policy as a TODO, because it requires a modification of datafusion-comet.

/// <https://spark.apache.org/docs/latest/api/sql/index.html#str_to_map>
///
/// Creates a map from a string by splitting on delimiters.
/// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
Contributor

nit (I copied this from the Spark link above): this should stay consistent with the Spark doc, IMO

Suggested change
/// string_to_map(text, pairDelim, keyValueDelim) -> Map<String, String>
/// str_to_map(text[, pairDelim[, keyValueDelim]]) -> Map<String, String>

Author

You're right, I'll change to str_to_map.

}

fn string_to_map_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
let text_array = &args[0];
Contributor

@dentiny dentiny Feb 3, 2026

Curious: should we check (or assert, if the signature already guards against bad usage) that the arg count cannot be >= 4? And add a unit test?

Author

        Self {
            signature: Signature::one_of(
                vec![
                    // string_to_map(text)
                    TypeSignature::String(1),
                    // string_to_map(text, pairDelim)
                    TypeSignature::String(2),
                    // string_to_map(text, pairDelim, keyValueDelim)
                    TypeSignature::String(3),
                ],
                Volatility::Immutable,
            ),
            aliases: vec![String::from("str_to_map")],
        }

The signature makes sure this cannot happen, but I've added an assertion for defense.


}

/// Extract scalar string value from array (assumes all values are the same)
fn get_scalar_string(array: &ArrayRef) -> Result<String> {
Contributor

Suggested change
fn get_scalar_string(array: &ArrayRef) -> Result<String> {
fn get_delimeter_scalar_string(array: &ArrayRef) -> Result<String> {

Do you think it matches the intention (since you clearly said it's delim parsing at L216)?

Author

Good catch, renamed to extract_delimiter_from_string_array with proper testing.

)
})?;

if string_array.len() == 0 {
Contributor

curious in which case will the len be 0? I thought we should assert the len 😲

Author

Good catch, I've added an assertion here.

// "a:" -> kv = ["a", ""] -> key="a", value=Some("")
// ":1" -> kv = ["", "1"] -> key="", value=Some("1")
let kv: Vec<&str> = pair.splitn(2, &kv_delim).collect();
let key = kv[0];
Contributor

let mut iter = pair.splitn(2, kv_delim);
let key = iter.next().unwrap_or("");
let value = iter.next(); // yields Option<&str> directly

so we don't need heap allocation for vector?

Author

fixed.
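The iterator-based split avoids the intermediate Vec, and its edge-case behavior matches the comments in the diff above. A small std-only sketch (the helper name `split_pair` is illustrative, not from the PR):

```rust
/// Split one pair on the first occurrence of `kv_delim` without
/// allocating a Vec; `value` is None when the delimiter is absent.
fn split_pair<'a>(pair: &'a str, kv_delim: &str) -> (&'a str, Option<&'a str>) {
    let mut iter = pair.splitn(2, kv_delim);
    let key = iter.next().unwrap_or("");
    let value = iter.next(); // None if `kv_delim` does not occur in `pair`
    (key, value)
}
```

For example, "a:" yields ("a", Some("")), ":1" yields ("", Some("1")), and a bare "a" yields ("a", None).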


// Split text into key-value pairs using pair_delim.
// Example: "a:1,b:2" with pair_delim="," -> ["a:1", "b:2"]
let pairs: Vec<&str> = text.split(&pair_delim).collect();
Contributor

for pair in text.split(pair_delim) {

to avoid heap allocation

Author

fixed.

}

fn name(&self) -> &str {
"string_to_map"
Contributor

Is there a reference to this alias? As far as I can tell Spark only has str_to_map

Author

You're right, I'll change to str_to_map.

};

// Process each row
let text_array = text_array
Contributor

let text_array = as_string_array(text_array)?;

Easier downcasting: https://docs.rs/datafusion/latest/datafusion/common/cast/fn.as_string_array.html

However we need to consider that other string types exist such as LargeUtf8 and Utf8View

Author

Okay, I've followed a similar pattern to parse_url.rs to match the input arguments' data types.

"Delimiter array should not be empty"
);

// In columnar execution, scalar delimiter is expanded to array to match batch size.
Contributor

We can't assume this; for example this is a valid test case that will fail:

query ?
SELECT string_to_map(col1, col2, col3)
FROM (VALUES ('a=1,b=2', ',', '='), ('x#9', ',', '#'), (NULL, ',', '=')) AS t(col1, col2, col3);
----
{a: 1, b: 2}
{x: 9}
NULL
  • Delimiters can vary per row

We should either choose to support only scalar delimiters for now (look at invoke_with_args and how we can work with ColumnarValues directly) or need to ensure we respect per-row delimiters

Author

Okay, I decided to support per-row delimiters, with test cases added.
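The per-row behavior the reviewer's SLT case exercises can be sketched std-only, with `Option` standing in for Arrow nullability (the function name and row representation here are illustrative, not the PR's actual arrow-based code):

```rust
/// Per-row parsing sketch: each row carries its own delimiters, and a
/// null text or null delimiter yields a null output row.
fn str_to_map_rows(
    texts: &[Option<&str>],
    pair_delims: &[Option<&str>],
    kv_delims: &[Option<&str>],
) -> Vec<Option<Vec<(String, Option<String>)>>> {
    texts
        .iter()
        .zip(pair_delims)
        .zip(kv_delims)
        .map(|((text, pd), kd)| match (*text, *pd, *kd) {
            (Some(text), Some(pd), Some(kd)) => Some(
                text.split(pd)
                    .map(|pair| {
                        // First occurrence of the key-value delimiter splits the pair.
                        let mut kv = pair.splitn(2, kd);
                        (
                            kv.next().unwrap_or("").to_string(),
                            kv.next().map(str::to_string),
                        )
                    })
                    .collect(),
            ),
            // Any null input (text or delimiter) nullifies the whole row.
            _ => None,
        })
        .collect()
}
```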

// Test cases derived from Spark ComplexTypeSuite:
// https://github.com/apache/spark/blob/v4.0.0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala#L525-L618
#[test]
fn test_string_to_map_cases() {
Contributor

Is it possible to move all these test cases to SLTs?

Author

Okay, all tests are moved to SLTs.

let mut null_buffer = vec![true; num_rows];

for row_idx in 0..num_rows {
if text_array.is_null(row_idx) {
Contributor

If we decide to support per-row delimiters we'll need to consider their nullability; could consider using NullBuffer::union to build the final nullbuffer upfront once, though keep in mind we'll have up to 3 input arrays

Author

Fixed, thanks for the suggestion!

keys_builder.append_value("");
values_builder.append_null();
current_offset += 1;
offsets.push(current_offset);
Contributor

Have we considered using MapBuilder here?

Author

MapBuilder is more convenient and elegant; I've refactored my code to use it. Please take a look.

unknowntpo and others added 22 commits February 6, 2026 19:53
Adds string_to_map (alias: str_to_map) function that creates a map
from a string by splitting on delimiters.

- Supports 1-3 args: text, pair_delim (default ','), key_value_delim (default ':')
- Returns Map<Utf8, Utf8>
- NULL input returns NULL
- Empty string returns {"": NULL} (Spark behavior)
- Missing key_value_delim results in NULL value
- Duplicate keys: last wins (LAST_WIN policy)

Test cases derived from Spark v4.0.0 ComplexTypeSuite.scala.
The function returns Map type so it belongs in the map module.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Follows the source code move in the previous commit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace len==0 check with assert (delimiter array should never be empty)
- Add comment explaining scalar expansion in columnar execution
- Add unit test for delimiter extraction (single, multi-char, expanded scalar)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add multi-row test with default delimiters
- Add multi-row test with custom delimiters (comma and equals)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace `offsets.last().unwrap()` with explicit `current_offset` tracking
- Add table-driven unit tests covering s0-s6 Spark test cases + null input
- Add multi-row test demonstrating Arrow MapArray internal structure
- Import NullBuffer at module level for cleaner code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Documents current behavior and adds TODO for Spark's EXCEPTION default.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pure file rename, no content changes. Prepares for the
function name change in the next commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename struct, function name, and all references from
string_to_map to str_to_map. Remove alias.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite str_to_map to use arrow MapBuilder instead of manual
  offsets + map_from_keys_values_offsets_nulls
- Default to EXCEPTION policy for duplicate keys (Spark 3.0+ default)
- Support per-row delimiters (extract delimiter per row, not once)
- Null delimiter produces null map row

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…View

Dispatch with explicit type matching per arg count (like parse_url),
using datafusion_common::cast helpers instead of AsArray trait.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
unknowntpo and others added 4 commits February 6, 2026 19:53
Addresses review comment: delimiters can vary per row when passed
as columns rather than literals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to_map

Replace per-row is_null() checks with a precomputed combined NullBuffer
using bitmap-level AND via NullBuffer::union, as suggested in PR review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move seen_keys HashSet outside the row loop and clear() it each
iteration, reusing the backing allocation instead of allocating a
new HashSet per row.
…Spark error message

- Remove redundant Rust unit tests (all covered by SLT)
- Extract DEFAULT_PAIR_DELIM and DEFAULT_KV_DELIM constants
- Match Spark's exact DUPLICATED_MAP_KEY error message
- Add TODO for configurable mapKeyDedupPolicy (LAST_WIN) in follow-up PR
@unknowntpo unknowntpo force-pushed the feat-string-to-map-fn branch from b025293 to 82bba6c Compare February 6, 2026 12:02
Already documented at the duplicate key test case section.
@unknowntpo unknowntpo marked this pull request as ready for review February 6, 2026 12:04
@unknowntpo unknowntpo force-pushed the feat-string-to-map-fn branch from 82bba6c to 1e44dfa Compare February 6, 2026 12:05

Labels

spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants