guard timestamp_column in LocalDedupNode against missing DataFrame column #5985

Open
faustaround wants to merge 1 commit into feast-dev:master from faustaround:fix/dedup-node-timestamp-column-keyerror

@faustaround faustaround commented Feb 18, 2026

LocalDedupNode.execute unconditionally adds timestamp_column to
sort_keys, whereas created_timestamp_column (appended immediately after)
is only added when it passes an "in df.columns" check:

  sort_keys = [self.column_info.timestamp_column]   # no guard
  if (
      self.column_info.created_timestamp_column
      and self.column_info.created_timestamp_column in df.columns  # has guard
  ):
      sort_keys.append(self.column_info.created_timestamp_column)

When the feature view's timestamp_field column is not declared in the
feature schema, the DAG pipeline projects it away before the dedup node
runs. The column is present in the raw Redshift result but absent from the
DataFrame by the time drop_duplicates is called, causing:

  KeyError: '<timestamp_field_name>'
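
The failure can be reproduced outside Feast with plain pandas. This is a minimal sketch, not the LocalDedupNode code path: the column names (driver_id, event_ts, trips) are hypothetical, and it assumes the backend ends up sorting by the missing column, which is the point where pandas raises.

```python
import pandas as pd

# Hypothetical raw query result; "event_ts" stands in for timestamp_field.
raw = pd.DataFrame(
    {
        "driver_id": [1, 1, 2],
        "event_ts": pd.to_datetime(["2026-01-01", "2026-01-02", "2026-01-01"]),
        "trips": [10, 12, 7],
    }
)

# Projection keeps only the join key and the declared feature,
# so the timestamp column is gone before deduplication.
projected = raw[["driver_id", "trips"]]


def sort_by_missing_column(df: pd.DataFrame) -> str:
    """Sort on a column that was projected away and report what happens."""
    try:
        df.sort_values(by=["event_ts"], ascending=False)
    except KeyError as exc:
        return f"KeyError: {exc}"
    return "no error"


message = sort_by_missing_column(projected)
print(message)
```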

This affects any feature view where timestamp_field is an internal
bookkeeping column not exposed as a feature.

Apply the same guard to timestamp_column for consistency, and add a
fallback to deduplicate by key only when no sort columns survive (rather
than crashing).
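
The intended behaviour described above can be sketched with pandas alone. This is an illustration under assumptions, not the code in this PR: dedup_latest is a hypothetical helper, the column names are made up, and pandas' sort_values and drop_duplicates stand in for self.backend.drop_duplicates.

```python
from typing import List, Optional

import pandas as pd


def dedup_latest(
    df: pd.DataFrame,
    dedup_keys: List[str],
    timestamp_column: Optional[str],
    created_timestamp_column: Optional[str] = None,
) -> pd.DataFrame:
    """Keep one row per key, preferring the newest row when timestamps exist."""
    # Apply the same "in df.columns" guard to both timestamp columns.
    candidates = [timestamp_column, created_timestamp_column]
    sort_keys = [c for c in candidates if c and c in df.columns]

    if sort_keys:
        # Newest rows first, so keep="first" retains the latest row per key.
        df = df.sort_values(by=sort_keys, ascending=False)

    # Fallback: with no surviving sort columns, still deduplicate by key.
    return df.drop_duplicates(subset=dedup_keys, keep="first").reset_index(drop=True)


full = pd.DataFrame(
    {
        "driver_id": [1, 1, 2],
        "event_ts": pd.to_datetime(["2026-01-01", "2026-01-02", "2026-01-01"]),
        "trips": [10, 12, 7],
    }
)
projected = full[["driver_id", "trips"]]

with_ts = dedup_latest(full, ["driver_id"], "event_ts")
without_ts = dedup_latest(projected, ["driver_id"], "event_ts")
```

With the timestamp present, the latest row per driver_id is kept; with it projected away, the call no longer raises and still returns one row per driver_id, although which duplicate survives then depends on input order.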


…t in feature schema

timestamp_column is unconditionally added to sort_keys even when the column doesn't exist in the DataFrame (e.g. when the timestamp_field isn't declared in the feature view schema and gets projected away by the DAG pipeline). The adjacent created_timestamp_column already has an in df.columns guard; timestamp_column needs the same treatment.
@faustaround faustaround requested a review from a team as a code owner February 18, 2026 22:43
@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 potential issue.

Comment on lines +204 to +207
if sort_keys:
    df = self.backend.drop_duplicates(
        df, keys=dedup_keys, sort_by=sort_keys, ascending=False
    )
🔴 Deduplication silently skipped when no timestamp columns are present in DataFrame

When neither timestamp_column nor created_timestamp_column is present in the DataFrame columns, sort_keys will be empty and the if sort_keys: guard at line 204 causes the entire drop_duplicates call to be skipped. This means duplicate rows (by join key) pass through undetected.

Root Cause and Impact

The PR description states the intent is to "add a fallback to deduplicate by key only when no sort columns survive (rather than crashing)." However, the implementation at lines 204-207 simply skips deduplication entirely when sort_keys is empty:

if sort_keys:
    df = self.backend.drop_duplicates(
        df, keys=dedup_keys, sort_by=sort_keys, ascending=False
    )

When sort_keys is empty (falsy), no deduplication happens at all. The correct behavior should be to still deduplicate by dedup_keys alone, accepting that there is no timestamp ordering to decide which duplicate survives. For example, the node could call pandas' df.drop_duplicates(subset=dedup_keys) or an equivalent backend operation.

Impact: Any feature view where timestamp_field is an internal bookkeeping column not exposed in the feature schema will have its timestamp column projected away before the dedup node runs. In this case, duplicate entity rows will silently remain in the output, leading to incorrect feature values (e.g., duplicated rows in training datasets or multiple values written to the online store for the same entity key).
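
The difference between skipping deduplication and the key-only fallback can be shown with a small pandas sketch. The data and column names are hypothetical, and pandas calls are used here in place of self.backend.drop_duplicates, whose exact semantics are not shown in this thread.

```python
import pandas as pd

# Hypothetical frame after both timestamp columns were projected away.
df = pd.DataFrame({"driver_id": [1, 1, 2], "trips": [10, 12, 7]})
dedup_keys = ["driver_id"]
sort_keys = []  # nothing survived the "in df.columns" guards

# Behaviour flagged in the review: with empty sort_keys nothing runs.
skipped = df
if sort_keys:
    skipped = df.sort_values(by=sort_keys, ascending=False).drop_duplicates(
        subset=dedup_keys
    )

# Key-only fallback: still one row per entity key.
fallback = df.drop_duplicates(subset=dedup_keys)

print(skipped["driver_id"].duplicated().any())   # True: duplicates remain
print(fallback["driver_id"].duplicated().any())  # False: one row per key
```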

Suggested change
- if sort_keys:
-     df = self.backend.drop_duplicates(
-         df, keys=dedup_keys, sort_by=sort_keys, ascending=False
-     )
+ if sort_keys:
+     df = self.backend.drop_duplicates(
+         df, keys=dedup_keys, sort_by=sort_keys, ascending=False
+     )
+ else:
+     df = self.backend.drop_duplicates(
+         df, keys=dedup_keys, sort_by=dedup_keys, ascending=True
+     )