Slack Bot Intermittently Missing Workflow Submission Events #1440

@RajuKumar077

I'm experiencing an intermittent issue where my Slack bot (built with the async version of slack-bolt for Python) sometimes fails to receive or process workflow submission events. The bot works correctly most of the time, but occasionally a workflow submission doesn't trigger any of the bot's event handlers.

Environment:
Framework: slack-bolt (Python, async version)
Connection: Socket Mode with AsyncSocketModeHandler
Event Types: Listening to message events and app_mention events
Deployment: Dockerized application running in Kubernetes
Python Version: 3.11

Current Implementation Pattern:

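import asyncio
import os

from slack_bolt.async_app import AsyncApp
from slack_bolt.adapter.socket_mode.async_handler import AsyncSocketModeHandler

# Tokens are read from the environment (variable names are illustrative)
SLACK_BOT_TOKEN = os.environ["SLACK_BOT_TOKEN"]
SLACK_APP_TOKEN = os.environ["SLACK_APP_TOKEN"]

app = AsyncApp(token=SLACK_BOT_TOKEN)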

@app.event("message")
async def handle_messages(event, body, say, client, logger, ack):
    await ack()
    # Process workflow submissions from message events
    # ... detection and processing logic ...

@app.event("app_mention")
async def handle_mentions(event, body, say, client, logger, ack):
    await ack()
    # Delegate to main handler
    # ... processing logic ...

async def main():
    handler = AsyncSocketModeHandler(
        app=app,
        app_token=SLACK_APP_TOKEN,
        trace_enabled=True,
        ping_interval=30,
        auto_reconnect_enabled=True
    )
    await handler.start_async()

if __name__ == "__main__":
    asyncio.run(main())

Symptoms:
Intermittent Failures: About 5-10% of workflow submissions don't trigger the bot (no logs, no event received)
No Pattern: Failures seem random - same user, same channel, same workflow can work one time and fail the next
Silent Failures: When it fails, there's no error in logs - the event simply never arrives at the bot
Bot Appears Online: The bot shows as active in Slack, and other messages/mentions work fine
User Experience: Users submit the workflow, but receive no response (no acknowledgment, no ticket created)

What I've Already Tried:
✅ Immediate ACK: I'm calling await ack() immediately at the start of handlers
✅ Auto-Reconnect: Enabled auto_reconnect_enabled=True in Socket Mode handler
✅ Ping Interval: Set ping_interval=30 to maintain connection
✅ Trace Logging: Enabled trace_enabled=True for debugging
✅ Error Handling: Comprehensive try/except blocks around all async operations
✅ Deduplication: Implemented deduplication to prevent processing the same event twice
✅ Bot Permissions: Verified bot has all required scopes (channels:history, chat:write, users:read, etc.)
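For context on the deduplication item, here is a minimal sketch of the general shape of a TTL-based cache keyed on the `event_id` field of the event payload. The class name `SeenEvents`, the 300-second TTL, and the choice of key are illustrative assumptions, not the production code. The key is worth double-checking in any such cache, since keying on something less unique than `event_id` could silently drop legitimate submissions.

```python
import time
from typing import Callable, Dict


class SeenEvents:
    """Remember event IDs for a limited time so retries are processed once."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}  # event_id -> time first seen

    def first_time(self, event_id: str) -> bool:
        """Return True if event_id has not been seen within the TTL."""
        now = self._clock()
        # Drop expired entries so the cache does not grow without bound.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True
```

In a handler this would be used roughly as `if not seen.first_time(body["event_id"]): return`, with a log line on the early return so that dropped duplicates are visible.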
Observations:
When the bot does receive the event, everything works perfectly
The issue seems to be at the event delivery level, not processing
Restarting the bot doesn't consistently fix the issue
No correlation with server load, time of day, or specific channels
Slack's event delivery status shows "delivered" even when the bot doesn't receive it
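To separate "the event never arrived" from "the event arrived but was filtered out by my own logic", a global middleware that logs every payload before any listener runs can help. The following is a minimal sketch; the function name `log_incoming_event` and the log format are hypothetical, and it assumes Bolt's usual `body`, `logger`, and `next` middleware arguments.

```python
import logging
from typing import Any, Awaitable, Callable, Dict


async def log_incoming_event(body: Dict[str, Any],
                             logger: logging.Logger,
                             next: Callable[[], Awaitable[None]]) -> None:
    """Log a one-line summary of every payload, then continue the chain."""
    event = body.get("event", {})
    logger.info(
        "incoming event_id=%s type=%s subtype=%s ts=%s",
        body.get("event_id"),
        event.get("type"),
        event.get("subtype"),
        event.get("ts"),
    )
    await next()  # hand over to the remaining middleware and listeners


# Registration on the existing AsyncApp instance (not executed here):
# app.middleware(log_incoming_event)
```

If a missing submission shows up in this log, the loss is happening in the handler or the detection logic rather than in delivery.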
Questions:
Event Delivery Reliability: Are there known issues with Socket Mode event delivery reliability? Should I expect 100% delivery or is some loss expected?

Connection Health Monitoring: How can I monitor the Socket Mode connection health? Are there callbacks or metrics I should be tracking?

Reconnection Strategy: When auto_reconnect_enabled=True, how does the reconnection work? Could events be lost during reconnection?

Event Buffering: Does Slack buffer events during brief disconnections, or are they lost? Is there a way to request missed events?

Alternative Patterns: Should I be using a different event subscription pattern for critical workflows? Would HTTP-based Events API be more reliable than Socket Mode?

Debugging Approach: What's the best way to debug why specific events aren't being received? Are there Slack-side logs I can access?

Workflow-Specific Events: Is there a more reliable way to listen specifically for Workflow Builder submissions rather than generic message events?

Heartbeat/Health Check: Should I implement a custom heartbeat mechanism to detect connection issues proactively?
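For the heartbeat question, the kind of mechanism I have in mind is sketched below: record when the last event of any type arrived and raise an alert after a long silence. The class name `EventWatchdog`, its thresholds, and the `on_stale` callback are hypothetical, and it only uses the standard library. A known weakness is that a quiet workspace is indistinguishable from a dead connection, so the threshold would need tuning.

```python
import asyncio
import time
from typing import Awaitable, Callable


class EventWatchdog:
    """Track when the last event arrived and flag long silences."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence = max_silence_seconds
        self._clock = clock
        self._last_seen = clock()

    def touch(self) -> None:
        """Call from every event handler (or from a global middleware)."""
        self._last_seen = self._clock()

    def is_stale(self) -> bool:
        """True if no event has been recorded for longer than the threshold."""
        return self._clock() - self._last_seen > self._max_silence

    async def run(self, on_stale: Callable[[], Awaitable[None]],
                  check_every: float = 30.0) -> None:
        """Periodically check for silence; runs until the task is cancelled."""
        while True:
            await asyncio.sleep(check_every)
            if self.is_stale():
                await on_stale()
```

The `run` coroutine would be started with `asyncio.create_task(...)` alongside the Socket Mode handler, and `is_stale()` could also back a Kubernetes liveness probe.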

Additional Context:
The workflow submissions create rich text messages with structured blocks
I'm detecting workflows by analyzing the blocks structure in message events
The bot needs to be highly reliable as it creates support tickets - missed events mean lost user requests
I've considered implementing a fallback mechanism where users can retry, but ideally the bot should catch every submission
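To make the detection approach concrete, here is a simplified sketch of block-based detection. It is not the production logic: the function names, the `marker` string, and the assumption that workflow output always carries a `bot_id` and a `rich_text` block are all illustrative.

```python
from typing import Any, Dict, List


def collect_text(elements: List[Dict[str, Any]]) -> List[str]:
    """Recursively gather the text fragments nested inside block elements."""
    out: List[str] = []
    for el in elements:
        if isinstance(el.get("text"), str):
            out.append(el["text"])
        out.extend(collect_text(el.get("elements", [])))
    return out


def looks_like_workflow_submission(event: Dict[str, Any], marker: str) -> bool:
    """Heuristic: a bot-authored message whose rich text contains `marker`."""
    if not event.get("bot_id"):
        return False  # assumption: workflow output is posted by a bot user
    rich_blocks = [b for b in event.get("blocks", []) if b.get("type") == "rich_text"]
    text = " ".join(collect_text(rich_blocks))
    return marker in text
```

Because a heuristic like this returns `False` without any trace, a false negative would look exactly like a missed event unless the rejected messages are logged.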
What I'm Looking For:
Best practices for ensuring reliable event delivery in production Slack bots
Known issues or limitations with Socket Mode that I should be aware of
Recommended monitoring and alerting strategies
Alternative approaches to ensure no workflow submissions are missed
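One alternative I'm considering is a periodic reconciliation sweep: fetch recent channel messages (for example via `conversations.history`, which the `channels:history` scope should cover) and compare them against the timestamps already processed. The sketch below shows only the comparison step; `find_missed`, `processed_ts`, and `is_submission` are hypothetical names, and `history` stands for the list of message dicts returned by the API call.

```python
from typing import Any, Callable, Dict, Iterable, List, Set


def find_missed(history: Iterable[Dict[str, Any]],
                processed_ts: Set[str],
                is_submission: Callable[[Dict[str, Any]], bool]) -> List[Dict[str, Any]]:
    """Return submission messages from channel history that were never processed."""
    missed = [
        msg for msg in history
        if is_submission(msg) and msg.get("ts") not in processed_ts
    ]
    # Oldest first, so tickets are created in submission order.
    return sorted(missed, key=lambda m: float(m["ts"]))
```

This would trade some latency for completeness, since a missed submission is only picked up on the next sweep.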
Any insights, suggestions, or similar experiences would be greatly appreciated!
