Tutorial · Apr 20, 2026 · 8 min

Detect Cron Job Failures Automatically: Complete Tutorial

By Govind Kavaturi

Dashboard showing automatic cron job failure detection with alerts and retry logic

Your agent runs every morning at 9 AM. It pulls data, processes it, and updates your dashboard. Yesterday it failed silently. You found out at 3 PM when a customer asked why their numbers were wrong. This is how to detect cron job failures automatically before your users do.

Most cron systems fire tasks and forget about them. They have no concept of success or failure. Your agent might crash, time out, or complete with errors. The scheduler marks it as "ran" and moves on. You need outcome tracking, not just execution tracking.

TL;DR: Replace platform cron jobs with CueAPI cues that track delivery, outcomes, and failures. Configure automatic retries and alerts. Your agents report success or failure. You get reliable failure detection for your scheduled tasks.

Key Takeaways:

  • CueAPI tracks 3 failure types: delivery failures (agent unreachable), execution failures (agent crashes), and outcome failures (agent reports failure)
  • Automatic retries with exponential backoff reduce false positives from temporary network issues
  • Email and webhook alerts notify you of confirmed failures
  • Evidence-based verification proves your agent actually completed business actions
  • Failed executions include error details and retry history for debugging

Traditional cron has no failure detection. It schedules, fires, and hopes. This tutorial shows you how to fix it with proper outcome tracking.

Step 1: Set Up Failure Detection with CueAPI

Create Your First Cue with Outcome Tracking

Replace your cron job with a CueAPI cue that tracks outcomes. Here's how to create a morning briefing agent with failure detection:

curl -X POST https://api.cueapi.ai/v1/cues \
  -H "Authorization: Bearer cue_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "morning-briefing",
    "description": "Daily pipeline health check",
    "schedule": {
      "type": "recurring",
      "cron": "0 9 * * *",
      "timezone": "America/New_York"
    },
    "transport": "webhook",
    "callback": {
      "url": "https://your-agent.com/briefing",
      "method": "POST",
      "headers": {"Authorization": "Bearer your_secret"}
    },
    "payload": {"task": "generate_briefing"},
    "retry": {
      "max_attempts": 3,
      "backoff_minutes": [1, 5, 15]
    },
    "on_failure": {
      "email": true,
      "webhook": null,
      "pause": false
    }
  }'

Python equivalent using httpx:

import httpx

response = httpx.post(
    "https://api.cueapi.ai/v1/cues",
    headers={"Authorization": "Bearer cue_sk_your_key"},
    json={
        "name": "morning-briefing",
        "description": "Daily pipeline health check",
        "schedule": {
            "type": "recurring",
            "cron": "0 9 * * *",
            "timezone": "America/New_York"
        },
        "transport": "webhook",
        "callback": {
            "url": "https://your-agent.com/briefing",
            "method": "POST",
            "headers": {"Authorization": "Bearer your_secret"}
        },
        "payload": {"task": "generate_briefing"},
        "retry": {
            "max_attempts": 3,
            "backoff_minutes": [1, 5, 15]
        },
        "on_failure": {
            "email": True,
            "webhook": None,
            "pause": False
        }
    }
)

cue = response.json()
print(f"Created cue: {cue['id']}")

Expected output:

{
  "id": "cue_abc123",
  "name": "morning-briefing",
  "status": "active",
  "next_run": "2024-03-25T13:00:00Z"
}

Configure Retry Logic and Timeout Settings

The retry configuration handles three failure types automatically:

  • Delivery failures: Your agent is unreachable. Network down, server crashed, wrong URL.
  • Execution timeouts: Your agent takes longer than expected to respond.
  • Outcome timeouts: Your agent responds but never reports success or failure.

⚠️ Warning: Set appropriate timeouts based on your agent's actual runtime. A data sync might take 10 minutes. A tweet generation takes 30 seconds.

📝 Note: Exponential backoff prevents overwhelming a failed service. First retry after 1 minute, second after 5 minutes, third after 15 minutes.
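
To make the schedule concrete, here is a minimal sketch that turns a backoff_minutes list into wall-clock attempt times. It assumes each delay is counted from the previous attempt, which matches the sample retry history shown in the alert email later in this tutorial. The helper name attempt_times is illustrative and not part of CueAPI.

```python
from datetime import datetime, timedelta

def attempt_times(first_attempt, backoff_minutes):
    """Return the time of the first attempt followed by each retry.

    Assumption: each delay is counted from the previous attempt.
    """
    times = [first_attempt]
    for delay in backoff_minutes:
        times.append(times[-1] + timedelta(minutes=delay))
    return times

schedule = attempt_times(datetime(2024, 3, 25, 9, 0), [1, 5, 15])
print([t.strftime("%H:%M") for t in schedule])
# prints ['09:00', '09:01', '09:06', '09:21']
```

With this schedule, a failure that persists through every attempt is confirmed about 21 minutes after the first attempt, so factor that delay into how quickly you expect alerts.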

Step 2: Implement Outcome Reporting in Your Agent

Add Success/Failure Reporting to Agent Code

Your agent must report outcomes. Success means the business action completed. Failure means it did not. Here's a Flask endpoint that reports outcomes:

from flask import Flask, request
import httpx

app = Flask(__name__)

@app.route('/briefing', methods=['POST'])
def handle_briefing():
    execution_id = request.headers.get('X-CueAPI-Execution-ID')
    
    try:
        # Your agent logic here
        briefing = generate_morning_briefing()
        email_sent = send_briefing_email(briefing)
        
        # Report success with evidence
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": True,
                "result": f"Briefing sent to {email_sent['recipient_count']} recipients",
                "metadata": {"email_id": email_sent['batch_id']},
                "summary": "Morning briefing delivered successfully"
            }
        )
        
        return {"status": "completed"}, 200
        
    except Exception as e:
        # Report failure with error details
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": False,
                "error": str(e),
                "result": "Briefing generation failed",
                "summary": f"Failed: {type(e).__name__}"
            }
        )
        
        return {"error": str(e)}, 500

Success indicator: Your agent responds with 200 AND reports outcome. Both required for success tracking.
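
Because the success and failure branches send nearly identical bodies, it may help to build the payload in one place. The sketch below is one possible helper, not an official client. The field names (success, summary, result, error, metadata) mirror the examples in this tutorial. Leaving out optional fields when they are empty is an assumption about what the endpoint accepts, and the sample values are made up.

```python
def build_outcome(success, summary, result=None, error=None, metadata=None):
    """Build the JSON body for an outcome report.

    Optional fields are omitted when not provided (assumption: the
    endpoint treats missing optional fields as absent).
    """
    if not success and not error:
        raise ValueError("a failed outcome should include an error message")
    body = {"success": success, "summary": summary}
    if result is not None:
        body["result"] = result
    if error is not None:
        body["error"] = error
    if metadata:
        body["metadata"] = metadata
    return body

# Hypothetical values for illustration only
print(build_outcome(
    True,
    "Morning briefing delivered successfully",
    result="Briefing sent to 12 recipients",
    metadata={"email_id": "batch_001"},
))
```

The returned dictionary can be passed as the json argument of the httpx.post calls shown above.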

Handle Edge Cases and Partial Failures

Real agents have partial failures. Data sync completes most records. Email campaign sends to most subscribers. Report these accurately:

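def handle_partial_success(execution_id):
    # execution_id comes from the X-CueAPI-Execution-ID request header;
    # sync_customer_data() is a placeholder for your own sync logic
    try:
        results = sync_customer_data()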
        
        if results['success_rate'] > 0.9:
            # Mostly successful
            httpx.post(
                f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
                headers={"Authorization": "Bearer cue_sk_your_key"},
                json={
                    "success": True,
                    "result": f"Synced {results['synced']} of {results['total']} records",
                    "metadata": {
                        "success_rate": results['success_rate'],
                        "failed_records": results['failed']
                    },
                    "summary": f"Data sync completed with {results['success_rate']:.1%} success rate"
                }
            )
        else:
            # Mostly failed
            httpx.post(
                f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
                headers={"Authorization": "Bearer cue_sk_your_key"},
                json={
                    "success": False,
                    "error": f"Low success rate: {results['success_rate']:.1%}",
                    "result": f"Only synced {results['synced']} of {results['total']} records",
                    "summary": "Data sync failed - too many record errors"
                }
            )
            
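    except Exception as e:
        # Complete failure: report it the same way as the earlier examples
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": False,
                "error": str(e),
                "summary": f"Data sync failed: {type(e).__name__}"
            }
        )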

📝 Note: Define your own success thresholds based on your business requirements and acceptable failure rates.
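
One way to keep the threshold decision testable is to separate it from the reporting call. The following sketch uses the same "greater than 90%" rule as the example above. The function name classify_sync and the handling of an empty batch are assumptions to adapt to your own requirements.

```python
def classify_sync(synced, total, threshold=0.9):
    """Return (success, success_rate) for a partially completed sync.

    Assumption: an empty batch (total <= 0) counts as a failure so that
    "nothing to sync" is surfaced instead of silently passing.
    """
    if total <= 0:
        return False, 0.0
    rate = synced / total
    return rate > threshold, rate

print(classify_sync(950, 1000))  # prints (True, 0.95)
print(classify_sync(900, 1000))  # prints (False, 0.9)
```

Note that a run at exactly the threshold is reported as a failure because the comparison is strict. Use >= if you want the boundary to count as success.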

Step 3: Configure Automatic Alert Notifications

Set Up Email Alerts for Failures

Email alerts notify you when retries are exhausted. Configure them when creating your cue:

cue_config = {
    "on_failure": {
        "email": True,
        "webhook": None,
        "pause": True  # Stop scheduling after failure
    }
}

You'll receive emails like this:

Subject: CueAPI Alert: morning-briefing failed

Execution cue_abc123_exec_456 failed after 3 retry attempts.

Agent: morning-briefing
Last error: Connection timeout after 30 seconds
Failed at: 2024-03-25 09:17:42 UTC
Retry history: 09:00 (timeout), 09:01 (timeout), 09:06 (timeout), 09:21 (exhausted)

View details: https://dashboard.cueapi.ai/executions/cue_abc123_exec_456

Add Webhook Notifications for Real-Time Monitoring

Send failure alerts to Slack, Discord, or your monitoring system:

cue_config = {
    "on_failure": {
        "email": True,
        "webhook": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
        "pause": False
    }
}

The webhook receives this payload:

{
  "event": "execution_failed",
  "cue_id": "cue_abc123",
  "execution_id": "cue_abc123_exec_456",
  "error": "Connection timeout after 30 seconds",
  "retry_count": 3,
  "failed_at": "2024-03-25T09:17:42Z",
  "next_attempt": null
}

⚠️ Warning: Webhook failures don't retry. Make your webhook endpoint reliable or use email as backup.
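
If you route alerts through your own relay instead of sending them straight to a chat tool, you can reshape the payload above into a readable message. The sketch below assumes the payload fields shown above and produces a body with a single text field, which is the shape Slack incoming webhooks accept. The function name to_slack_message is illustrative.

```python
def to_slack_message(alert):
    """Convert a failure alert payload into a body with a "text" field."""
    lines = [
        f"Cue {alert['cue_id']} failed after {alert['retry_count']} retries",
        f"Execution: {alert['execution_id']}",
        f"Error: {alert['error']}",
        f"Failed at: {alert['failed_at']}",
    ]
    # next_attempt is null once retries are exhausted
    if alert.get("next_attempt"):
        lines.append(f"Next attempt: {alert['next_attempt']}")
    return {"text": "\n".join(lines)}

sample_alert = {
    "event": "execution_failed",
    "cue_id": "cue_abc123",
    "execution_id": "cue_abc123_exec_456",
    "error": "Connection timeout after 30 seconds",
    "retry_count": 3,
    "failed_at": "2024-03-25T09:17:42Z",
    "next_attempt": None,
}
print(to_slack_message(sample_alert)["text"])
```

Keeping the relay this small reduces the chance that the relay itself becomes the unreliable link in your alerting path.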

Step 4: Monitor and Debug Failed Executions

Access Execution Logs via Dashboard

View failed executions in the CueAPI dashboard at https://dashboard.cueapi.ai/cues. Each failed execution shows error messages, retry attempt timeline, and outcome reporting attempts.

Query Failed Executions via API

Get failed executions programmatically:

curl "https://api.cueapi.ai/v1/executions?status=failed&cue_id=cue_abc123" \
  -H "Authorization: Bearer cue_sk_your_key"

Python equivalent using httpx:

import httpx

response = httpx.get(
    "https://api.cueapi.ai/v1/executions",
    headers={"Authorization": "Bearer cue_sk_your_key"},
    params={"status": "failed", "cue_id": "cue_abc123"}
)

failed_executions = response.json()
for execution in failed_executions['data']:
    print(f"Failed: {execution['id']} - {execution['error']}")

Expected output:

{
  "data": [
    {
      "id": "cue_abc123_exec_456",
      "status": "failed",
      "error": "Connection timeout after 30 seconds",
      "retry_count": 3,
      "failed_at": "2024-03-25T09:17:42Z"
    }
  ]
}
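
When a cue has many failed executions, grouping them by error message shows which problem to fix first. This is a small sketch that works on the response shape shown above. The extra sample records and the helper name summarize_failures are made up for illustration.

```python
from collections import Counter

def summarize_failures(executions):
    """Count failed executions per error message, most common first."""
    counts = Counter(e.get("error", "unknown error") for e in executions)
    return counts.most_common()

# Sample data in the response shape shown above (extra records are hypothetical)
sample_response = {
    "data": [
        {"id": "cue_abc123_exec_456", "status": "failed",
         "error": "Connection timeout after 30 seconds"},
        {"id": "cue_abc123_exec_457", "status": "failed",
         "error": "Connection timeout after 30 seconds"},
        {"id": "cue_abc123_exec_458", "status": "failed",
         "error": "HTTP 500 from agent"},
    ]
}

for error, count in summarize_failures(sample_response["data"]):
    print(f"{count}x {error}")
```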

Step 5: Beyond Basic Detection: Evidence-Based Verification

Report Detailed Outcomes for Business Actions

Your agent says it tweeted. Include the tweet ID in your outcome:

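def post_marketing_tweet(execution_id):
    # execution_id comes from the X-CueAPI-Execution-ID request header;
    # twitter_api is a placeholder for your own Twitter client
    try:
        # Send the tweet
        tweet = twitter_api.create_tweet("Your marketing message here")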
        
        # Report success with evidence
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": True,
                "result": f"Marketing tweet posted: {tweet['id']}",
                "metadata": {
                    "tweet_id": tweet['id'],
                    "tweet_url": f"https://twitter.com/yourcompany/status/{tweet['id']}"
                },
                "summary": "Daily marketing content published"
            }
        )
        
    except Exception as e:
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": False,
                "error": str(e),
                "summary": f"Tweet failed: {type(e).__name__}"
            }
        )

Verify Agent Success with Real Proof

Include proof the business action happened in your outcome metadata:

  • Tweet ID: Proves the tweet exists
  • Email batch ID: Proves emails were queued
  • Stripe charge ID: Proves payment was processed
  • File URL: Proves the report was generated

This separates "agent says it worked" from "agent proved it worked" with verifiable evidence.
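
A simple guard can stop an agent from reporting success without the evidence listed above. The sketch below checks outcome metadata for required keys before you send the report. The tweet_id and email_id keys come from the earlier examples, while the task type names and the charge_id and file_url keys are hypothetical.

```python
# Evidence each task type must include in its outcome metadata.
# Task type names, charge_id, and file_url are illustrative assumptions.
REQUIRED_EVIDENCE = {
    "tweet": ["tweet_id"],
    "email": ["email_id"],
    "payment": ["charge_id"],
    "report": ["file_url"],
}

def missing_evidence(task_type, metadata):
    """Return the evidence keys that are absent or empty in metadata."""
    required = REQUIRED_EVIDENCE.get(task_type, [])
    metadata = metadata or {}
    return [key for key in required if not metadata.get(key)]

print(missing_evidence("tweet", {"tweet_id": "1234567890"}))  # prints []
print(missing_evidence("payment", {}))  # prints ['charge_id']
```

If the returned list is not empty, one option is to report the outcome as a failure with the missing keys named in the error field.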

Common Failure Detection Patterns

Timeout vs Execution vs Outcome Failures

CueAPI tracks distinct failure types with different retry strategies:

Delivery failures: Agent unreachable due to network or server issues

Execution failures: Agent crashes or returns error status

Outcome failures: Agent doesn't report success/failure within expected timeframe
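
The three definitions above can be read as a decision order: reachability first, then the HTTP response, then the reported outcome. The sketch below encodes that order as a plain function. It is an illustration of the categories, not CueAPI's internal logic, and the parameter names are assumptions.

```python
def classify_failure(reachable, http_status, outcome):
    """Map one execution attempt to a failure category, or None on success.

    reachable:   False if the agent could not be contacted at all
    http_status: response status code, or None if the agent timed out
    outcome:     True/False as reported by the agent, or None if it
                 never reported within the expected timeframe
    """
    if not reachable:
        return "delivery"
    if http_status is None or http_status >= 400:
        return "execution"
    if outcome is not True:
        return "outcome"
    return None

print(classify_failure(False, None, None))  # prints delivery
print(classify_failure(True, 500, None))    # prints execution
print(classify_failure(True, 200, None))    # prints outcome
print(classify_failure(True, 200, True))    # prints None
```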

Retry Exhaustion vs Permanent Failures

Transient failures: Temporary issues that usually resolve on retry. Retries are exhausted only when the issue persists through every attempt.

  • Network timeouts
  • Rate limiting
  • Service temporarily unavailable

Permanent failures: Code issues that need fixes

  • Authentication errors
  • Invalid API endpoints
  • Logic errors in agent code

Configure different retry strategies for different failure patterns. Network issues get 3 retries. Authentication errors get 1 retry.

📝 Note: See the exponential backoff reference under Sources for retry timing best practices. Start with 1 minute, then 5 minutes, then 15 minutes.
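
As a rough illustration of the split between temporary and permanent failures, the sketch below maps an HTTP status from a downstream service to a suggested retry count. It assumes your agent talks to its dependencies over HTTP. The status code groupings and the function name are illustrative choices, and the counts of 3 and 1 follow the guidance above.

```python
# Status codes that often indicate a temporary problem (assumed grouping)
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}
# Status codes that usually need a code or config fix (assumed grouping)
PERMANENT_STATUSES = {400, 401, 403, 404}

def suggested_retries(status):
    """Suggest a retry count for a failure.

    status is the HTTP status code, or None for a network-level
    failure such as a connection timeout.
    """
    if status is None or status in TRANSIENT_STATUSES:
        return 3
    if status in PERMANENT_STATUSES:
        return 1
    return 1  # unknown errors: stay conservative

print(suggested_retries(None))  # prints 3
print(suggested_retries(503))   # prints 3
print(suggested_retries(401))   # prints 1
```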

This approach eliminates the accountability gap between your agent running and you knowing it worked. Your agents now report their work. You know they succeeded. You can trust your infrastructure and get back to building.

Try it yourself. Free tier available. Sign up for CueAPI.

Frequently Asked Questions

How quickly does CueAPI detect failures?

CueAPI detects failures based on your configured timeouts and retry settings. Set appropriate timeouts based on your agent's expected runtime.

What's the difference between execution failure and outcome failure?

Execution failure means your agent crashed, returned an error status, or timed out. Outcome failure means your agent responded successfully but never called the outcome endpoint to report success or failure.

Can I customize retry logic for different types of failures?

Yes, at the cue level. Configure max_attempts and backoff_minutes per cue. Use fewer retries for cues whose failures tend to be permanent (such as authentication errors) and more retries for cues prone to temporary network timeouts.

How do I prevent false positives from network blips?

Use exponential backoff with at least 2 retry attempts. Set appropriate timeout values for network requests. Configure realistic timeouts longer than your agent's typical runtime plus a buffer.
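
A rough way to pick a timeout with a buffer is to start from recent runtimes. The sketch below takes the slowest recent run and adds a percentage on top. The 50% default buffer, the sample runtimes, and the helper name suggest_timeout are all assumptions to tune for your agent.

```python
import math

def suggest_timeout(recent_runtimes_seconds, buffer_fraction=0.5):
    """Suggest a timeout: slowest recent runtime plus a buffer, rounded up."""
    if not recent_runtimes_seconds:
        raise ValueError("need at least one observed runtime")
    slowest = max(recent_runtimes_seconds)
    return math.ceil(slowest * (1 + buffer_fraction))

# Hypothetical runtimes in seconds from recent successful runs
print(suggest_timeout([22, 25, 31, 28]))  # prints 47
```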

What happens to scheduled executions when a cue fails?

By default, future executions continue running. Set "pause": true in on_failure to stop scheduling after failure. You can resume the cue manually after fixing the issue. Use this for critical agents where continued failures waste resources.

Sources

  • CueAPI REST API: Complete scheduling API for AI agents: https://docs.cueapi.ai/api-reference/overview/
  • Cron specification: Interactive cron expression builder: https://crontab.guru/
  • Webhook best practices: Security and reliability guidelines: https://webhooks.fyi/best-practices/
  • Exponential backoff: Mathematical approach to retry timing: https://en.wikipedia.org/wiki/Exponential_backoff

About the author: Govind Kavaturi is co-founder of Vector, a portfolio of AI-native products. He believes the next phase of the internet is built for agents, not humans.

