Detect Cron Job Failures Automatically

Q: How quickly does CueAPI detect failures?

CueAPI detects delivery failures within 30 seconds (configurable timeout). Outcome failures are detected when the outcome deadline expires (default 300 seconds). Total detection time: under 6 minutes for most failure types.

Q: Can I customize retry logic for different types of failures?

Yes. Configure `max_attempts` and `backoff_minutes` per cue. Use fewer retries for authentication errors (permanent) and more retries for network timeouts (temporary). Set different outcome deadlines based on agent runtime.

Your agent runs every morning at 9 AM. It pulls data, processes it, and updates your dashboard. Yesterday it failed silently. You found out at 3 PM when a customer asked why their numbers were wrong. This is how to detect cron job failures automatically before your users do.

Most cron systems fire tasks and forget about them. They have no concept of success or failure. Your agent might crash, time out, or complete with errors. The scheduler marks it as "ran" and moves on. You need outcome tracking, not just execution tracking.

TL;DR: Replace platform cron jobs with CueAPI cues that track delivery, outcomes, and failures. Configure automatic retries and alerts. Your agents report success or failure. You get reliable failure detection for your scheduled tasks.

Key Takeaways:
- CueAPI tracks 3 failure types: delivery failures (agent unreachable), execution failures (agent crashes), and outcome failures (agent reports failure)
- Automatic retries with exponential backoff reduce false positives from temporary network issues
- Email and webhook alerts notify you of confirmed failures
- Evidence-based verification proves your agent actually completed business actions
- Failed executions include error details and retry history for debugging

Traditional cron has no failure detection. It schedules, fires, and hopes. This tutorial shows you how to fix it with proper outcome tracking.

Step 1: Set Up Failure Detection with CueAPI

Create Your First Cue with Outcome Tracking

Replace your cron job with a CueAPI cue that tracks outcomes. Here's how to create a morning briefing agent with failure detection:

curl -X POST https://api.cueapi.ai/v1/cues \
  -H "Authorization: Bearer cue_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "morning-briefing",
    "description": "Daily pipeline health check",
    "schedule": {
      "type": "recurring",
      "cron": "0 9 * * *",
      "timezone": "America/New_York"
    },
    "transport": "webhook",
    "callback": {
      "url": "https://your-agent.com/briefing",
      "method": "POST",
      "headers": {"Authorization": "Bearer your_secret"}
    },
    "payload": {"task": "generate_briefing"},
    "retry": {
      "max_attempts": 3,
      "backoff_minutes": [1, 5, 15]
    },
    "on_failure": {
      "email": true,
      "webhook": null,
      "pause": false
    }
  }'

Python equivalent using httpx:

import httpx

response = httpx.post(
    "https://api.cueapi.ai/v1/cues",
    headers={"Authorization": "Bearer cue_sk_your_key"},
    json={
        "name": "morning-briefing",
        "description": "Daily pipeline health check",
        "schedule": {
            "type": "recurring",
            "cron": "0 9 * * *",
            "timezone": "America/New_York"
        },
        "transport": "webhook",
        "callback": {
            "url": "https://your-agent.com/briefing",
            "method": "POST",
            "headers": {"Authorization": "Bearer your_secret"}
        },
        "payload": {"task": "generate_briefing"},
        "retry": {
            "max_attempts": 3,
            "backoff_minutes": [1, 5, 15]
        },
        "on_failure": {
            "email": True,
            "webhook": None,
            "pause": False
        }
    }
)

response.raise_for_status()  # Fail loudly if the API returns a 4xx/5xx response
cue = response.json()
print(f"Created cue: {cue['id']}")

Expected output:

```json
{
  "id": "cue_abc123",
  "name": "morning-briefing",
  "status": "active",
  "next_run": "2024-03-25T13:00:00Z"
}
```
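The `next_run` value in the response is in UTC, which is why a 9 AM `America/New_York` schedule shows up as 13:00Z while daylight saving time is in effect. Here is a minimal sketch for checking that conversion with the Python standard library. The `to_local` helper is illustrative and not part of CueAPI:

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def to_local(next_run: str, timezone: str) -> datetime:
    """Convert an ISO 8601 UTC timestamp such as '2024-03-25T13:00:00Z' to local time."""
    # Older Python versions do not accept a trailing 'Z', so normalize it first.
    utc_time = datetime.fromisoformat(next_run.replace("Z", "+00:00"))
    return utc_time.astimezone(ZoneInfo(timezone))


local = to_local("2024-03-25T13:00:00Z", "America/New_York")
print(local.strftime("%Y-%m-%d %H:%M %Z"))  # 2024-03-25 09:00 EDT
```

If the local hour is not the one you scheduled, check the `timezone` field of the cue before anything else.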

Configure Retry Logic and Timeout Settings

The retry configuration handles three failure types automatically:

Delivery failures: Your agent is unreachable. Network down, server crashed, wrong URL.
Execution timeouts: Your agent takes longer than expected to respond.
Outcome timeouts: Your agent responds but never reports success or failure.

⚠️ Warning: Set timeouts based on your agent's actual runtime. A data sync might take 10 minutes, while generating a tweet might take 30 seconds.

📝 Note: Exponential backoff prevents overwhelming a failing service. With `[1, 5, 15]`, the first retry fires 1 minute after the initial attempt, the second 5 minutes after the first retry, and the third 15 minutes after the second.
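To see when each attempt would fire, you can compute the schedule yourself. This is a rough sketch that assumes each delay is counted from the previous attempt, which matches the retry history in the sample alert email later in this tutorial. The `retry_schedule` helper is hypothetical, not a CueAPI function:

```python
from datetime import datetime, timedelta


def retry_schedule(first_attempt: datetime, backoff_minutes: list[int]) -> list[datetime]:
    """Return the time of the initial attempt followed by the time of each retry.

    Assumption: every delay is measured from the previous attempt.
    """
    times = [first_attempt]
    for delay in backoff_minutes:
        times.append(times[-1] + timedelta(minutes=delay))
    return times


schedule = retry_schedule(datetime(2024, 3, 25, 9, 0), [1, 5, 15])
print([t.strftime("%H:%M") for t in schedule])  # ['09:00', '09:01', '09:06', '09:21']
```

With these settings, a failure is only confirmed about 21 minutes after the first attempt. Keep that in mind when you decide how fast you need an alert.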

Step 2: Implement Outcome Reporting in Your Agent

Add Success/Failure Reporting to Agent Code

Your agent must report outcomes. Success means the business action completed. Failure means it did not. Here's a Flask endpoint that reports outcomes:

from flask import Flask, request
import httpx

app = Flask(__name__)

@app.route('/briefing', methods=['POST'])
def handle_briefing():
    execution_id = request.headers.get('X-CueAPI-Execution-ID')
    
    try:
        # Your agent logic here
        briefing = generate_morning_briefing()
        email_sent = send_briefing_email(briefing)
        
        # Report success with evidence
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": True,
                "result": f"Briefing sent to {email_sent['recipient_count']} recipients",
                "metadata": {"email_id": email_sent['batch_id']},
                "summary": "Morning briefing delivered successfully"
            }
        )
        
        return {"status": "completed"}, 200
        
    except Exception as e:
        # Report failure with error details
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": False,
                "error": str(e),
                "result": "Briefing generation failed",
                "summary": f"Failed: {type(e).__name__}"
            }
        )
        
        return {"error": str(e)}, 500

✅ Success indicator: Your agent responds with 200 AND reports an outcome. Both are required for success tracking.
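The handler above builds the outcome request twice. One way to cut the duplication is a pair of small helpers. This is a sketch, not an official client: `build_outcome` and `report_outcome` are hypothetical names, the body fields are the ones used in the handler above, and `post` is any callable with the same keyword arguments as `httpx.post`:

```python
from typing import Any, Callable, Optional

OUTCOME_URL = "https://api.cueapi.ai/v1/executions/{execution_id}/outcome"


def build_outcome(success: bool, result: str, summary: str,
                  error: Optional[str] = None,
                  metadata: Optional[dict] = None) -> dict:
    """Build the outcome body, leaving out optional fields that were not given."""
    outcome = {"success": success, "result": result, "summary": summary}
    if error is not None:
        outcome["error"] = error
    if metadata is not None:
        outcome["metadata"] = metadata
    return outcome


def report_outcome(execution_id: str, outcome: dict, api_key: str,
                   post: Callable[..., Any]) -> Any:
    """Send the outcome using an httpx.post-compatible callable."""
    return post(
        OUTCOME_URL.format(execution_id=execution_id),
        headers={"Authorization": f"Bearer {api_key}"},
        json=outcome,
    )
```

In the Flask handler you would pass `httpx.post` as `post`. Passing the callable in also lets you test the handler without making network calls.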

Handle Edge Cases and Partial Failures

Real agents have partial failures. Data sync completes most records. Email campaign sends to most subscribers. Report these accurately:

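def handle_partial_success(execution_id):
    # execution_id comes from the X-CueAPI-Execution-ID request header (see the Flask handler above)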
    try:
        results = sync_customer_data()
        
        if results['success_rate'] > 0.9:
            # Mostly successful
            httpx.post(
                f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
                headers={"Authorization": "Bearer cue_sk_your_key"},
                json={
                    "success": True,
                    "result": f"Synced {results['synced']} of {results['total']} records",
                    "metadata": {
                        "success_rate": results['success_rate'],
                        "failed_records": results['failed']
                    },
                    "summary": f"Data sync completed with {results['success_rate']:.1%} success rate"
                }
            )
        else:
            # Mostly failed
            httpx.post(
                f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
                headers={"Authorization": "Bearer cue_sk_your_key"},
                json={
                    "success": False,
                    "error": f"Low success rate: {results['success_rate']:.1%}",
                    "result": f"Only synced {results['synced']} of {results['total']} records",
                    "summary": "Data sync failed - too many record errors"
                }
            )
            
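    except Exception as e:
        # Complete failure: report the exception itself
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": False,
                "error": str(e),
                "summary": f"Data sync failed: {type(e).__name__}"
            }
        )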

📝 Note: Define your own success thresholds based on your business requirements and acceptable failure rates.
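The threshold decision is easier to test when it lives in a pure function. The sketch below assumes you only have raw counts. `summarize_sync` is a hypothetical helper, `failed_records` is a count here, and treating an empty batch as success is an assumption you may want to reverse:

```python
def summarize_sync(synced: int, total: int, threshold: float = 0.9) -> dict:
    """Turn raw sync counts into an outcome body.

    The 0.9 default mirrors the example above; pick a threshold that fits your data.
    Assumption: an empty batch (total == 0) counts as a success.
    """
    success_rate = synced / total if total else 1.0
    outcome = {
        "success": success_rate > threshold,
        "result": f"Synced {synced} of {total} records",
        "metadata": {"success_rate": success_rate, "failed_records": total - synced},
        "summary": f"Data sync finished with {success_rate:.1%} success rate",
    }
    if not outcome["success"]:
        outcome["error"] = f"Low success rate: {success_rate:.1%}"
    return outcome


print(summarize_sync(950, 1000)["summary"])  # Data sync finished with 95.0% success rate
print(summarize_sync(700, 1000)["error"])    # Low success rate: 70.0%
```

Because the comparison is strictly greater than, a run that lands exactly on the threshold is reported as a failure.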

Step 3: Configure Automatic Alert Notifications

Set Up Email Alerts for Failures

Email alerts notify you when retries are exhausted. Configure them when creating your cue:

cue_config = {
    "on_failure": {
        "email": True,
        "webhook": None,
        "pause": True  # Stop scheduling after failure
    }
}

You'll receive emails like this:

Subject: CueAPI Alert: morning-briefing failed

Execution cue_abc123_exec_456 failed after 3 retry attempts.

Agent: morning-briefing
Last error: Connection timeout after 30 seconds
Failed at: 2024-03-25 09:17:42 UTC
Retry history: 09:00 (timeout), 09:01 (timeout), 09:06 (timeout), 09:21 (exhausted)

View details: https://dashboard.cueapi.ai/executions/cue_abc123_exec_456

Add Webhook Notifications for Real-Time Monitoring

Send failure alerts to Slack, Discord, or your monitoring system:

cue_config = {
    "on_failure": {
        "email": True,
        "webhook": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
        "pause": False
    }
}

The webhook receives this payload:

{
  "event": "execution_failed",
  "cue_id": "cue_abc123",
  "execution_id": "cue_abc123_exec_456",
  "error": "Connection timeout after 30 seconds",
  "retry_count": 3,
  "failed_at": "2024-03-25T09:17:42Z",
  "next_attempt": null
}

⚠️ Warning: Failed webhook deliveries are not retried. Make your webhook endpoint reliable or keep email enabled as a backup.
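If you point the webhook at your own service instead of directly at a chat tool, you will probably want to turn the payload into a readable message first. Here is a minimal sketch that assumes the payload fields shown above. `format_alert` is an illustrative helper, not part of CueAPI:

```python
def format_alert(payload: dict) -> str:
    """Render the failure webhook payload as a one-line alert."""
    next_attempt = payload.get("next_attempt") or "none scheduled"
    return (
        f"[{payload['event']}] cue {payload['cue_id']} "
        f"(execution {payload['execution_id']}) failed after "
        f"{payload['retry_count']} retries at {payload['failed_at']}: "
        f"{payload['error']} (next attempt: {next_attempt})"
    )


sample = {
    "event": "execution_failed",
    "cue_id": "cue_abc123",
    "execution_id": "cue_abc123_exec_456",
    "error": "Connection timeout after 30 seconds",
    "retry_count": 3,
    "failed_at": "2024-03-25T09:17:42Z",
    "next_attempt": None,
}
print(format_alert(sample))
```

From there you can forward the string to Slack, Discord, or your pager in whatever format that service expects.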

Step 4: Monitor and Debug Failed Executions

Access Execution Logs via Dashboard

View failed executions in the CueAPI dashboard at https://dashboard.cueapi.ai/cues. Each failed execution shows error messages, retry attempt timeline, and outcome reporting attempts.

Query Failed Executions via API

Get failed executions programmatically:

curl "https://api.cueapi.ai/v1/executions?status=failed&cue_id=cue_abc123" \
  -H "Authorization: Bearer cue_sk_your_key"

Python equivalent using httpx:

import httpx

response = httpx.get(
    "https://api.cueapi.ai/v1/executions",
    headers={"Authorization": "Bearer cue_sk_your_key"},
    params={"status": "failed", "cue_id": "cue_abc123"}
)

failed_executions = response.json()
for execution in failed_executions['data']:
    print(f"Failed: {execution['id']} - {execution['error']}")

Expected output:

```json
{
  "data": [
    {
      "id": "cue_abc123_exec_456",
      "status": "failed",
      "error": "Connection timeout after 30 seconds",
      "retry_count": 3,
      "failed_at": "2024-03-25T09:17:42Z"
    }
  ]
}
```
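Once you have more than a handful of failures, grouping them by error message shows which problem to fix first. This sketch assumes the response shape shown above. The `count_errors` helper and the extra sample executions are made up for illustration:

```python
from collections import Counter


def count_errors(response_body: dict) -> Counter:
    """Count failed executions per error message."""
    return Counter(execution["error"] for execution in response_body["data"])


# Illustrative data in the same shape as the API response above.
body = {
    "data": [
        {"id": "cue_abc123_exec_456", "status": "failed",
         "error": "Connection timeout after 30 seconds"},
        {"id": "cue_abc123_exec_457", "status": "failed",
         "error": "Connection timeout after 30 seconds"},
        {"id": "cue_abc123_exec_458", "status": "failed",
         "error": "HTTP 500 from agent"},
    ]
}

for error, count in count_errors(body).most_common():
    print(f"{count}x {error}")
```

In real use you would pass `response.json()` from the query above instead of the hard-coded `body`.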

Step 5: Beyond Basic Detection: Evidence-Based Verification

Report Detailed Outcomes for Business Actions

Your agent says it tweeted. Include the tweet ID in your outcome:

def post_marketing_tweet(execution_id):
    try:
        # Send the tweet (twitter_api stands in for your own Twitter/X client)
        tweet = twitter_api.create_tweet("Your marketing message here")
        
        # Report success with evidence
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": True,
                "result": f"Marketing tweet posted: {tweet['id']}",
                "metadata": {
                    "tweet_id": tweet['id'],
                    "tweet_url": f"https://twitter.com/yourcompany/status/{tweet['id']}"
                },
                "summary": "Daily marketing content published"
            }
        )
        
    except Exception as e:
        httpx.post(
            f"https://api.cueapi.ai/v1/executions/{execution_id}/outcome",
            headers={"Authorization": "Bearer cue_sk_your_key"},
            json={
                "success": False,
                "error": str(e),
                "summary": f"Tweet failed: {type(e).__name__}"
            }
        )

Verify Agent Success with Real Proof

Include proof the business action happened in your outcome metadata:

Tweet ID: Proves the tweet exists
Email batch ID: Proves emails were queued
Stripe charge ID: Proves payment was processed
File URL: Proves the report was generated

This separates "agent says it worked" from "agent proved it worked" with verifiable evidence.
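You can enforce this in your own code before the outcome is sent. The sketch below refuses to call a success "proven" unless the expected evidence keys are present. `missing_evidence` and the list of required keys are hypothetical, so choose keys that match the action your agent performs:

```python
def missing_evidence(outcome: dict, required_keys: list[str]) -> list[str]:
    """Return the evidence keys that a success outcome fails to provide."""
    if not outcome.get("success"):
        return []  # Failures carry an error message rather than evidence.
    metadata = outcome.get("metadata") or {}
    return [key for key in required_keys if not metadata.get(key)]


outcome = {
    "success": True,
    "result": "Marketing tweet posted: 12345",
    "metadata": {"tweet_id": "12345"},
}
print(missing_evidence(outcome, ["tweet_id", "tweet_url"]))  # ['tweet_url']
```

If the returned list is not empty, one option is to report the run as a failure instead, so that a missing ID cannot pass as a completed action.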

Common Failure Detection Patterns

Timeout vs Execution vs Outcome Failures

CueAPI tracks distinct failure types with different retry strategies:

Delivery failures: Agent unreachable due to network or server issues

Execution failures: Agent crashes or returns error status

Outcome failures: Agent doesn't report success/failure within expected timeframe
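The three definitions above translate into a simple decision order. This sketch shows how you might label attempts in your own logs. It is not CueAPI's internal logic, and `classify_failure` and its inputs are assumptions made for illustration:

```python
from typing import Optional


def classify_failure(status_code: Optional[int], outcome_reported: bool) -> Optional[str]:
    """Map one attempt to the failure types described above.

    status_code is None when the agent could not be reached at all.
    """
    if status_code is None:
        return "delivery"   # agent unreachable
    if status_code >= 400:
        return "execution"  # agent crashed or returned an error status
    if not outcome_reported:
        return "outcome"    # agent responded but never reported an outcome
    return None             # no failure


print(classify_failure(None, False))  # delivery
print(classify_failure(500, False))   # execution
print(classify_failure(200, False))   # outcome
print(classify_failure(200, True))    # None
```

The order matters: an unreachable agent can never report an outcome, so check delivery first.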

Retry Exhaustion vs Permanent Failures

Retry exhaustion: Temporary issues that usually resolve on their own but occasionally outlast every retry

Network timeouts
Rate limiting
Service temporarily unavailable

Permanent failures: Code issues that need fixes

Authentication errors
Invalid API endpoints
Logic errors in agent code

Configure different retry strategies for different failure patterns. Retry settings apply per cue, so give cues that mostly hit network issues 3 retries and cues that mostly hit authentication errors 1 retry.

📝 Note: See the exponential backoff entry under Sources for retry timing best practices. Start with 1 minute, then 5 minutes, then 15 minutes.
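One way to keep these choices consistent across cues is a small table of retry profiles. This is a sketch under assumptions: the profile names and `with_retry_profile` are hypothetical, and the single-entry `backoff_minutes` list for the one-retry profile is a guess at the expected shape:

```python
RETRY_PROFILES = {
    # Temporary problems: worth several spaced-out attempts.
    "network": {"max_attempts": 3, "backoff_minutes": [1, 5, 15]},
    # Permanent problems: retrying rarely helps, so fail fast.
    "auth": {"max_attempts": 1, "backoff_minutes": [1]},
}


def with_retry_profile(cue: dict, dominant_failure: str) -> dict:
    """Return a copy of a cue definition with the matching retry block."""
    return {**cue, "retry": dict(RETRY_PROFILES[dominant_failure])}


cue = with_retry_profile({"name": "morning-briefing"}, "auth")
print(cue["retry"])  # {'max_attempts': 1, 'backoff_minutes': [1]}
```

The resulting dictionary can be merged into the JSON body you send when creating the cue.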

This approach eliminates the accountability gap between your agent running and you knowing it worked. Your agents now report their work. You know they succeeded. You can trust your infrastructure and get back to building.

Try it yourself. Free tier available. Sign up for CueAPI.

Frequently Asked Questions

How quickly does CueAPI detect failures?

CueAPI detects failures based on your configured timeouts and retry settings. Set appropriate timeouts based on your agent's expected runtime.

What's the difference between execution failure and outcome failure?

Execution failure means your agent crashed, returned an error status, or timed out. Outcome failure means your agent responded successfully but never called the outcome endpoint to report success or failure.

Can I customize retry logic for different types of failures?

Yes. Configure `max_attempts` and `backoff_minutes` per cue. Use fewer retries for authentication errors (permanent) and more retries for network timeouts (temporary).

How do I prevent false positives from network blips?

Use exponential backoff with at least 2 retry attempts. Set appropriate timeout values for network requests. Configure realistic timeouts longer than your agent's typical runtime plus a buffer.

What happens to scheduled executions when a cue fails?

By default, future executions continue running. Set `"pause": true` in `on_failure` to stop scheduling after failure. You can resume the cue manually after fixing the issue. Use this for critical agents where continued failures waste resources.

Sources

CueAPI REST API: Complete scheduling API for AI agents: https://docs.cueapi.ai/api-reference/overview/
Cron specification: Interactive cron expression builder: https://crontab.guru/
Webhook best practices: Security and reliability guidelines: https://webhooks.fyi/best-practices/
Exponential backoff: Mathematical approach to retry timing: https://en.wikipedia.org/wiki/Exponential_backoff

About the author: Govind Kavaturi is co-founder of Vector, a portfolio of AI-native products. He believes the next phase of the internet is built for agents, not humans.
