How do I monitor Multi-Protocol Gateway service health?
DataPower Multi-Protocol Gateway (MPG) services are the core runtime components processing integration traffic (REST APIs, SOAP web services, EDI X12/EDIFACT, MQ messages). Service health monitoring detects crashes, manual stops, and configuration issues before they impact business operations.
Service Health Monitoring via SOMA API
The Nodinite DataPower Monitoring Agent polls service status using SOMA (SOAP Management) XML Management Interface.
Step 1: Create Service Resource in Nodinite
- Navigate: Nodinite Web Client → Repository → Monitoring Resources
- Create New Resource:
  - Resource type: Service
  - DataPower appliance: Prod-Primary (or your appliance name)
  - Domain: TradingPartner (DataPower domain hosting the service)
  - Service name: TradingPartner-MPG (exact service name as configured in DataPower)
  - Service class: MultiProtocolGateway (DataPower object class)
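For reference, the fields above map onto a simple resource definition. The sketch below is illustrative only (it is not the Nodinite API); the class name and structure are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ServiceResource:
    """Illustrative model of the fields the monitoring resource needs (not Nodinite's API)."""
    appliance: str      # DataPower appliance name, e.g. "Prod-Primary"
    domain: str         # DataPower application domain hosting the service
    service_name: str   # Exact service name as configured in DataPower
    service_class: str = "MultiProtocolGateway"  # DataPower object class

# Example matching the values above
resource = ServiceResource(
    appliance="Prod-Primary",
    domain="TradingPartner",
    service_name="TradingPartner-MPG",
)
```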
Step 2: Configure Agent Polling Interval
- Set polling frequency:
  - Default: 5 minutes (288 health checks per day)
  - High-priority services: 1 minute (1,440 health checks per day, faster failure detection)
  - Low-priority development services: 15 minutes (96 health checks per day, reduced network overhead)
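The health-check counts above follow directly from the polling interval:

```python
# Health checks per day for a given polling interval (in minutes)
def checks_per_day(interval_minutes: int) -> int:
    return 24 * 60 // interval_minutes

print(checks_per_day(5))    # 288   (default)
print(checks_per_day(1))    # 1440  (high-priority)
print(checks_per_day(15))   # 96    (low-priority development)
```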
Step 3: SOMA API Request/Response
The agent sends a SOMA XML request at each polling interval (every 5 minutes by default):
```xml
<dp:request domain="TradingPartner">
  <dp:get-status class="MultiProtocolGateway"/>
  <dp:filter>TradingPartner-MPG</dp:filter>
</dp:request>
```
DataPower responds with service status:
```xml
<dp:response>
  <dp:status class="MultiProtocolGateway">
    <Name>TradingPartner-MPG</Name>
    <OpState>up</OpState>
    <AdminState>enabled</AdminState>
    <ConfigState>saved</ConfigState>
    <QuiesceState>normal</QuiesceState>
  </dp:status>
</dp:response>
```
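For reference, a minimal Python sketch of this poll is shown below. It assumes the appliance exposes the SOMA XML Management Interface at its default port 5550 under /service/mgmt/current and that a monitoring account with basic-auth credentials exists; the host name, credentials, and certificate path are placeholders, and this is not the Nodinite agent's actual implementation. The request body mirrors the example above.

```python
# Minimal SOMA status poll (sketch). Host, credentials, and TLS settings are placeholders.
import requests

SOMA_URL = "https://prod-primary.example.com:5550/service/mgmt/current"  # assumed host

SOAP_BODY = """<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"
              xmlns:dp="http://www.datapower.com/schemas/management">
  <env:Body>
    <dp:request domain="TradingPartner">
      <dp:get-status class="MultiProtocolGateway"/>
      <dp:filter>TradingPartner-MPG</dp:filter>
    </dp:request>
  </env:Body>
</env:Envelope>"""

response = requests.post(
    SOMA_URL,
    data=SOAP_BODY,
    auth=("monitor-user", "monitor-password"),  # placeholder credentials
    headers={"Content-Type": "text/xml"},
    verify="datapower-ca.pem",                  # placeholder path to the appliance CA bundle
    timeout=30,
)
response.raise_for_status()
print(response.text)  # SOAP envelope wrapping the dp:response shown above
```

In practice this call would be scheduled at the polling interval configured in Step 2.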
Step 4: OpState Values and Meanings
The agent parses the `<OpState>` element to determine service health:
| OpState Value | Meaning | Typical Causes |
|---|---|---|
| up | Service running normally | Healthy state, processing traffic |
| down | Service crashed/failed | OutOfMemoryError, configuration error, backend unreachable |
| stopped | Service manually disabled | Administrator disabled via WebGUI, planned maintenance |
| starting | Service initializing | Appliance rebooting, service recently enabled (transient state) |
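Continuing the example, here is a hedged sketch of extracting OpState from the response payload using Python's standard library; the severity mapping is illustrative, not the agent's exact rules.

```python
import xml.etree.ElementTree as ET

# Sample payload mirroring the dp:response shown in Step 3 (namespace declared
# here so the fragment parses on its own).
SAMPLE_RESPONSE = """<dp:response xmlns:dp="http://www.datapower.com/schemas/management">
  <dp:status class="MultiProtocolGateway">
    <Name>TradingPartner-MPG</Name>
    <OpState>up</OpState>
    <AdminState>enabled</AdminState>
    <ConfigState>saved</ConfigState>
    <QuiesceState>normal</QuiesceState>
  </dp:status>
</dp:response>"""

DP_NS = "{http://www.datapower.com/schemas/management}"

# Illustrative mapping of OpState to alert severity (not the agent's exact rules).
SEVERITY = {"up": "ok", "down": "error", "stopped": "warning", "starting": "warning"}

def op_state(response_xml: str, service_name: str) -> str:
    """Return the OpState for the named service, or 'unknown' if it is absent."""
    root = ET.fromstring(response_xml)
    for status in root.iter(f"{DP_NS}status"):
        if status.findtext("Name") == service_name:
            return status.findtext("OpState", default="unknown")
    return "unknown"

state = op_state(SAMPLE_RESPONSE, "TradingPartner-MPG")
print(state, "->", SEVERITY.get(state, "error"))   # up -> ok
```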
Step 5: Threshold Evaluation
The agent compares the actual OpState against the configured expected state (a code sketch of this evaluation follows the scenarios below):
Scenario 1: Service crashed unexpectedly
- Expected state: running (24/7 production service)
- Actual OpState: down
- Alert: Error alert fires → "Service TradingPartner-MPG crashed unexpectedly at 2024-10-16 14:23:47 UTC"
- Actions: Page the on-call engineer via PagerDuty; investigate service logs via Remote Action "View Service Logs"
Scenario 2: Service manually stopped (unexpected)
- Expected state: running (24/7 production service)
- Actual OpState: stopped
- Alert: Warning alert fires → "Service TradingPartner-MPG manually disabled, investigate if intentional"
- Actions: Email the operations team; verify whether this is planned maintenance (if not, escalate to network ops)
Scenario 3: Service stopped during scheduled maintenance (expected)
- Expected state: stopped Saturday 2-6 AM (configured maintenance window)
- Actual OpState: stopped (Saturday 3:15 AM)
- Alert: No alert (expected state matches actual state)
Scenario 4: Service stuck in "starting" state
- Expected state: running
- Actual OpState: starting (15 minutes elapsed)
- Alert: Warning alert fires → "Service TradingPartner-MPG stuck starting for 15 minutes, possible configuration issue"
- Actions: Investigate DataPower logs; check backend dependencies (database connections, MQ queue managers)
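A minimal sketch of that comparison, assuming a per-service expected state of running or stopped and an assumed 10-minute grace period for the transient starting state (the agent's actual rules and thresholds may differ):

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

STARTING_GRACE = timedelta(minutes=10)  # assumed grace period for the transient "starting" state

def evaluate(expected: str, op_state: str,
             starting_since: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (severity, message) for one poll; expected is 'running' or 'stopped'."""
    if op_state == "starting":
        if starting_since and datetime.utcnow() - starting_since > STARTING_GRACE:
            return "warning", "stuck in starting state, possible configuration issue"
        return "ok", "service initializing (transient)"
    if expected == "running" and op_state == "down":
        return "error", "service crashed unexpectedly"
    if expected == "running" and op_state == "stopped":
        return "warning", "service manually disabled, investigate if intentional"
    if expected == "stopped" and op_state == "up":
        return "warning", "running outside expected window (resource waste / security)"
    return "ok", "actual state matches expected state"

print(evaluate("running", "down"))      # Scenario 1 -> ('error', ...)
print(evaluate("stopped", "stopped"))   # Scenario 3 -> ('ok', ...)
```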
Expected State Configuration
Configure per-service expected state for intelligent alerting:
Production Services (24/7 uptime)
- Expected state: Running 24/7
- Alert if: OpState = down/stopped at any time
- Use case: Payment gateway, customer-facing APIs, partner EDI connections
Development Services (Business hours only)
- Expected state: Running Mon-Fri 8 AM - 6 PM; stopped outside business hours and on weekends
- Alert if:
  - OpState = stopped during business hours (should be running)
  - OpState = running outside business hours (wasting resources, potential security issue)
- Use case: Development/QA environments with limited operating hours
Scheduled Maintenance Windows
- Expected state: Running except Saturday 2-6 AM weekly
- Alert if: OpState = down/stopped outside the maintenance window
- Use case: Production services with scheduled patching/backups
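A sketch of how the three profiles above could translate into an expected-state check; the profile names and exact boundary handling are assumptions:

```python
from datetime import datetime

def expected_running(profile: str, now: datetime) -> bool:
    """Is the service expected to be up at 'now'? Profiles mirror the three cases above."""
    if profile == "prod_24x7":
        return True
    if profile == "dev_business_hours":        # Mon-Fri, 8 AM - 6 PM
        return now.weekday() < 5 and 8 <= now.hour < 18
    if profile == "prod_with_maintenance":     # down only during Saturday 2-6 AM window
        in_window = now.weekday() == 5 and 2 <= now.hour < 6
        return not in_window
    raise ValueError(f"unknown profile: {profile}")

# Saturday 3:15 AM falls in the maintenance window -> service expected to be stopped
print(expected_running("prod_with_maintenance", datetime(2024, 10, 19, 3, 15)))  # False
```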
Alert Email Example
When a service crashes unexpectedly, the operations team receives an email like this:
Subject: CRITICAL: DataPower Service TradingPartner-MPG DOWN
Body:
Alert: DataPower service failure detected
Appliance: Prod-Primary
Domain: TradingPartner
Service Name: TradingPartner-MPG
Service Class: MultiProtocolGateway
Previous State: up (running normally)
Current State: down (service crashed)
State Change Time: 2024-10-16 14:23:47 UTC
Expected State: Running 24/7 (production service)
Possible Causes:
- OutOfMemoryError (Java heap exhaustion from memory leak)
- Configuration error (invalid backend URL, missing certificate)
- Backend service unreachable (database down, MQ queue manager stopped)
Immediate Actions:
1. Check service logs via Nodinite Remote Action "View Service Logs"
2. Review recent configuration changes in DataPower domain "TradingPartner"
3. Verify backend service availability (database ping, MQ queue manager status)
4. Restart service if transient issue, escalate to development team if recurring
View service health history in Nodinite Monitor View:
https://nodinite.company.com/monitor/datapower-services/TradingPartner-MPG
Last known good state: 2024-10-16 14:18:32 UTC (5 minutes ago)
Service uptime (last 30 days): 99.87% (3 outages totaling 56 minutes)
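The uptime figure in the email follows from the outage minutes over the 30-day window; a quick check:

```python
# Uptime over a 30-day window with 56 minutes of total outage
window_minutes = 30 * 24 * 60          # 43,200 minutes
downtime_minutes = 56
uptime = 100 * (window_minutes - downtime_minutes) / window_minutes
print(f"{uptime:.2f}%")                # 99.87%
```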
Scenario: Manufacturing EDI Service Outage
Challenge: A manufacturing company processes EDI X12 850 Purchase Orders from customers via a DataPower Multi-Protocol Gateway. Service crashes went unnoticed for hours because health checks were performed manually only twice a day, leaving customers unable to send orders during outages.
Problem:
- Nov 2, 2023, 6:15 AM: TradingPartner-MPG service crashed (OutOfMemoryError from memory leak in XSLT transformation)
- Service down for nearly 6 hours (the outage was only discovered at the next manual health check at 12:00 PM)
- Customer impact: 47 Purchase Orders delayed (customers frustrated, switched to competitors)
- Revenue impact: $25K SLA penalty (guaranteed 99.9% uptime, actual 99.1% that month)
Solution:
- Configured service health monitoring with 5-minute polling interval
- Set expected state "Running 24/7" (production service)
- Alert routing: Error alerts → PagerDuty page on-call engineer (escalate after 15 minutes if not acknowledged)
Results:
- 5-minute outage detection (the next poll after the 6:15 AM crash fired an alert at 6:20 AM)
- $25K SLA penalty avoidance (service uptime 99.95%, exceeds 99.9% SLA requirement)
- 47 delayed orders prevented (on-call engineer acknowledged alert 6:22 AM, restarted service 6:28 AM, 13-minute total outage)
Related Topics
- Prevent SLA Penalties - Service Monitoring - More real-world scenarios and best practices
- DataPower Monitoring Agent Installation - Step-by-step resource creation guide, SOMA API configuration
- Remote Actions Without SSH Access - View Service Logs, check service error messages without SSH access
- Alert Plugins Configuration - Configure PagerDuty for on-call engineer escalation
Next Steps
- Create Resource: Set up service health monitoring for your critical DataPower services
- Configure Polling: Set 5-minute polling interval for production services, adjust for development
- Set Expected States: Configure per-service expected state (24/7 vs business hours)
- Alert Routing: Configure email/Slack/PagerDuty alerts for service failures
- Monitor Dashboard: Create a service health dashboard to track uptime trends
For more scenarios, see Prevent SLA Penalties - Service Monitoring under Related Topics above.