
DIALOGE — O: Operations (Solution Health)

A solution that works on launch day but has no operational model is not an enterprise solution. It is a countdown to an incident.

TL;DR

Design for operability before go-live — not after the first incident. Connect Application Insights to every production canvas app and critical flow. Build meaningful error messages. Implement alerting for failures. Define a support model (L1–L4). Plan for maintenance, deprecation, and eventual retirement. Solutions are products, not projects.

Applies To

Audience: Solution Engineer · Solution Owner · Platform Lead
BOLT Tiers: Tier 2–4 (Tier 1 covered by platform monitoring)
Maturity: Intermediate → Advanced
Frameworks: DIALOGE · SCALE-OPS (Operations — platform-level)

Solution Operations vs Platform Operations

This page covers solution-level operational health — monitoring, support, and maintenance of individual solutions. For platform-level operational health (tenant capacity, environment management, CoE operations, service health), see SCALE-OPS Operations →.


What Operations Means in DIALOGE

Operations is what happens after Go-Live. It is the discipline of keeping a solution healthy, reliable, and trusted over time — through monitoring, support, incident response, maintenance, and eventually retirement.

Most solutions are designed for launch. Few are designed for what comes after. The canvas app that served fifty users in its first month is serving five hundred in its second. The flow that processed ten records a day is now processing a thousand. The plugin that worked perfectly in testing is producing intermittent failures in production that nobody can explain, because there is no telemetry.

Enterprise solutions are not projects — they are products. Products need operational discipline from day one, not bolted on after the first incident.

The distinction from SCALE-OPS Operations: SCALE-OPS Operations covers platform-level operational health — tenant capacity, environment management, CoE operations, service health across the entire Power Platform estate. DIALOGE Operations covers solution-level operational health — the monitoring, support, and maintenance of individual solutions running on that platform. Different scope, complementary concerns. A well-operated platform hosting poorly operated solutions is still a liability.


The Operations Mindset — Design for Operability

The single most effective investment in operations is made before go-live — in the design decisions that determine how observable, supportable, and maintainable a solution will be.

Operability is a design requirement, not a post-launch task. Solutions that are hard to monitor, diagnose, and support are hard to operate — and that difficulty compounds over time as the solution grows and the original builders move on.

Design for operability means:

  • Telemetry by design — build logging and instrumentation into the solution from the start, not as an afterthought when something breaks
  • Meaningful errors — errors that tell users and support teams what happened and what to do next, not generic "something went wrong" messages
  • Graceful degradation — solutions that behave sensibly when dependencies fail, rather than crashing or corrupting data
  • Configuration over code — separating operational settings (thresholds, recipient lists, environment-specific values) from logic so operations teams can adjust them without a developer and a deployment cycle
  • Self-service first — designing for users to resolve common issues themselves, reducing the support burden on the solution team
  • Operational documentation — runbooks, support guides, and known issue logs maintained alongside the solution, not as a post-launch documentation project that never happens

Observability Toolkit

Knowing when a solution is unhealthy before users report it is the operational ideal. Power Platform provides a layered observability toolkit — each layer providing a different type of signal at a different level of granularity.

Application Insights

Application Insights is the enterprise telemetry platform for Power Platform solutions. Canvas apps and cloud flows can be configured to emit telemetry to an Azure Application Insights instance — providing custom dashboards, error rate tracking, latency monitoring, user session analytics, and proactive alerting.

What Application Insights provides beyond native Power Platform monitoring:

  • Custom error tracking — every unhandled error in a canvas app logged with context, user, and session details
  • Performance monitoring — page load times, formula execution times, connector response times
  • Usage analytics — which screens are used, which flows are triggered, by whom and how often
  • Anomaly detection — automated alerts when error rates or latency deviate from baseline
  • Correlation across components — tracing a user action in a canvas app through a flow invocation to a Dataverse operation

The enterprise requirement: Every production canvas app and every critical cloud flow should have Application Insights configured before go-live. Attempting to diagnose production issues without telemetry is significantly harder than with it — and retroactively adding telemetry to a running production solution introduces deployment risk.

Practical guidance:

  • Use a dedicated Application Insights instance per solution or solution group — not a shared instance that mixes telemetry from unrelated solutions
  • Define custom events for business-significant actions — not just technical errors. Knowing that an approval was submitted, completed, or timed out is operationally valuable.
  • Set alert rules in Application Insights for error rate thresholds, availability failures, and performance degradation — proactive notification before users report problems
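The threshold logic such an alert rule encodes can be sketched in a few lines. This is a conceptual Python sketch, not Application Insights configuration itself; the 5% threshold is illustrative, and treating a zero-session window as an alert condition is an assumption aligned with the zero-activity alerting described later on this page.

```python
from dataclasses import dataclass

@dataclass
class SessionStats:
    """Aggregated telemetry for one evaluation window."""
    total_sessions: int
    sessions_with_errors: int

def should_alert(stats: SessionStats, threshold: float = 0.05) -> bool:
    """Return True when the session error rate exceeds the threshold.

    A window with no sessions at all is itself a signal: the app (or
    its telemetry pipeline) may not be running, so it also alerts.
    """
    if stats.total_sessions == 0:
        return True  # zero activity: no telemetry received
    return stats.sessions_with_errors / stats.total_sessions > threshold
```

In a real alert rule the same comparison is expressed as a Kusto query over the telemetry tables rather than application code.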

Plugin Trace Log

The plugin trace log captures server-side execution details for Dataverse plugins and low-code plugins — the diagnostic surface for understanding what happened on the server when a record operation failed or behaved unexpectedly.

Enabling and using plugin trace logs: Plugin trace logging must be enabled at the environment level — it is off by default in production environments for performance reasons. For production solutions with complex plugin logic, enable trace logging at the Exception level (logs only when plugins throw errors) as a permanent setting, and escalate to All for targeted diagnostic sessions.

What the plugin trace log captures:

  • Plugin execution context — which entity, which message, which stage, which user
  • Exception details — the full stack trace when a plugin throws an error
  • Custom trace messages — developers can write diagnostic information to the trace log from within plugin code, providing execution context that is invaluable during incident investigation

The enterprise implication: Plugin failures surface to users as cryptic Dataverse error messages unless the plugin explicitly handles and communicates errors. A plugin trace log entry is often the first — and sometimes only — diagnostic artefact available when investigating a production plugin failure. Treat plugin trace logging as essential infrastructure for any solution with server-side plugin logic.

System Job Logs

System jobs are Dataverse's asynchronous processing framework — handling async plugin execution, bulk delete operations, calculated field updates, email processing, and other background operations.

Monitoring system jobs: Failed system jobs appear in the System Jobs view in the Power Platform Admin Center and within Dataverse itself. For solutions that depend on async processing — bulk operations, scheduled jobs, async plugins — system job monitoring must be part of the operational process.

Common system job failure patterns:

  • Async plugins that fail due to transient dependency errors — retry logic reduces but does not eliminate these
  • Bulk delete jobs that fail due to record locks or cascade rule conflicts
  • Calculated rollup column refresh jobs that fall behind under high write volume

A systematic review of failed system jobs — weekly for standard solutions, daily for high-volume or mission-critical solutions — surfaces operational issues before they accumulate into data integrity problems.
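The review itself can be partially automated by grouping failures, so that recurring systemic issues stand out from one-off transient errors. A minimal sketch, assuming failed jobs have been exported as records with an error message field (the field name is illustrative):

```python
from collections import Counter

def failure_patterns(failed_jobs: list[dict], top: int = 3) -> list[tuple[str, int]]:
    """Group failed system jobs by error message and return the most
    frequent patterns. A message appearing many times across jobs
    points to a systemic issue rather than a transient failure."""
    counts = Counter(job["error"] for job in failed_jobs)
    return counts.most_common(top)
```

A weekly report built this way turns a scroll through the System Jobs view into a ranked list of issues to investigate.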

Flow Run History

Every Power Automate cloud flow run is logged with execution status (succeeded, failed, cancelled), trigger time, duration, and for failed runs, the specific action that failed and the error details.

Operational use of flow run history:

  • Daily review of failed runs for critical flows — a single failed run in a nightly batch process may have skipped records that need manual remediation
  • Duration trend analysis — flows that are taking progressively longer to execute indicate performance degradation in a connected system or data volume growth
  • Failure pattern analysis — the same action failing repeatedly across multiple runs points to a systemic issue (connector outage, permission change, data quality problem) rather than a transient error

The retention limitation: Flow run history has a limited retention window — 28 days for most licence tiers. For solutions requiring longer operational history or audit trail, implement custom logging — writing flow execution records (status, duration, records processed, errors) to a Dataverse table or Azure Log Analytics workspace.

Monitor Tool — Real-Time Canvas App Diagnostics

The Power Apps Monitor tool provides real-time diagnostic visibility into canvas app execution — formula evaluations, network requests, connector calls, and error details — as the app is running.

Operational use cases:

  • Diagnosing intermittent errors that users report but cannot reproduce consistently — Monitor captures the exact execution state at the moment of failure
  • Performance investigation — identifying which connector calls or formula evaluations are contributing to slow screen load times
  • Delegation issue diagnosis — confirming whether queries are delegating to the data source or executing locally

Monitor is a development and diagnostic tool — it is not a continuous monitoring solution. It is the right tool for targeted investigation of specific reported issues, not for ongoing operational visibility (Application Insights serves that purpose).


Audit and Change Tracking

Enterprise solutions require the ability to answer two questions reliably: what happened to the data? and who changed the configuration?

Data Audit

Dataverse auditing captures a record-level history of every create, update, and delete operation — who changed what, when, and what the previous value was. This is the operational data trail that support teams use to investigate data quality issues, and the compliance evidence that audit teams require for regulated workloads.

Operational use of data audit:

  • Investigating reported data anomalies — "this field was correct yesterday and wrong today, what changed it?"
  • Support case resolution — reconstructing the sequence of events that led to a data quality issue
  • Compliance response — providing evidence of data handling to regulators or internal audit

The full governance treatment of Dataverse auditing — enabling, retention, and compliance mapping — is covered in D — Data and SHIELD — Enforce. From an operations perspective, the key requirement is that auditing is enabled before go-live and that the operations team knows how to query audit history when needed.

Change Audit and Configuration History

Beyond data changes, enterprise solutions require visibility into configuration changes — who modified a flow, who changed a business rule, who updated a canvas app formula in production.

Solution-level change tracking: All solution changes should flow through the ALM pipeline — meaning every change to a production solution is traceable to a source control commit, a pipeline run, and an approval record. Solutions modified directly in production, outside of the pipeline, produce changes that are not tracked, not reversible, and not attributable. This is both an operations problem and a governance violation.

Environment-level change tracking: The Power Platform Admin Center logs environment-level administrative actions — environment creation and deletion, DLP policy changes, admin role assignments. These logs are accessible to environment and tenant administrators and should be reviewed as part of routine operational oversight.


Failure Alerting — Proactive Detection Before Users Notice

The difference between a well-operated enterprise solution and a poorly operated one is largely determined by who discovers problems first — the operations team or the users. Proactive failure alerting is the mechanism that shifts discovery from reactive to proactive.

Application Insights alerts: Configure alert rules in Application Insights for:

  • Error rate exceeding a defined threshold (e.g. more than 5% of sessions encountering errors)
  • Specific exception types that indicate critical failures
  • Response time degradation beyond acceptable thresholds
  • Zero activity — no telemetry received for a solution that should be active (indicating the solution or its trigger is not running)

Flow failure alerts: Power Automate sends email notifications to flow owners when flows fail — but this default notification goes to a personal email and depends on the flow owner being the right person to receive it. For enterprise solutions, replace the default notification with a structured alert:

  • Configure a dedicated shared mailbox or Teams channel as the alert destination
  • Build a monitoring flow that queries flow run history on a schedule and alerts when failure rates exceed a threshold
  • For critical flows — implement an explicit failure branch that sends a structured alert with the error context, affected records, and remediation guidance

Suspended flow alerts: Power Automate suspends flows when they encounter repeated failures, connector authentication errors, or licence issues. Suspended flows stop processing entirely — silently, unless someone is monitoring for them. The platform sends an email notification to the flow owner when a flow is suspended, but this notification is easy to miss in a personal inbox.

Enterprise operations practice: configure a Power Automate flow that monitors for suspended flows across the solution's environment and sends a structured alert to the operations team. A suspended flow in a mission-critical process is an incident, not an email.

Dataverse capacity alerts: Configure storage alerts in the Power Platform Admin Center before environments approach capacity limits. Storage overages block solution imports and can prevent record creation — discovering a capacity issue when it causes a production failure is significantly more disruptive than discovering it at 80% utilisation.


Platform-Initiated Signals — Staying Ahead of Change

Microsoft continuously evolves Power Platform — releasing new capabilities, deprecating old ones, issuing security updates, and communicating planned maintenance. Operations teams that monitor these signals proactively avoid the surprises that catch organisations which only discover changes after they take effect.

Message Center

The Microsoft 365 Message Center is the primary channel through which Microsoft communicates upcoming changes to Power Platform — new features being enabled by default, behaviours being deprecated, required actions, and timeline-sensitive notifications.

Operational discipline for Message Center:

  • Assign a named owner to monitor Message Center for Power Platform communications — this is not a task that should be distributed across the team without a clear owner
  • Triage messages by impact: required action vs informational, and by timeline
  • For deprecation notices — assess impact on existing solutions immediately and plan remediation before the deprecation date, not after
  • Feed relevant Message Center items into the solution team's backlog — deprecations and breaking changes are maintenance work items, not optional reading

Service Health Dashboard

The Power Platform Service Health dashboard in the Microsoft 365 admin portal provides real-time and historical visibility into platform availability — active incidents, degraded performance, planned maintenance, and post-incident reviews.

Operational use:

  • When users report intermittent errors or performance issues — check Service Health before beginning solution-level investigation. A platform incident explains many user-reported issues and eliminates unnecessary diagnostic work.
  • Subscribe to service health notifications for Power Platform services — email or webhook notifications when incidents are raised and resolved
  • Include platform health status in operational reporting — stakeholders deserve to know when solution issues are caused by platform incidents outside the solution team's control

Power Platform Admin Center Recommendations

The Power Platform Admin Center surfaces automated recommendations — governance gaps, unused environments, capacity forecasts, connector usage anomalies, and security posture observations. These recommendations are generated by Microsoft's platform analytics and represent signals that would otherwise require manual analysis to surface.

Operational practice: Review PPAC recommendations on a monthly cadence as part of the solution health review process. Recommendations are not mandatory — but ignored recommendations accumulate into operational risk. An environment flagged as having no DLP policy is a risk that was visible before an incident; an unused flow running under a departed employee's credentials is a security and operational concern that the recommendation surfaced.

Suspended Flow and Orphaned Resource Notifications

Beyond the suspended flow email notifications described in the failure alerting section, the Power Platform ecosystem generates several categories of automated notification that operations teams must act on:

Flow owner departure notifications: When a user account is disabled or deleted in Azure AD, flows owned by that user are at risk of suspension — connector credentials tied to personal accounts expire, and flows running under the departed user's context may lose permissions. The CoE Starter Kit's orphaned resource detection surfaces these proactively.

Licence-related flow suspension: Flows using premium connectors require premium licences. When a user's licence changes — through licence reassignment, organisation policy change, or licence expiry — dependent flows are suspended. Monitor licence changes in the Microsoft 365 admin portal and assess impact on dependent flows before the change takes effect.

AI Builder credit depletion alerts: Configure alerts in the Power Platform Admin Center for AI Builder credit consumption. AI Builder credits are a shared tenant resource — a widely distributed flow using AI Builder actions can deplete the allocation unexpectedly, blocking all other solutions using AI Builder.


Design for Reduced Support

The most effective support strategy is designing solutions that require less support. Every support request that reaches a human represents a failure of either the solution design or the self-service capability — both of which are addressable in the design phase.

Error Handling and Graceful Degradation

Every integration point, every external dependency, and every complex operation in a solution will fail at some point. The question is not whether it will fail — it is whether the solution handles the failure gracefully or exposes it to the user as an unrecoverable error.

Minimum error handling requirements for enterprise solutions:

In canvas apps:

  • Use the IfError function to catch formula errors and provide meaningful feedback rather than propagating errors to the user
  • Validate user input before submission — surface validation errors as inline messages, not system errors
  • Handle connector timeouts and failures explicitly — show a meaningful message and offer a retry option
  • Never surface raw system error messages to end users — translate technical errors into human-readable guidance

In cloud flows:

  • Implement try/catch scope patterns — wrap critical actions in scopes configured to run on failure
  • On failure: log the error with context (which record, which user, what operation), send a structured alert to the operations team, and where possible, queue the failed item for retry or manual remediation
  • Design idempotent operations — operations that can be safely retried without producing duplicate results
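Idempotency is the property that makes retry-on-failure safe. A minimal sketch of the pattern, assuming each record carries a stable identifier that can serve as an idempotency key (the in-memory set stands in for a durable processed-items store):

```python
# Durable record of work already done; in a flow this would be a
# status field or a tracking table, not an in-memory set.
processed: set[str] = set()

def process_once(record_id: str, action) -> bool:
    """Apply `action` to a record at most once, keyed on its id.

    Re-running a failed batch is then safe: records processed before
    the failure are skipped instead of duplicated."""
    if record_id in processed:
        return False  # safe no-op on retry
    action(record_id)
    processed.add(record_id)
    return True
```

The same idea applies whether the "action" is creating a Dataverse record, sending an email, or posting to an external API: check for prior completion before acting, not after.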

In plugins:

  • Catch expected exceptions and throw informative InvalidPluginExecutionException messages — these surface to users as readable error messages rather than generic system errors
  • Log diagnostic information to the plugin trace log before throwing exceptions — the trace log is often the only forensic evidence available after a failure

Meaningful Error Messages

Generic error messages ("An error occurred. Please try again.") are operationally useless — they tell the user nothing about what to do and the support team nothing about what happened. Every error state in an enterprise solution should have a specific, actionable message:

  • What happened (in plain language, not technical detail)
  • What the user should do next (retry, contact support, provide specific information)
  • A reference code that correlates to a log entry — enabling support teams to find the specific error in telemetry without asking the user for technical details
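The reference-code pattern can be sketched as follows. The code format is illustrative, and the dictionary stands in for whatever telemetry store the solution actually logs to:

```python
import uuid

def build_error_response(log: dict, user_message: str, detail: str) -> str:
    """Attach a short reference code to a user-facing error message and
    record the technical detail against the same code, so support can
    find the exact log entry without quizzing the user."""
    ref = uuid.uuid4().hex[:8].upper()  # short, readable correlation id
    log[ref] = detail                   # technical detail goes to the log only
    return f"{user_message} (reference: {ref})"
```

The user sees plain-language guidance plus a code they can quote; the stack trace, connector response, and record ids stay in the log where they belong.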

Configuration Over Code

Operational settings that change without a code change — notification thresholds, recipient email addresses, environment-specific endpoints, feature flags, approval timeouts — should be stored in configuration, not hardcoded in logic.

In Power Platform, configuration belongs in:

  • Environment variables — solution-aware configuration that varies by environment (Dev/Test/Production)
  • Dataverse configuration tables — operational settings that business users or operations teams can update without a developer
  • SharePoint lists or Azure App Configuration — for settings that need to be accessible outside the solution

The operational benefit: when a threshold needs adjusting or a notification recipient changes, the operations team makes the change in configuration without requesting a development cycle and a deployment.
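The lookup pattern behind configuration-over-code can be sketched as an environment-specific layer over defaults. The keys and values here are illustrative, standing in for environment variables or a Dataverse configuration table:

```python
CONFIG = {
    # Defaults apply everywhere; per-environment entries override them.
    "default":    {"approval_timeout_hours": 48,
                   "alert_recipient": "ops@example.com"},
    "production": {"approval_timeout_hours": 24},
}

def get_setting(environment: str, key: str):
    """Environment-specific value wins; otherwise fall back to default."""
    env = CONFIG.get(environment, {})
    if key in env:
        return env[key]
    return CONFIG["default"][key]
```

Changing the production approval timeout is then a data edit, not a code change followed by a deployment.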


Support Model

Every enterprise solution needs a defined support model — a clear answer to the question "who do users contact when something goes wrong, and what happens next?"

L1 — User Self-Service

The first line of support is no human contact at all. Well-designed solutions include:

  • In-app help and guidance — contextual help text, tooltips, and guidance panels that answer common questions within the solution itself
  • Knowledge base and user documentation — accessible from within the solution, not buried in a SharePoint site nobody knows exists
  • Copilot Studio support agent — a solution-embedded agent that answers common questions, guides users through processes, and escalates to human support when it cannot resolve the issue
  • Meaningful error messages — as described above, errors that tell users what to do next rather than requiring a support contact

L2 — Solution Support Team

When self-service cannot resolve the issue, the first human contact should be a team with solution-specific knowledge — typically the CoE team, business analysts, or designated solution champions within the business unit:

  • Trained on the solution's business purpose and common issue patterns
  • Able to resolve data issues, user access problems, and configuration questions
  • Empowered to make minor operational changes (adding users, adjusting configuration) without escalating to developers
  • Equipped with a runbook of known issues and resolutions

L3 — Developer Escalation

Issues that require code changes, plugin investigation, or architecture-level diagnosis escalate to the Solution Engineer or development team:

  • Access to plugin trace logs, Application Insights, and system job logs
  • Authority to make and deploy solution changes via the ALM pipeline
  • Responsible for root cause analysis and permanent fix — not just symptomatic resolution
  • Owner of the post-incident review and the resulting backlog items

L4 — Platform Team and Microsoft Support

Two distinct escalation paths exist at L4:

Platform / CoE team: Issues that involve platform-level configuration — DLP policies, environment settings, connector approvals, licence allocation — escalate to the platform operations team (CoE Lead / Platform Admin). The solution team should not have direct access to platform-level settings; the platform team is the correct escalation path for platform-level issues.

Microsoft Support: Platform bugs, service incidents, unexpected platform behaviour, and licensing issues escalate to Microsoft via the Power Platform Admin Center support portal. Enterprise agreements typically include defined support response times. Key guidance:

  • Document the issue thoroughly before raising a support ticket — environment details, reproduction steps, error messages, plugin trace logs, and the timeline of when the issue started
  • Distinguish between a solution bug (owned by the solution team) and a platform bug (owned by Microsoft) — raising Microsoft support tickets for solution-level issues wastes time and resolution cycles
  • Subscribe to service health notifications — many "Microsoft support" escalations are platform incidents already being worked on


Incident Management

When a solution failure has material business impact — users cannot complete their work, data has been corrupted, a critical process has stopped — the response must be structured, not ad hoc.

Incident severity levels for solutions:

Severity | Definition | Response Target
P1 — Critical | Solution completely unavailable or data corruption in progress. Material business impact. | Immediate — within 15 minutes
P2 — High | Major functionality unavailable. Workaround exists but significant impact. | Within 1 hour
P3 — Medium | Partial functionality degraded. Users can continue with reduced capability. | Within 4 hours
P4 — Low | Minor issue. No significant impact on business operations. | Next business day

Incident response steps:

  1. Detect — monitoring alert, user report, or platform notification
  2. Assess — determine severity and impact scope
  3. Communicate — notify affected users and stakeholders with status and expected resolution timeline
  4. Contain — implement immediate mitigation if available (disable a failing flow, route around a broken integration, apply a data fix)
  5. Resolve — implement and deploy the permanent fix via the ALM pipeline
  6. Review — post-incident review within 48 hours for P1/P2 incidents: what happened, why, what changes prevent recurrence
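The assessment step can be made mechanical by encoding the severity table. The boolean inputs below are illustrative simplifications of a real impact assessment, not a substitute for judgment:

```python
RESPONSE_TARGETS = {
    "P1": "within 15 minutes",
    "P2": "within 1 hour",
    "P3": "within 4 hours",
    "P4": "next business day",
}

def triage(solution_down: bool, data_corruption: bool,
           major_function_down: bool, degraded: bool) -> str:
    """Map impact observations to a severity level per the table above.
    Conditions are checked worst-first, so the highest applicable
    severity wins."""
    if solution_down or data_corruption:
        return "P1"
    if major_function_down:
        return "P2"
    if degraded:
        return "P3"
    return "P4"
```

Encoding the table this way also makes the classification auditable: the post-incident review can check that the severity assigned matches the impact observed.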

Post-incident review discipline: Post-incident reviews are the mechanism through which operational incidents become operational improvements. The review should produce specific, assigned backlog items — not general observations. "Improve error handling" is not a backlog item. "Add Application Insights alert for connector timeout errors in the invoice processing flow" is.


Ongoing Maintenance and Updates

Enterprise solutions are not static. The platform evolves, dependencies change, business requirements shift, and Microsoft continuously releases updates that affect solution behaviour. Maintenance is not optional — it is the cost of keeping a solution operational.

Managing Microsoft-Initiated Changes

Power Platform is a continuously updated SaaS platform. Microsoft releases updates that can affect solution behaviour — connector changes, API version deprecations, default behaviour changes, security updates. The Message Center is the primary signal; the operations team's job is to translate signals into action.

Deprecation management: When Microsoft announces a deprecation — a connector version being retired, an API endpoint changing, a feature being removed — assess impact immediately:

  • Which solutions use the deprecated capability?
  • What is the remediation path?
  • What is the deadline?
  • What is the deployment and testing plan?

Deprecations with multi-year notice windows are frequently treated as low priority until they become urgent. Enterprise operations discipline means tracking deprecations as time-bound backlog items from the moment they are announced.
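Tracking deprecations as time-bound items can be as simple as classifying each by time remaining. A sketch of that classification; the 90-day planning buffer is an illustrative assumption, not platform guidance:

```python
from datetime import date

def deprecation_urgency(deadline: date, today: date,
                        lead_time_days: int = 90) -> str:
    """Classify a tracked deprecation by time remaining so that
    'low priority until urgent' becomes an explicit state change
    the backlog review can act on."""
    remaining = (deadline - today).days
    if remaining < 0:
        return "overdue"
    if remaining <= lead_time_days:
        return "urgent"
    return "scheduled"
```

Re-running the classification at each health review turns a multi-year notice window into a sequence of planned state transitions rather than a surprise.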

Connector and API Updates

Connectors are updated by Microsoft and third-party publishers — sometimes with breaking changes. Monitor connector update notifications in the Message Center and test solutions against connector updates in a non-production environment before they reach production.

For solutions using custom connectors or direct HTTP calls to versioned APIs — establish a process for monitoring API version deprecations from the API provider and maintaining the connector definition accordingly.

Platform Feature Updates

New Power Platform features released by Microsoft occasionally require opt-in configuration or impact existing solution behaviour. Review feature update communications from the Message Center and assess:

  • Does this feature change default behaviour that our solution relies on?
  • Does this feature provide a capability that replaces a workaround we have built?
  • Does this feature require configuration to enable or disable for our environment?

Solution Dependency Updates

Solutions that integrate with external systems are dependent on those systems' stability and versioning. Establish a monitoring process for:

  • External API version changes that affect custom connectors
  • Third-party connector updates that change action schemas or authentication requirements
  • Azure service updates that affect Function Apps, Service Bus, or other Azure components in the integration architecture


Solution Health Reviews

Periodic structured assessment of solution health prevents the gradual accumulation of operational debt — the slow degradation of performance, increasing support volume, growing technical debt, and expanding security surface area that is invisible without deliberate review.

Recommended cadence:

  • Monthly — operational metrics review (error rates, support volume, performance trends, capacity consumption)
  • Quarterly — technical health review (deprecated components, security model review, licence utilisation, dependency currency)
  • Annually — strategic review (is the solution still fit for purpose? Does it need significant investment or should it be replaced?)

Health review inputs:

  • Application Insights dashboards — error rates, performance trends, usage patterns
  • Support ticket volume and category analysis — what are users struggling with?
  • Flow run history analysis — failure rates, duration trends
  • PPAC recommendations — governance and capacity flags
  • Message Center backlog — outstanding deprecations and required actions
  • Solution complexity metrics — number of flows, canvas app screens, plugins — complexity growth signals future maintenance overhead


Deprecation and Retirement

Solutions do not live forever. Business processes change, systems are replaced, and solutions built for a specific need become redundant. Unretired solutions that continue running beyond their useful life consume licences, storage, and support capacity — and create security surface area that serves no business purpose.

Retirement triggers:

  • The business process the solution supports has been replaced or eliminated
  • The solution has had no active users for a defined period (typically 90 days)
  • The solution has been superseded by a new solution or platform capability
  • The maintenance cost of the solution exceeds its business value

Retirement process:

  1. Notify — communicate the retirement timeline to all known users and stakeholders with adequate notice (typically 30–90 days depending on solution criticality)
  2. Archive — export the solution package and store it in source control for a defined retention period — solutions are sometimes needed for reference or data recovery after retirement
  3. Data handling — determine the disposition of solution data: archive to long-term storage, export for reporting, or delete per the data retention policy
  4. Decommission — disable and delete the solution, flows, apps, and environment resources
  5. Document — record the retirement date, reason, and data disposition in the solution inventory

The CoE Starter Kit inactivity process: The CoE Starter Kit includes automated detection of inactive apps and flows — sending notifications to owners and, after a defined non-response period, archiving or deleting inactive resources. This automates the discovery phase of retirement without requiring manual inventory reviews.
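The inactivity workflow described above reduces to a small decision rule. The sketch below is an illustration of that logic only — the thresholds and function name are assumptions, not CoE Starter Kit internals:

```python
from datetime import date, timedelta
from typing import Optional

# Illustrative thresholds — tune to your governance policy.
INACTIVITY_THRESHOLD = timedelta(days=90)   # no launches/runs for this long
OWNER_GRACE_PERIOD = timedelta(days=30)     # time the owner has to respond

def inactivity_action(last_used: date, notified_on: Optional[date],
                      owner_responded: bool, today: date) -> str:
    """Return the next step in the inactivity process for one app or flow."""
    if today - last_used < INACTIVITY_THRESHOLD:
        return "active"             # still in use — nothing to do
    if notified_on is None:
        return "notify-owner"       # first detection: ask the owner to confirm
    if owner_responded:
        return "keep"               # owner confirmed the solution is still needed
    if today - notified_on >= OWNER_GRACE_PERIOD:
        return "archive-or-delete"  # no response within the grace period
    return "await-response"
```

Running the rule on a schedule against the solution inventory automates the discovery phase of retirement without manual reviews.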


Maturity Levels

| Level | Description | Suitable For |
| --- | --- | --- |
| Basic | Flow run history reviewed reactively when issues are reported. Basic error messages. No Application Insights. Support handled informally. | Personal productivity and low-criticality departmental solutions |
| Intermediate | Application Insights configured. Failure alerts on critical flows. Defined support contact. Incident severity levels documented. Message Center monitored. | Team and departmental enterprise solutions with moderate criticality |
| Advanced | Full observability stack — Application Insights, plugin trace logs, system job monitoring, custom flow logging. Proactive alerting across all failure modes. Formal L1–L4 support model. Post-incident reviews. Quarterly health reviews. Deprecation backlog actively managed. All platform signals monitored and actioned. | Mission-critical enterprise solutions with formal operational governance |

Safe Zone

Solutions with low user counts, non-sensitive data, and low business criticality can operate at Basic maturity with informal support and reactive monitoring.

Any solution that meets one or more of the following must reach Intermediate or Advanced maturity before Go-Live:
- Serves more than 50 users
- Processes financially, legally, or operationally critical transactions
- Has regulatory or compliance obligations
- Integrates with mission-critical enterprise systems
- Has no viable manual workaround if the solution is unavailable
- Handles personally identifiable or sensitive data
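The Safe Zone criteria amount to a simple decision function: if any trigger applies, Basic maturity is off the table. A hedged sketch, where the field names are illustrative and would map onto whatever your solution inventory records:

```python
from dataclasses import dataclass

@dataclass
class SolutionProfile:
    # Field names are illustrative — map them to your own solution inventory.
    user_count: int
    critical_transactions: bool        # financially, legally, or operationally critical
    regulated: bool                    # regulatory or compliance obligations
    mission_critical_integrations: bool
    manual_workaround_exists: bool
    handles_sensitive_data: bool

def minimum_maturity(s: SolutionProfile) -> str:
    """Return the minimum operational maturity required before Go-Live."""
    if (s.user_count > 50
            or s.critical_transactions
            or s.regulated
            or s.mission_critical_integrations
            or not s.manual_workaround_exists
            or s.handles_sensitive_data):
        return "Intermediate or Advanced"
    return "Basic"
```

Note that a single trigger is enough — the criteria are joined by "or", not "and".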


Common Mistakes

  • No telemetry until the first major incident — solutions go live without Application Insights, and the first significant failure is diagnosed by guesswork rather than data. Retroactively adding telemetry to a running production solution is harder than building it in from the start.
  • Flow owners as individuals — flows owned by personal accounts fail silently when the person leaves. Use shared service accounts or team-owned connections for all production flows.
  • Suspended flows discovered by users — no monitoring for flow suspension means users discover the failure before the operations team does. A suspended critical flow is a silent incident.
  • Message Center as optional reading — deprecation notices read two weeks before the deprecation date rather than on the day they were published. Remediation work becomes emergency work.
  • No post-incident reviews — incidents resolved without structured review. The same failure pattern occurs three months later because nothing changed.
  • Generic error messages — "An error occurred" messages that tell users nothing and tell support teams nothing. Every error state deserves a specific, actionable message.
  • Hardcoded configuration — notification email addresses, thresholds, and environment-specific settings embedded in flow logic rather than environment variables. Changing them requires a developer and a deployment.
  • No retirement process — solutions running years past their useful life, consuming licences and storage, owned by people who have left, serving users who have moved to other tools. The CoE Starter Kit's inactivity process exists precisely to surface and address this.
  • Treating L4 as L1 — raising Microsoft support tickets for issues that are solution-level bugs rather than platform bugs. Wastes time and obscures the actual problem.
  • No capacity monitoring — Dataverse storage approaching its limit with no alerts configured. The first sign of a capacity issue is a solution import failing or record creation being blocked.
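The hardcoded-configuration mistake above has a straightforward remedy: read operational settings from the environment with sensible defaults, mirroring how Power Platform environment variables externalise flow configuration. A minimal sketch in Python — the variable names and defaults are illustrative assumptions:

```python
import os

def get_setting(name: str, default: str) -> str:
    """Read an operational setting from the environment, falling back to a default.
    Mirrors the role of Power Platform environment variables in flow logic."""
    return os.environ.get(name, default)

# Illustrative settings — changing them is an ops action, not a deployment.
ALERT_MAILBOX = get_setting("OPS_ALERT_MAILBOX", "ops-team@example.com")
CAPACITY_ALERT_THRESHOLD = float(get_setting("OPS_CAPACITY_THRESHOLD", "0.8"))
```

With this pattern, retargeting the notification mailbox or tuning a threshold is a configuration change rather than developer work followed by a deployment.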

Readiness Checklist

Observability
- [ ] Application Insights configured for all production canvas apps and critical flows
- [ ] Custom events defined for business-significant actions — not just technical errors
- [ ] Plugin trace logging enabled at Exception level for all environments with plugin logic
- [ ] System job monitoring process defined — failed job review cadence established
- [ ] Flow run history retention strategy decided — custom logging implemented where 28-day retention is insufficient
- [ ] Monitor tool familiarisation — operations team knows how to use it for live diagnostics
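Where the platform's 28-day flow run history is insufficient, critical flows can write their own run log to a table the team controls. A sketch of what such a row might contain — the column names are illustrative, not a prescribed schema:

```python
from datetime import datetime, timezone
from typing import Optional

def flow_run_log_entry(flow_name: str, run_id: str, status: str,
                       error: Optional[str] = None) -> dict:
    """Shape of a custom run-log row a critical flow writes to its own log
    table, so failures remain queryable beyond the 28-day run history."""
    return {
        "flow_name": flow_name,
        "run_id": run_id,
        "status": status,          # e.g. "Succeeded", "Failed", "Cancelled"
        "error_message": error,
        "logged_at_utc": datetime.now(timezone.utc).isoformat(),
    }
```

The flow's catch scope would write one of these rows on failure, giving health reviews a durable failure-rate dataset.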

Alerting
- [ ] Application Insights alert rules configured — error rate, performance degradation, zero activity
- [ ] Flow failure alerts configured to shared team channel or mailbox — not personal email
- [ ] Suspended flow monitoring implemented — not reliant on personal email notifications
- [ ] Dataverse capacity alerts configured at 80% threshold
- [ ] AI Builder credit consumption alerts configured
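The 80% capacity threshold in the checklist is just a ratio check; the point is that it fires well before record creation is blocked. A minimal sketch of the rule — the function and its parameters are illustrative:

```python
def capacity_alert(used_gb: float, entitled_gb: float,
                   threshold: float = 0.8) -> bool:
    """True when Dataverse storage consumption crosses the alert threshold.
    Alerting at 80% leaves headroom to act before imports or record
    creation start failing."""
    return entitled_gb > 0 and used_gb / entitled_gb >= threshold
```

Whatever mechanism evaluates this (a scheduled flow, a script against capacity reports), the alert should land in a shared team channel, not a personal mailbox.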

Platform Signals
- [ ] Message Center owner assigned — named individual responsible for Power Platform communications
- [ ] Service Health notifications subscribed — email or webhook for Power Platform services
- [ ] PPAC recommendations review scheduled — monthly cadence
- [ ] Deprecation backlog established — all known deprecations tracked as time-bound items

Error Handling and Design
- [ ] Canvas app error handling implemented — IfError, meaningful messages, retry options
- [ ] Flow try/catch scopes implemented on all critical actions
- [ ] Plugin exception handling implemented — informative error messages, trace log entries
- [ ] Environment variables used for all operational configuration — no hardcoded settings
- [ ] Self-service support materials created — in-app help, knowledge base, or support agent

Support Model
- [ ] L1 self-service capability designed and tested
- [ ] L2 support team identified and trained
- [ ] L3 developer escalation path defined
- [ ] L4 escalation paths documented — platform team contacts, Microsoft support process
- [ ] Support contact information accessible to users within the solution

Incident Management
- [ ] Incident severity levels defined for this solution
- [ ] Incident response runbook documented
- [ ] Post-incident review process established
- [ ] Communication template prepared for user-facing incident notifications

Maintenance
- [ ] Deprecation management process established — Message Center to backlog workflow
- [ ] Connector and API version monitoring process defined
- [ ] Solution health review cadence scheduled — monthly operational, quarterly technical

Retirement
- [ ] Inactivity monitoring configured — CoE Starter Kit or equivalent
- [ ] Retirement process documented — notification, archival, data disposition, decommission
- [ ] Data retention policy defined for solution data


Part of the DIALOGE Framework — powerplatform.wiki
Last updated: March 2026 · Last reviewed: March 2026