Job Description
- Synthetic monitor failures – If site functionality is degraded or down, follow the associated runbook to remediate. If no runbook exists, or the runbook doesn’t address the current issue, escalate.
- Website monitor failures – If the website is unreachable, follow the runbook to assess the situation. Escalation is required in addition to the runbook steps.
- Server health monitor failures – Respond to CPU, disk, and memory alerts. Follow the associated runbook to remediate, or escalate if no runbook exists or the runbook doesn’t address the current issue.
- Network monitor failures – Network alerts must be escalated.
- Continuous monitoring improvement – Incident postmortems and monitoring audits identify where monitoring can be improved. Act on the findings by creating new monitors and tuning existing ones.
- Monitoring data review – Determine whether there are gaps in log data that need to be addressed as part of continuous monitoring improvement.
- Documentation – As new services and processes launch, make sure runbooks are in place to support them. Existing runbooks need continuous refinement as well.
Job Requirements
- Bachelor’s degree in computer science or equivalent work experience.
- At least 1 year of experience administering multiple monitoring systems such as SCOM, SolarWinds, Nimsoft, or PRTG.
- Knowledge of Windows, Linux, database, and network infrastructure.
- Ability to work a variety of shifts, including days, nights, weekends, and holidays, to support a 24x7x365 environment.
- Shows initiative and has a strong desire to share knowledge with others.
- Attention to detail; maintains high-quality work while handling a large volume of alerts.