The following example expressions demonstrate some of the
capabilities of the inference engine.
The directory $PCP_DEMOS_DIR/pmie contains a number of other
annotated examples of pmie expressions.
The variable delta controls expression evaluation frequency.
Specify that subsequent expressions be evaluated once a second,
until further notice:
delta = 1 sec;
If the total context switch rate exceeds 10000 per second per
CPU, then display an alarm notifier:
kernel.all.pswitch / hinv.ncpu > 10000 count/sec
-> alarm "high context switch rate %v";
If the high context switch rate is sustained for 10 consecutive
samples, then launch top(1) in an xterm(1) window to monitor
processes, but do this at most once every 5 minutes:
all_sample (
kernel.all.pswitch @0..9 > 10 Kcount/sec * hinv.ncpu
) -> shell 5 min "xterm -e 'top'";
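The combination of a sample window and an action holdoff in the
rule above can be pictured with a small Python sketch. This is
only an illustration of the idea, not pmie's implementation; the
names make_sustained_rule and evaluate are hypothetical, and the
4-CPU machine and 1-second sampling are assumptions.

```python
from collections import deque


def make_sustained_rule(window, threshold, holdoff_sec):
    """Return an evaluator that fires when the last `window` samples all
    exceed `threshold`, but at most once per `holdoff_sec` seconds."""
    samples = deque(maxlen=window)   # most recent `window` samples
    last_fired = None                # time of the previous firing, if any

    def evaluate(now, value):
        nonlocal last_fired
        samples.append(value)
        if len(samples) < window:
            return False             # not enough history yet
        if not all(v > threshold for v in samples):
            return False             # some sample did not exceed the threshold
        if last_fired is not None and now - last_fired < holdoff_sec:
            return False             # still inside the holdoff period
        last_fired = now
        return True

    return evaluate


# Assumed: 4 CPUs, so the threshold is 10000 * 4 switches per second,
# sampled once a second with a constant rate of 50000 per second.
rule = make_sustained_rule(window=10, threshold=10000 * 4, holdoff_sec=300)
fired_at = [t for t in range(400) if rule(t, 50000)]
```

With a constantly high rate, the action first runs once 10
samples have been collected and then again only after the
5 minute holdoff has expired.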
The following rules are evaluated once every 20 seconds:
delta = 20 sec;
If any disk is performing more than 60 I/Os per second, then
print a message identifying the busy disk to standard output and
launch dkvis(1):
some_inst (
disk.dev.total > 60 count/sec
) -> print "busy disks:" " %i" &
shell 5 min "dkvis";
Refine the preceding rule to apply only between the hours of 9am
and 5pm, and to require 3 of 4 consecutive samples to exceed the
threshold before executing the action:
$hour >= 9 && $hour <= 17 &&
some_inst (
75 %_sample (
disk.dev.total @0..3 > 60 count/sec
)
) -> print "disks busy for 20 sec:" " [%h]%i";
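The nesting of quantifiers in the refined rule can be read as
"for some disk, at least 75% of its last 4 samples exceed the
threshold, and the hour is in range". A minimal Python sketch of
that reading follows. The function names and the sample data are
hypothetical, and the sketch assumes that the percentage
quantifier means "at least this percentage".

```python
def in_hours(hour):
    # Mirrors $hour >= 9 && $hour <= 17. As written, this also
    # matches any time from 17:00 to 17:59.
    return 9 <= hour <= 17


def busy_disks(history, hour, threshold=60, pct=75):
    """history maps a disk name to its 4 most recent I/O rates."""
    if not in_hours(hour):
        return []
    busy = []
    for disk, rates in history.items():
        over = [rate > threshold for rate in rates]
        # Assumed meaning of "75 %_sample": at least 75% of samples true.
        if 100 * sum(over) >= pct * len(over):
            busy.append(disk)
    return busy


# Hypothetical sample data: I/Os per second for the last 4 samples.
history = {
    "sda": [70, 80, 50, 90],   # 3 of 4 samples above 60
    "sdb": [70, 50, 50, 90],   # 2 of 4 samples above 60
    "sdc": [10, 10, 10, 10],   # idle
}
```

Here only sda satisfies the 3-of-4 requirement, and nothing is
reported outside the configured hours.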
The following two rules are evaluated once every 10 minutes:
delta = 10 min;
If either the / or the /usr filesystem is more than 95% full,
display an alarm popup, but not if it has already been displayed
during the last 4 hours:
filesys.free #'/dev/root' /
filesys.capacity #'/dev/root' < 0.05
-> alarm 4 hour "root filesystem (almost) full";
filesys.free #'/dev/usr' /
filesys.capacity #'/dev/usr' < 0.05
-> alarm 4 hour "/usr filesystem (almost) full";
The following rule requires a machine that supports the lmsensors
metrics. If the machine environment temperature rises more than
2 degrees over a 10-minute interval, display an alarm and write
an entry in the system log:
lmsensors.coretemp_isa.temp1 @0 - lmsensors.coretemp_isa.temp1 @1 > 2
-> alarm "temperature rising fast" &
syslog "machine room temperature rise alarm";
The following rule may be useful when investigating performance
problems with an Oracle database:
// back to 30sec evaluations
delta = 30 sec;
sid = "ptg1"; // $ORACLE_SID setting
lid = "223"; // latch ID from v$latch
lru = "#'$sid/$lid cache buffers lru chain'";
host = ":moomba.melbourne.sgi.com";
gets = "oracle.latch.gets $host $lru";
total = "oracle.latch.gets $host $lru +
oracle.latch.misses $host $lru +
oracle.latch.immisses $host $lru";
$total > 100 && $gets / $total < 0.2
-> alarm "high lru latch contention in database $sid";
The following ruleset will emit exactly one message depending on
the availability and value of the 1-minute load average:
delta = 1 minute;
ruleset
kernel.all.load #'1 minute' > 10 * hinv.ncpu ->
print "extreme load average %v"
else kernel.all.load #'1 minute' > 2 * hinv.ncpu ->
print "moderate load average %v"
unknown ->
print "load average unavailable"
otherwise ->
print "load average OK"
;
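The "exactly one message" behaviour of this ruleset can be
sketched in Python as a chain of tests in which the first match
wins. This models only the example above, not rulesets in
general. The function name load_message is hypothetical, and an
unavailable load average is represented here by None.

```python
def load_message(load_1min, ncpu):
    """Return the single message the example ruleset would print."""
    if load_1min is None:
        # Both predicates depend on the same metric, so both are unknown.
        return "load average unavailable"
    if load_1min > 10 * ncpu:
        return f"extreme load average {load_1min}"
    if load_1min > 2 * ncpu:
        return f"moderate load average {load_1min}"
    # Every predicate was known and false.
    return "load average OK"
```

For a 4-CPU machine, a load of 45.0 is reported as extreme, 9.5
as moderate, and 1.0 as OK.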
The following rule will emit a message when some filesystem is
more than 75% full and is filling at a rate that, if sustained,
would fill the filesystem to 100% in less than 30 minutes:
some_inst (
100 * filesys.used / filesys.capacity > 75 &&
filesys.used + 30min * (rate filesys.used) > filesys.capacity
) -> print "filesystem will be full within 30 mins:" " %i";
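The projection in this rule is a linear extrapolation: current
usage plus 30 minutes of growth at the current rate, compared
against capacity. A Python sketch of the arithmetic for a single
filesystem follows. The function name and the choice of Kbytes
and Kbytes per second are assumptions made for illustration.

```python
def will_fill_soon(used_kb, capacity_kb, rate_kb_per_sec,
                   horizon_sec=30 * 60):
    """True if the filesystem is over 75% full and, at the current fill
    rate, would exceed its capacity within `horizon_sec` seconds."""
    pct_full = 100 * used_kb / capacity_kb
    projected_kb = used_kb + horizon_sec * rate_kb_per_sec
    return pct_full > 75 and projected_kb > capacity_kb
```

For example, a filesystem of 1000000 Kbytes that is 80% full and
growing at 200 Kbytes per second is projected to reach 1160000
Kbytes after 30 minutes, so the rule would be true.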
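If the metric mypmda.errors counts errors, then the following
rule will emit a message if the rate of errors exceeds 1 per
second, provided the error count is less than 100. Because the
metric is a counter, the first term is rate converted, while the
instant operator selects the current counter value: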
mypmda.errors > 1 && instant mypmda.errors < 100
-> print "high error rate: %v";
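The difference between the rate of a counter and its
instantaneous value can be sketched in Python. This is an
illustration under the assumption that the rate is the change in
the counter between two consecutive samples divided by the
sample interval; the function name is hypothetical.

```python
def high_error_rate(prev_count, curr_count, interval_sec):
    """True if errors arrive faster than 1 per second while the
    current counter value is still below 100."""
    rate = (curr_count - prev_count) / interval_sec   # errors per second
    return rate > 1 and curr_count < 100
```

With a 30 second interval, a counter moving from 10 to 70 gives
a rate of 2 per second and a current value of 70, so the rule is
true. A counter moving from 90 to 190 has a higher rate, but the
current value is no longer below 100, so the rule is false.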