Monitoring
I was searching for a lightweight monitoring solution for my single FreeBSD home server.
It should be able to gather storage, CPU and memory metrics, watch file changes, check on my services and send alerts.
I took a look at monitoring stacks like Prometheus-Node_exporter-Grafana or Telegraf-InfluxDB-Chronograf-Kapacitor. To me it looked like a rabbit hole: a Grafana dashboard is pretty and does tons of things, but it can't handle the simplest of my needs. You still need Alertmanager to send mails with your Prometheus stack. An endless chain of microservices…
Those are great solutions when managing large node/service farms.
Turnkey solution: Netdata
Netdata provides a nice dashboard with real-time metrics and supervises system health. It raises a lot of alerts by default and can mail them.
I had to configure DragonFly Mail Agent (dma), a small Mail Transfer Agent with SMTP authentication over TLS/SSL.
/etc/dma/dma.conf
SMARTHOST smtp.domain.com
PORT 587
AUTHPATH /etc/dma/auth.conf
SECURETRANSFER
STARTTLS
MAILNAME eoli3n.eu.org
/etc/dma/auth.conf
mail@domain.com|smtp.domain.com:P@ssw0rd!
Let's configure mail forwarding from root to an external email address.
/etc/aliases
root: jonathan.kirszling@runbox.com
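Before plugging Netdata in, a quick way to check that dma relays mail at all (a small sketch; it assumes sendmail(8) calls are mapped to dma in /etc/mail/mailer.conf):

$ echo "dma relay test" | mail -s "dma test" root
$ tail /var/log/maillog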
I can now test Netdata mail delivery by running its test script.
$ /usr/local/libexec/netdata/plugins.d/alarm-notify.sh test
Out of the box, Netdata collects metrics every second and stores them in RAM with 2 days of retention. With htop, I noticed it using 1.7% CPU and 0.7% RAM.
Following the performance recommendations, Netdata dropped to 0.0% CPU and 0.4% RAM, with 2 weeks of metrics retention.
/usr/local/etc/netdata/netdata.conf
[global]
+ memory mode = dbengine
+ page cache size = 32
+ dbengine multihost disk space = 256
- history = 86400
+ update every = 5
+ debug log = none
+ error log = none
+ access log = none
[plugins]
freebsd = yes
[web]
respect do not track policy = yes
disconnect idle clients after seconds = 3600
bind to = 127.0.0.1
web files owner = netdata
web files group = netdata
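To apply the new settings, restart the service (assuming the rc script shipped with the FreeBSD netdata package):

$ sudo sysrc netdata_enable=YES
$ sudo service netdata restart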
Netdata seems to be clever and checks a lot of things, but I would like a more declarative solution, to check and alert on anything I need.
Declarative and lightweight solution: Monit
Monit is a small utility for managing and monitoring processes, files, directories, filesystems, programs, scripts, hosts, system metrics… It conducts automatic maintenance and repair if you ask it to. It also embeds a clean WebUI to keep an eye on all monitored services.
Simply install monit through your package manager and start writing your monitrc file.
Here is the Jinja template I wrote for my server, explained in comments. I don't even need to over-comment it, because the DSL syntax is human-readable.
set log /var/log/monit.log
# Check every 30 seconds and delay 120s at start
set daemon 30
with start delay 120
# Enable WebUI and configure it
set httpd
port 8080
use address 127.0.0.1
allow localhost
signature disable
# Recipient email for alerts
set alert
# Configure SMTP server to use to send alert.
# Monit doesn't use system sendmail command.
set mailserver port
username "" password "" using ssl
set mail-format {
from: Monit <monit@>
reply-to: noreply@
subject: $ACTION $SERVICE
message:
Date: $DATE
Service: $SERVICE
Event: $EVENT
Action: $ACTION
Description: $DESCRIPTION.
}
# Zpool health
check program zpool-status with path "/sbin/zpool status -x"
if status != 0 then alert
# Check zpool usage with a custom script
check program usage-zroot with path "/tmp/zpool_usage.sh zroot"
if status != 0 then alert
check program usage-dpool with path "/tmp/zpool_usage.sh dpool"
if status != 0 then alert
check program scrub-zroot with path "/tmp/zpool_scrub.sh zroot"
if status != 0 then alert
check program scrub-dpool with path "/tmp/zpool_scrub.sh dpool"
if status != 0 then alert
# Resources
check system localhost
if memory usage > 85% for 3 cycles then alert
if loadavg (15min) > 4 then alert
if cpu usage > 85% for 3 cycles then alert
if swap usage > 25% for 3 cycles then alert
# SSHD
check process sshd with pidfile /var/run/sshd.pid
start program = "/usr/sbin/service sshd start"
stop program = "/usr/sbin/service sshd stop"
if failed port protocol ssh then restart
if changed pid then alert
check file sshd_config path /etc/ssh/sshd_config
if changed md5 checksum then alert
check file passwd path /etc/passwd
if changed md5 checksum then alert
# Jails
check program jails with path "/usr/local/bin/bastille list -a"
if content == "Down" then alert
# Updates
check program updates with path "/usr/bin/awk '/packages to be upgraded/ {v+=$NF}END{print v;if (v>=5) exit 1}' /tmp/check-updates"
if status == 1 then alert
# Audit
check program audit with path "/usr/bin/awk '/found/ {v+=$1}END{print v;if (v>=5) exit 1}' /tmp/bastille-audit"
if status == 1 then alert
# Web
check process nginx with pidfile /usr/local/bastille/jails/nginx/root/var/run/nginx.pid
start program = "/usr/local/bin/bastille start nginx"
stop program = "/usr/local/bin/bastille stop nginx"
if failed host photos.eoli3n.eu.org port 443 protocol https content = "Photos" then alert
if failed host eoli3n.eu.org port 443 protocol https content = "… Blog …" then alert
# DNS
check process nsd with pidfile /usr/local/bastille/jails/nsd/root/var/run/nsd/nsd.pid
start program = "/usr/local/bin/bastille start nsd"
stop program = "/usr/local/bin/bastille stop nsd"
if failed host port 53 use type udp protocol dns then restart
if changed pid then alert
# Syncthing
check process syncthing with pidfile /usr/local/bastille/jails/syncthing/root/var/run/syncthing.pid
start program = "/usr/local/bin/bastille start syncthing"
stop program = "/usr/local/bin/bastille stop syncthing"
if changed pid then alert
# Backups
check directory backup-host1 path /data/zfs/backups/host1
if timestamp > 24 hour then alert
check directory backup-host2 path /data/zfs/backups/host2
if timestamp > 24 hour then alert
# Snapshots
check directory zfs-snapshots-slash path /.zfs/snapshot
if timestamp > 2 hours then alert
# Sigal
check file sigal-check path /tmp/sigal-check
if changed md5 checksum then alert
# Jekyll
check directory jekyll path /data/zfs/www/blog
if changed timestamp then alert
zpool_usage.sh
#!/bin/sh
# Print pool usage, then exit non-zero if capacity exceeds 90% so monit alerts.
echo "cap: $(zpool list -o cap -H "$1") used: $(zpool list -o alloc -H "$1") free: $(zpool list -o free -H "$1") size: $(zpool list -o size -H "$1")"
cap="$(zpool list -o cap -H "$1" | tr -d '%')"
if [ "$cap" -gt 90 ]
then
    exit 1
fi
zpool_scrub.sh
#!/bin/sh
# Exit non-zero if the last scrub of the given pool is older than 36 days.
scrub_expire="3110400" # 36 days * 24 hours * 60 minutes * 60 seconds
current_date="$(/bin/date +%s)"
# Rebuild the date of the last scrub (year, month, day) from "zpool status" output
scrub="$(/sbin/zpool status "$1" | grep scrub | awk '{print $15 $12 $13}')"
scrub_date="$(date -j -f '%Y%b%e-%H%M%S' "${scrub}-000000" +%s)"
/sbin/zpool status "$1" | grep scrub
if [ $((current_date - scrub_date)) -ge "$scrub_expire" ]
then
    exit 1
fi
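The updates and audit checks above read /tmp/check-updates and /tmp/bastille-audit. As a sketch only, cron entries along these lines would produce output that the awk patterns understand (hypothetical schedule and paths, assuming pkg(8) on the host and Bastille's pkg wrapper for the jails):

# /etc/crontab — refresh the files monit's "updates" and "audit" checks read
0   6  *  *  *  root  /usr/sbin/pkg upgrade -n > /tmp/check-updates 2>&1
30  6  *  *  *  root  /usr/local/bin/bastille pkg ALL audit > /tmp/bastille-audit 2>&1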
Then run monit and check http://localhost:8080. You will now receive a mail when a test fails!
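On FreeBSD, "run monit" boils down to something like this (assuming the monit package, whose control file lives at /usr/local/etc/monitrc):

$ sudo pkg install monit
$ sudo monit -t                  # syntax-check the monitrc before starting
$ sudo sysrc monit_enable=YES
$ sudo service monit start
$ sudo monit status              # per-service status from the CLI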
Web server monitoring
The next step is to monitor my web server. Netdata provides a way to get real-time stats from /nginx_status.
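For reference, /nginx_status is just nginx's stub_status page; a minimal sketch of the location block, assuming nginx was built with the stub_status module:

# inside the server block of the nginx jail
location /nginx_status {
    stub_status on;     # expose basic connection/request counters
    allow 127.0.0.1;    # only the local collector may read it
    deny all;
}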
If you don't use Netdata, Monitorix could be a good alternative.
Both of those are real-time; I prefer a solution with history, one that parses access.log to produce some graphs.
GoAccess can parse logs and do it in real time through its own WebSocket server, developed by the same author. The tool is impressive: it can be used from the CLI or generate an HTML report!
To be able to consult the HTML report, I chose to generate it statically every day with a cron job and spread it with Syncthing.
- name: generate reports everyday at midnight
  cron:
    name: goaccess report
    job: '/usr/local/bin/goaccess /usr/local/bastille/jails/nginx/root/var/log/nginx/access.log --log-format=COMBINED -o /data/zfs/sync/docs/reports/nginx-goaccess.html'
    hour: '0'
    minute: '0'
    user: root
From any client that has the file synced:
$ firefox ~/share/docs/reports/nginx-goaccess.html
The report is responsive, so I can also consult it on my smartphone.
To finish, let's add a monit service to check that the report is indeed generated every day.
# GoAccess
check file goaccess path /data/zfs/sync/docs/reports/nginx-goaccess.html
if timestamp > 24 hours then alert
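After appending this check, reload monit so it picks it up:

$ sudo monit -t && sudo monit reload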
The loop is closed.