Linux system and hardware monitoring made efficient
Whether you're a home user or a system/network administrator at a large site, monitoring your system helps you in ways you may not know yet. For example, suppose you have important work-related documents on your laptop and one fine day the hard drive decides to die on you without even saying goodbye. Since most users don't make backups, you'll have to call your boss and tell him the latest financial reports are gone. Not nice. But if you had used disk monitoring and reporting software that is started regularly (at boot or with cron), like smartd, it would have told you when your drive(s) started to become weary. Between us, though, a hard drive may decide to go belly up without warning, so back up your data.
Our article will deal with everything related to system monitoring, whether it's network, disk or temperature. This subject could easily fill a book, but we will try to give you only the most important information in order to get you started or, depending on experience, to have all the info in one place. You are expected to know your hardware and have basic sysadmin skills, but regardless of where you're coming from, you'll find something useful here. If you still have some questions after reading this article, please try our new LinuxCareer Forum.
2. Temperature monitoring
2.1. Installing the tools
Some "install-everything" distributions may have the package needed for you to monitor the system temperature already there. On other systems, you may need to install it. On Debian or a derivative you can simply do
# aptitude install lm-sensors
On OpenSUSE systems the package is named simply "sensors", while on Fedora you can find it under the name lm_sensors. You can use the search function of your package manager to find sensors, since most distributions offer it.
Now, as long as you have relatively modern hardware, you will probably have temperature monitoring capability. If you use a desktop distribution, hardware monitoring support will already be enabled in the kernel. If not, or if you roll your own kernels, make sure you go to the Device Drivers => Hardware Monitoring section of the kernel configuration and enable what's needed (mainly CPU and chipset) for your system.
2.2. Using the tools
After you're sure you have hardware and kernel support, run sensors-detect as root once before using sensors:
# sensors-detect
[You will get a few dialogs about hardware detection]
$ sensors
[Here's how it looks on my system:]
Adapter: PCI adapter
Core0 Temp: +32.0°C
Core0 Temp: +33.0°C
Core1 Temp: +29.0°C
Core1 Temp: +25.0°C
Adapter: PCI adapter
temp1: +58.0°C (high = +100.0°C, crit = +120.0°C)
Your BIOS might have (most do) a temperature failsafe option: if the temperature reaches a certain threshold, the system will shut down in order to prevent damage to the hardware. And while the sensors command might not seem very useful on a regular desktop, on server machines located maybe hundreds of kilometers away such a tool can make all the difference in the world. If you're the administrator of such systems, we recommend you write a short script that will mail you, hourly for example, reports and maybe statistics about system temperature.
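A minimal sketch of such a report script follows. It assumes that the readings in the sensors output look like the sample above (a label, a colon, then a value such as +32.0°C), that a working mail command is installed, and that admin@example.com stands in for your own address. The function name max_temp and the "report" argument are made up for this example.

```shell
#!/bin/sh
# Sketch: mail the hottest temperature found in `sensors` output.
# Assumes readings look like "Core0 Temp: +32.0°C", as in the sample above.

# max_temp: read sensors-style text on stdin, print the highest reading.
max_temp() {
    awk -F: '
        NF >= 2 {
            v = $2
            sub(/^[ \t]*[+]?/, "", v)   # drop leading blanks and the plus sign
            if (v ~ /^[0-9]/) {
                t = v + 0               # awk converts only the leading number
                if (!seen || t > max) { max = t; seen = 1 }
            }
        }
        END { if (seen) printf "%.1f\n", max }
    '
}

# The live part only runs when the script is called with "report",
# so the function above can be tried on saved output first.
if [ "${1:-}" = "report" ]; then
    temp=$(sensors | max_temp)
    echo "Highest temperature on $(hostname): ${temp} C" |
        mail -s "Temperature report" admin@example.com   # placeholder address
fi
```

Only the first value after the colon is read, so the high and crit limits printed on the same line are ignored. To get an hourly mail, one could call the script with the report argument from a cron entry that runs at minute 0 of every hour.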
3. Disk and I/O
In this part we will refer to hardware status monitoring first, then go to the I/O section which will deal with detection of bottlenecks, reads/writes and the like. Let's start with how to get disk health reports from your hard drives.
S.M.A.R.T., which stands for Self-Monitoring, Analysis and Reporting Technology, is a capability offered by modern hard drives that lets the administrator efficiently monitor disk health. The package to install is usually named smartmontools. It includes a daemon, smartd, which is started by an init.d script and regularly writes its findings to syslog; you configure it by editing /etc/smartd.conf, where you list the disks to be monitored and when to check them. This suite of S.M.A.R.T. tools works on Linux, the BSDs, Solaris, Darwin and even OS/2. Distributions offer graphical front ends to smartctl, the main application to use when you want to see how your drives are doing, but we will focus on the command line utility. Run it with -a (all info) and a device such as /dev/sda, for example, to get a detailed report on the status of the first drive installed on the system. Here's what I get:
# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.0-1-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue Serial ATA
Device Model: WDC WD5000AAKS-00WWPA0
Serial Number: WD-WCAYU6160626
LU WWN Device Id: 5 0014ee 158641699
Firmware Version: 01.03B01
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Oct 19 19:01:08 2011 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 138 138 021 Pre-fail Always - 4083
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 369
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4186
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 366
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 21
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 347
194 Temperature_Celsius 0x0022 105 098 000 Old_age Always - 38
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
What we can get from this output is, basically, that no errors are reported and that all values are within normal margins: every VALUE is above its THRESH, and the raw counts of reallocated and pending sectors are zero. When it comes to temperature, if you have a laptop and you see abnormally high values, consider cleaning the insides of your machine for better air flow. The platters may get deformed because of excessive heat and you certainly don't want that. If you use a desktop machine, you can get a hard drive cooler cheaply. Also, if your BIOS supports S.M.A.R.T., it will warn you during POST if the drive is about to fail.
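If you would rather not read the whole table every time, the check described above can be sketched as a small filter. It assumes the ten-column attribute layout shown in the report above; the function name smart_warnings and the FAILING/WATCH labels are invented for this example, and the choice of attributes 5 and 197 as the ones to watch is ours, not something smartctl prescribes.

```shell
# smart_warnings: read `smartctl -a` output on stdin and print one line
# for every attribute that looks worrying.
smart_warnings() {
    awk '
        # Attribute rows: numeric ID, name, hex flag, then the value columns.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 {
            value = $4 + 0; thresh = $6 + 0; raw = $10 + 0
            if (value <= thresh)
                # Normalized value at or below the threshold: attribute failed.
                print "FAILING " $2 " value=" value " thresh=" thresh
            else if (($1 == 5 || $1 == 197) && raw > 0)
                # Reallocated or pending sectors are worth keeping an eye on.
                print "WATCH " $2 " raw=" raw
        }
    '
}
```

As root you would use it like this: smartctl -a /dev/sda | smart_warnings. For the drive shown above it prints nothing, which is the result you want.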
smartctl offers a suite of tests one can perform: you can select what test you want to run with the -t flag:
# smartctl -t long /dev/sda
Depending on the size of the disk and the test you chose, this operation can take quite some time. Some people recommend running tests when the system does not have any significant disk activity, others even recommend using a live CD. Of course this is common-sense advice, but in the end it all depends on the situation. Please refer to the smartctl manual page for more useful command-line flags.
If you are working with computers that do lots of read/write operations, like a busy database server, for instance, you will need to check disk activity. Or you may want to test the performance your disk(s) offer you, regardless of the purpose of the computer. For the first task we will use iostat, for the second one we'll have a look at bonnie++. These are just two of the applications one can use, but they're popular and do their job quite well, so I felt no need to look elsewhere.
If you don't find iostat on your system, your distribution might have it included in the sysstat package, which offers lots of tools for the Linux administrator, and we'll talk about them a little later. You can run iostat with no arguments, which will give you something like this:
Linux 3.0.0-1-amd64 (debiand1) 10/19/2011 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.14 0.00 3.90 1.21 0.00 89.75
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 18.04 238.91 118.35 26616418 13185205
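If you want iostat to run continuously, give it an interval in seconds and, optionally, a count as plain arguments. The -d flag used here limits the output to the device utilization report: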
$ iostat -d 1 10
This command will run iostat 10 times at a one second interval. Read the manual page for the rest of the options. It will be worth it, you'll see. After looking at the flags available, one common iostat command may be like
$ iostat -d 1 -x -h
Here -x stands for extended statistics and -h for human-readable output.
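When iostat runs on an interval, it is easy to miss the one line that matters. Here is a small sketch of a filter that keeps only the busy devices. It assumes the basic device report shown earlier, where tps is the second column (the extended -x report has different columns), and a locale that prints a dot as the decimal separator. The function name busy_devices is made up for this example.

```shell
# busy_devices LIMIT: read `iostat -d` output on stdin and print
# "device tps" for every device whose transfers per second exceed LIMIT.
busy_devices() {
    awk -v limit="$1" '
        $1 ~ /^Device/ { in_table = 1; next }   # header row of a device report
        NF == 0        { in_table = 0; next }   # a blank line ends the table
        in_table && ($2 + 0) > (limit + 0) { print $1, $2 }
    '
}
```

For example, iostat -d 1 10 | busy_devices 100 would print a line each time a device goes above 100 transfers per second during the ten samples.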
bonnie++'s name (the incremented part) comes from its heritage, the classic bonnie benchmarking program. It supports lots of hard disk and filesystem tests that stress the machine by writing/reading lots of files. It can be found on most Linux distributions under exactly that name: bonnie++. Now let's see how to use it.
bonnie++ usually gets installed in /usr/sbin, which means that if you are logged in as a normal user (and we recommend it) you will have to type the whole path to start it. Here's some sample output:
Writing a byte at a time...done
Reading a byte at a time...done
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
debiand2 4G 298 97 61516 13 30514 7 1245 97 84190 10 169.8 2
Latency 39856us 1080ms 329ms 27016us 46329us 406ms
Version 1.96 ------Sequential Create------ --------Random Create--------
debiand2 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 14076 34 +++++ +++ 30419 63 26048 59 +++++ +++ 28528 60
Latency 8213us 893us 3036us 298us 2940us 4299us
Please bear in mind that running bonnie++ will stress your machine, so it's a good idea to do this when the system isn't as busy as usual. You can choose the output format (CSV, text, HTML), the destination directory or the file size. Again, read the manual, because the results of these programs depend on the underlying hardware and its usage. Only you know best what you want to get from bonnie++.
4. Network monitoring
Before we start, you should know that we will not deal with network monitoring from a security standpoint, but from a performance and troubleshooting standpoint, although the tools are sometimes the same (wireshark, iptraf, etc.). When you're getting a file at 10 kbps from the NFS server in the other building, you might think about checking your network for bottlenecks. This is a large subject, since it depends on a plethora of factors, like hardware, cables, topology and so on. We will approach the matter in a unified way, meaning you will be shown how to install and how to use the tools, instead of classifying them and getting you all confused with unnecessary theory. We won't include every tool ever written for Linux network monitoring, just what is considered important.
Before we start talking about complex tools, let's start with the simple ones. Here, the "trouble" part of troubleshooting refers to network connectivity problems. Other tools, as you will see, lean more towards attack prevention. Again, the subject of network security alone has spawned many tomes, so this will be as short as it can be.
These simple tools are ping, traceroute, ifconfig and friends. They are usually part of the inetutils or net-tools package (may vary depending on the distribution) and are very probably already installed on your system. Also dnsutils is a package worth installing, as it contains popular applications like dig or nslookup. If you don't already know what these commands do, we recommend you do some reading as they are essential to any Linux user, regardless of the purpose of the computer (s)he uses.
No chapter in any network troubleshooting/monitoring guide would be complete without a part on tcpdump. It is a pretty complex and useful network monitoring tool, whether you're on a small LAN or on a big corporate network. What tcpdump does, basically, is packet monitoring, also known as packet sniffing. You will need root privileges in order to run it, because tcpdump needs to put the physical interface into promiscuous mode, which isn't the default running mode of an Ethernet card. Promiscuous mode means that the NIC will accept all traffic it sees on the network, rather than only the traffic intended for it. If you run tcpdump on your machine without any flags, you'll see something like this:
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
20:59:19.157588 IP 192.168.0.105.who > 192.168.0.255.who: UDP, length 132
20:59:19.158064 IP 192.168.0.103.56993 > 192.168.0.1.domain: 65403+ PTR?
20:59:19.251381 IP 192.168.0.1.domain > 192.168.0.103.56993: 65403 NXDomain*
20:59:19.251472 IP 192.168.0.103.47693 > 192.168.0.1.domain: 17586+ PTR?
20:59:19.451383 IP 192.168.0.1.domain > 192.168.0.103.47693: 17586 NXDomain
* 0/1/0 (102)
20:59:19.451479 IP 192.168.0.103.36548 > 192.168.0.1.domain: 5894+ PTR?
20:59:19.651351 IP 192.168.0.1.domain > 192.168.0.103.36548: 5894 NXDomain*
20:59:19.651525 IP 192.168.0.103.60568 > 192.168.0.1.domain: 49875+ PTR?
20:59:19.851389 IP 192.168.0.1.domain > 192.168.0.103.60568: 49875 NXDomain*
20:59:24.163827 ARP, Request who-has 192.168.0.1 tell 192.168.0.103, length 28
20:59:24.164036 ARP, Reply 192.168.0.1 is-at 00:73:44:66:98:32 (oui Unknown), length 46
20:59:27.633003 IP6 fe80::21d:7dff:fee8:8d66.mdns > ff02::fb.mdns: 0 [2q] SRV (QM)?
debiand1._udisks-ssh._tcp.local. SRV (QM)? debiand1 [00:1d:7d:e8:8d:66].
_workstation._tcp.local. (97)
20:59:27.633152 IP 192.168.0.103.47153 > 192.168.0.1.domain:
8064+ PTR? b.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.f.f.ip6.arpa. (90)
20:59:27.633534 IP6 fe80::21d:7dff:fee8:8d66.mdns > ff02::fb.mdns: 0*- [0q] 3/0/0
(Cache flush) SRV debiand1.local.:9 0 0, (Cache flush) AAAA fe80::21d:7dff:fee8:8d66,
(Cache flush)SRV debiand1.local.:22 0 0 (162)
20:59:27.731371 IP 192.168.0.1.domain > 192.168.0.103.47153: 8064 NXDomain 0/1/0 (160)
20:59:27.731478 IP 192.168.0.103.46764 > 192.168.0.1.domain: 55230+ PTR?
20:59:27.931334 IP 192.168.0.1.domain > 192.168.0.103.46764: 55230 NXDomain 0/1/0 (160)
20:59:29.402943 IP 192.168.0.105.mdns > 224.0.0.251.mdns: 0 [2q] SRV (QM)?
debiand1._udisks-ssh._tcp.local. SRV (QM)? debiand1 [00:1d:7d:e8:8d:66]._workstation.
20:59:29.403068 IP 192.168.0.103.33129 > 192.168.0.1.domain: 27602+ PTR? 251.0.0.224.
This is taken from an Internet-connected computer without much network activity, but on a world-facing HTTP server, for example, you will see traffic flowing faster than you can read it. Now, using tcpdump as shown above is useful, but it barely scratches the surface of the application's true capabilities. We will not try to replace tcpdump's well-written manual page; we'll leave reading it to you. But before we go on, we recommend you learn some basic networking concepts in order to make sense of tcpdump's output, like TCP/UDP, payload, packet, header and so on.
One cool feature of tcpdump is the ability to practically capture web pages, done with -A, which prints each packet in ASCII. Try starting tcpdump like
# tcpdump -vv -A
and go to a webpage. Then come back to the terminal window where tcpdump is running. You'll see many interesting things about that website, like what OS the web server is running or what PHP version was used to create the page. Use -i to specify the interface to listen on (like eth0, eth1, and so on) or -p to keep the NIC out of promiscuous mode, which is useful in some situations. You can save the capture to a file with -w $file if you need to check on it later (remember that the file will contain raw packets; read it back with tcpdump -r $file). So an example of tcpdump usage based on what you read above would be
# tcpdump -vv -A -i eth0 -w outputfile
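On a busy interface the text output scrolls by too fast to be useful, so a quick summary can help. The following is a rough sketch that counts IPv4 packets per source address. It assumes the default one-line format shown earlier (timestamp, IP, source.port > destination.port:), and the function name top_talkers is made up for this example.

```shell
# top_talkers: read tcpdump text output on stdin and print
# "count source-address" lines, busiest source first.
top_talkers() {
    awk '
        $2 == "IP" && $4 == ">" {
            src = $3
            # Strip the trailing ".port" (or ".service"). Caveat: packets
            # without a port, such as ICMP, lose their last octet here.
            sub(/\.[^.]*$/, "", src)
            count[src]++
        }
        END { for (s in count) print count[s], s }
    ' | sort -k1,1nr -k2,2
}
```

As root, something like tcpdump -n -i eth0 -c 1000 | top_talkers captures a thousand packets and then prints the summary; -n turns off name resolution and -c sets the packet count.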
We must remind you that this tool and others, like nmap, snort or wireshark, can be useful for monitoring your network for rogue applications and users, but they can also be useful to rogue users. Please don't use such tools for malicious purposes.
If you need a cooler interface to a sniffing/analyzing program, you might try iptraf (CLI) or wireshark (GTK). We will not discuss them in more detail, because the functionality they offer is similar to tcpdump. We recommend tcpdump, though, because it's almost certain you'll find it installed regardless of distribution, and it will give you the chance to learn.
netstat is another useful tool for inspecting live remote and local connections, and it prints its output in a more organized, table-like manner. It is usually part of the net-tools package, which most distributions offer. If you start netstat without arguments, it will print a list of open sockets and then exit. But since it's a versatile tool, you can control what to see depending on what you need. First of all, -c will help you if you need continuous output, similar to tcpdump. From here on, every aspect of the Linux networking subsystem can be included in netstat's output: routes with -r, interfaces with -i, protocols (--protocol=$family for certain choices, like unix, inet, ipx...), -l if you want only listening sockets or -e for extended info. The columns displayed include active connections, receive queue, send queue, local and foreign addresses, state, user, PID/name, socket type, socket state or path. These are only the most interesting pieces of information netstat displays, but not the only ones. As usual, refer to the manual page.
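A common everyday question is simply "which TCP ports is this machine listening on?". The sketch below answers it by filtering netstat output. It assumes the usual net-tools table layout, with the local address in the fourth column and the state in the last one, and it combines -l with -t (TCP only) and -n (numeric addresses). The function name listening_ports is made up for this example.

```shell
# listening_ports: read `netstat -tln` output on stdin and print the
# unique local TCP port numbers in LISTEN state, sorted numerically.
listening_ports() {
    awk '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            n = split($4, parts, ":")   # the port is the last colon-separated piece
            print parts[n]
        }
    ' | sort -n -u
}
```

Usage would be netstat -tln | listening_ports. Splitting on the last colon means IPv6 entries such as :::22 are handled the same way as 0.0.0.0:22.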
The last utility we'll talk about in the network section is nmap. Its name comes from Network Mapper and it's useful as a network/port scanner, invaluable for network audits. It can be used on remote hosts as well as on local ones. If you want to see which hosts are alive on a class C network, you will simply type
$ nmap 192.168.0.0/24
and it will return something like
Starting Nmap 5.21 ( http://nmap.org ) at 2011-10-19 22:07 EEST
Nmap scan report for 192.168.0.1
Host is up (0.0065s latency).
Not shown: 998 closed ports
PORT STATE SERVICE
23/tcp open telnet
80/tcp open http
Nmap scan report for 192.168.0.102
Host is up (0.00046s latency).
Not shown: 999 closed ports
PORT STATE SERVICE
22/tcp open ssh
Nmap scan report for 192.168.0.103
Host is up (0.00049s latency).
Not shown: 999 closed ports
PORT STATE SERVICE
22/tcp open ssh
What we can learn from this short example: nmap supports CIDR notation for scanning entire (sub)networks, it's fast, and by default it displays the IP address and any open ports of every host. If we had wanted to scan just a portion of the network, say IPs from 20 to 30, we would have written
$ nmap 192.168.0.20-30
This is the simplest possible use of nmap. With -A it can also attempt operating system detection, service version detection, script scanning and traceroute, and it can use different scanning techniques, like UDP, TCP SYN or ACK. It can also try to get past firewalls or IDS, do MAC spoofing and all kinds of neat tricks. There are lots of things this tool can do, and all of them are documented in the manual page. Please remember that some (most) administrators don't like it very much when someone is scanning their network, so don't get yourself in trouble. The nmap developers have put up a host, scanme.nmap.org, with the sole purpose of testing various options. Let's try to find what OS it's running in a verbose manner (for advanced options you'll need root):
# nmap -A -v scanme.nmap.org
NSE: Script Scanning completed.
Nmap scan report for scanme.nmap.org (126.96.36.199)
Host is up (0.21s latency).
Not shown: 995 closed ports
PORT STATE SERVICE VERSION
22/tcp open ssh OpenSSH 5.3p1 Debian 3ubuntu7 (protocol 2.0)
| ssh-hostkey: 1024 8d:60:f1:7c:ca:b7:3d:0a:d6:67:54:9d:69:d9:b9:dd (DSA)
|_2048 79:f8:09:ac:d4:e2:32:42:10:49:d3:bd:20:82:85:ec (RSA)
80/tcp open http Apache httpd 2.2.14 ((Ubuntu))
|_html-title: Go ahead and ScanMe!
135/tcp filtered msrpc
139/tcp filtered netbios-ssn
445/tcp filtered microsoft-ds
OS fingerprint not ideal because: Host distance (14 network hops) is greater than five
No OS matches for host
Uptime guess: 19.574 days (since Fri Sep 30 08:34:53 2011)
Network Distance: 14 hops
TCP Sequence Prediction: Difficulty=205 (Good luck!)
IP ID Sequence Generation: All zeros
Service Info: OS: Linux
[traceroute output suppressed]
We recommend you also take a look at netcat, snort or aircrack-ng. As we said, our list is by no means exhaustive.
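Reports like the two above get long once a network has more than a handful of hosts. As a rough sketch, the filter below condenses nmap's normal output to one line per open port. It assumes the report layout shown in this article (a "Nmap scan report for" line per host, followed by port/protocol, state and service columns); the function name open_ports is made up for this example.

```shell
# open_ports: read normal nmap output on stdin and print
# "host port/proto service" for every open port.
open_ports() {
    awk '
        # Remember which host the following port lines belong to.
        /^Nmap scan report for / { host = $5; next }
        # Port lines look like "22/tcp open ssh".
        $1 ~ /^[0-9]+\/(tcp|udp)$/ && $2 == "open" { print host, $1, $3 }
    '
}
```

Usage would be nmap 192.168.0.20-30 | open_ports. Filtered and closed ports are left out on purpose, since the goal here is only a quick inventory of what is reachable.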
5. System monitoring
Let's say you see your system starting to have intense HDD activity and you're only playing Nethack on it. You'll probably want to see what's happening. Or maybe you installed a new web server and you want to see how well it fares. This part is for you. Just like in the networking section, there are lots of tools, graphical or CLI, that will help you keep in touch with the state of the machines you're administering. We will not talk about the graphical tools, like gnome-system-monitor, because X installed on a server, where these tools are often used, doesn't really make sense.
The first system monitoring utility is a personal favorite and a small utility used by sysadmins around the world. It's called 'top'.
On Debian systems, top can be found in the procps package. It's usually already installed on your system. It's a process viewer (there is also htop, a more eye-pleasing variant) and it gives you all the information you need when you want to see what's running on your system: process, PID, user, state, time, CPU usage and so on. I usually start top with -d 1, which makes it refresh every second (running top without options sets the delay to three seconds). Once top is started, pressing certain keys will help you order the data in various ways: pressing 1 will show the usage of each CPU, provided you use an SMP machine and kernel, P sorts the listed processes by CPU usage, M by memory usage and so on. If you want top to refresh a specific number of times and then exit, use -n $number. The manpage will give you access to all the options, of course.
While top helps you monitor the memory usage of the system, there are other applications specifically written for this purpose. Two of those are free and vmstat (virtual memory statistics). We usually use free only with the -m flag (megabytes), and its output looks like this:
total used free shared buffers cached
Mem: 2012 1913 98 0 9 679
-/+ buffers/cache: 1224 787
Swap: 2440 256 2184
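The "used" figure on the Mem: line includes buffers and cache, which the kernel gives back when applications need the memory, so the number that matters is the one on the -/+ buffers/cache line. As a small sketch, the same figure can be computed as a percentage from the Mem: line alone. This assumes the column layout shown above (total, used, free, shared, buffers, cached); newer versions of free print different columns, so the field positions would need adjusting there. The function name mem_used_percent is made up for this example.

```shell
# mem_used_percent: read `free -m` output (layout as shown above) on stdin
# and print the percentage of RAM used by applications, with buffers and
# cache left out. The result is truncated to a whole number.
mem_used_percent() {
    awk '
        $1 == "Mem:" {
            total = $2; used = $3; buffers = $6; cached = $7
            printf "%d\n", (used - buffers - cached) * 100 / total
        }
    '
}
```

Run as free -m | mem_used_percent. For the sample above this gives 60, even though the raw "used" column alone would suggest the machine is almost out of memory.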
vmstat output is more complete, as it will also show you I/O and CPU statistics, among others. Both free and vmstat are also part of the procps package, at least on Debian systems. But when it comes to process monitoring, the most used tool is ps, part of the procps package as well. It can be complemented with pstree, part of psmisc, which shows all the processes in a tree-like structure. Some of the most used ps options are the BSD-style a (all processes with a tty), x (complementary to a: also processes without a tty), u (user-oriented format) and f (forest-like output). BSD-style options are written without a leading dash, as in ps aux; with a dash, -u and -f mean something else (select by user and full-format listing, respectively). Here the use of the man page is mandatory, because ps is a tool you will use often.
Other system monitoring tools include uptime (the name is kinda self explanatory), who (for a listing of the logged-in users), lsof (list open files) or sar, part of the sysstat package, for listing activity counters.
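Of these, uptime is the easiest to script around, because the load averages sit at the end of its single line of output. The sketch below pulls out the 1-minute load and compares it with a limit. It assumes the usual Linux wording ("load average: 0.08, 0.03, 0.05") and a dot as the decimal separator; the function names load1 and load_above are made up for this example.

```shell
# load1: read `uptime` output on stdin and print the 1-minute load average.
load1() {
    sed -n 's/.*load averages\{0,1\}: *\([0-9.]*\).*/\1/p'
}

# load_above LIMIT: succeed (exit status 0) when the 1-minute load read
# from stdin is greater than LIMIT.
load_above() {
    current=$(load1)
    awk -v cur="$current" -v limit="$1" 'BEGIN { exit !(cur + 0 > limit + 0) }'
}
```

A typical use would be: if uptime | load_above 4; then echo "load is high"; fi. What counts as "high" depends on the number of CPUs in the machine, so pick the limit accordingly.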
As said before, the list of utilities presented here is by no means exhaustive. Our intention was to put together an article that explains major monitoring tools for everyday use. This will not replace reading and working with real-life systems for a complete understanding of the matter.