Michael W. Lucas' book: SSH Mastery: OpenSSH, PuTTY, Tunnels and Keys. It is good enough that I avoided buying the book, even when it was released with funding support for my favourite Open Source project (OpenBSD with OpenSSH).

I was under some insane self-delusion that I didn't want to be bound by the book's research, so that I could ethically 'document' my own stumbles into SSH to share freely with others. Fortunately, the better solution, for users and HR administrators of System Administrators, is to just buy this book.

After receiving a blogger review copy of Michael's book, the 1st thing I did was to hit the corporate buy button to order a legitimate print/e-book copy for my cohort, a fellow sysadmin. Why?

What value is there in this book:

  • The Guru in the room
  • Saving Money
  • Saving Time

The Guru in the room

We don't know what we don't know.

The fastest path of learning I've enjoyed has been as the new kid amongst zen masters who danced on their keyboards. Unfortunately the masters moved on and we graduate a little higher up the ladder until we've reached the peak of our incompetence.

The book is a good reference source, with fine examples for many features, and like the zen masters, some of the answers are in the 'debug' sections: how to determine whether what you think you should get is how SSH is seeing it.

Online articles are often short, make assumptions about how OpenSSH/PuTTY works, and 'script' a lot of commands that require version X.Y of this and M.N of that. Rarely are there supportive notes on how to diagnose the instructions, or how the related system has responded.

SSH Mastery explores, explains, provides samples, provides debugging techniques so we can explore, understand, type-in the SSH commands to see all those features at work. Not the guru in the room, but the next best thing, someone knowledgeable to go to.

Saving me money?

4 years ago I was locking down a machine in the USA (from Australia.) I'd spent a month configuring some complicated Mail Processing system on that box, and was almost ready for the 'live' output. The only thing left to do was formalise the lock down of the machine.

2 minutes later, I'd locked myself out with a typo in my ssh server configuration. After ripping my hair out, I found the answer (documented in Chapter 3) and published it online and @serverfault.com

  • Chapter 3: The OpenSSH Server

Leading off the book with the server (after the general introduction to the topic, data encryption) seems at first odd.

I was hoping for a dive-in to all the magical command-line tricks to flex my authoritarian prowess. But for a system administrator's book, it cannot be overstated how critical it is to configure your server correctly, and to know how to validate that the server is working correctly: debug
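For the lock-out story above, one common safeguard is to have sshd check its own configuration before you reload it. This is only a minimal sketch and not necessarily the answer referred to above: 'sshd -t' (test mode) and '-f' (configuration file) are real sshd options, while the function name check_sshd_config, the SSHD override variable and the /usr/sbin/sshd path are assumptions made for illustration.

```shell
# Check an sshd configuration file before reloading the daemon.
# Assumption: sshd lives at /usr/sbin/sshd; override with SSHD=/path/to/sshd.
check_sshd_config() {
    config=${1:-/etc/ssh/sshd_config}
    # -t: test mode, only checks the validity of the config file and keys
    # -f: the configuration file to check
    if "${SSHD:-/usr/sbin/sshd}" -t -f "$config"; then
        echo "config OK: safe to reload"
    else
        echo "config has errors: do NOT reload" >&2
        return 1
    fi
}
```

Run it as root before any reload, and keep your existing session open until a fresh login has been confirmed from a second terminal.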

Saving me time.

There's a lot out there about OpenSSH that we all Bing/Google when the need arises. The one big item that I'm always referring to is tunneling.

For 5 years I worked on machines hidden behind layers of locked away networks requiring multiple hops (log onto one machine, and log from there onto another machine, then to log onto the machine I actually need to work on.)

  • My Machine connects to
  • Bastion machine to connect to
  • Destination machine

Where the above "Bastion" machine may be 2 or more intermediate machines.
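On OpenSSH 7.3 or later (newer than the book), the hop-by-hop logins can be collapsed with 'ssh -J', which takes a comma-separated list of jump hosts. The sketch below only builds that list; jump_chain and all the host names are hypothetical, and older releases would need ProxyCommand instead.

```shell
# Join bastion hosts into the comma-separated list that "ssh -J" expects.
jump_chain() {
    chain=""
    for hop in "$@"; do
        # Add a comma only when the chain already holds a host
        chain="${chain:+$chain,}$hop"
    done
    printf '%s\n' "$chain"
}

# Hypothetical host names, for illustration only:
# ssh -J "$(jump_chain bastion1.example.com bastion2.example.com)" destination.example.com
```

Keeping the chain in one place means adding a third intermediate machine is one more argument rather than another manual login.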

Saving me.

But there's more to SSH than system administration, and there are often tight spots where SSH can actually save you.

I was in Tonga over the Christmas break when I needed to do some funds transfers on some accounts in Australia, but the internet awareness/security doesn't allow any transfers from an IP Address from Tonga.

Thanks to OpenSSH, Putty and socket routing,

Refer to other reviews on the web for the utility of this title, including user reviews @amazon

Title: SSH Mastery: OpenSSH, PuTTY, Tunnels and Keys

Author: Michael W. Lucas

Publisher: Tilted Windmill Press (January 18, 2012)

It bothered me enough that I need to record it, and hopefully the path to a solution that others will follow.

(delivery temporarily suspended: Server certificate not verified)

Lesson: Document things properly, especially if it's something interesting, more so if the technology/thing you're doing is normally not what you do, and it's already taken you a long while to get it working properly in the first place.

Mind you, the above may be a difficult task when rushed to get a system out and the only way to confirm the installation is to break it apart and start from scratch

Scenario:

We exchange e-mail with an external organisation (duh!!) with regulatory standards that require us to ensure e-mail sent to them is encrypted. We achieve this through the following:

  1. Certify that the server we're connecting to is theirs by using:
    • SSL certificates
    • smtp_tls_policy_maps and
    • fingerprinting
  2. Encrypt the traffic between the two sites using TLS

So, we follow the online Postfix TLS Support and smtpd_tls_fingerprint documentation and have it up and running with the basic configuration:

File extract: /etc/postfix/main.cf

smtp_tls_policy_maps = hash:/etc/postfix/tls_policy

File extract: /etc/postfix/tls_policy

example.com    fingerprint
    fingerprint-digest-is-here

Problem:

External Organisation used a 1 year self-signed certificate, it expires (as most eventually do) and no messages get through to them. We get the below "cryptic" message in our logs:

(delivery temporarily suspended: Server certificate not verified)

Answer:

Seems easy enough, we just need to re-do/fix our 1st step above for Certifying the connection.

  1. Get updated certificate from remote site
  2. Update the fingerprint

Load up the online documentation and follow it through.

Oooops, it doesn't work.

The logs laugh: /var/log/maillog

(delivery temporarily suspended: Server certificate not verified)

  1. The message is not sent (deferred) with the error message "Server Certificate not verified".
  2. The message is never sent, since the Server Certificate is never validated.
  3. Bypass certification and send e-mail. The short-term configuration is to not require the fingerprint to be 'certified'.

I'm sure I followed the steps correctly ... (wrong)

Solution:

Walk away from the documentation for a while, walk through it again with the presumption that you've screwed everything up so you need to take all your knowledge and check the basics (verify assumptions) as you go along.

  • digest format
  • fingerprint
  • policy file

Digest Format

[smtp_tls_fingerprint_digest]

Verification of SMTP server certificate fingerprints uses a message digest.

Don't get trapped putting together fingerprints that are invalid, or unnecessary. Find out which fingerprint digest is supported by your configuration, and use that.

postconf | grep fingerprint
lmtp_tls_fingerprint_digest = md5
smtp_tls_fingerprint_digest = md5
smtpd_tls_fingerprint_digest = md5

The above configuration output shows we're using the MD5 digest format. It should be fine, but read the documentation to see which digest may be the better choice for you.

Fingerprint

[Ref: openssl x509]

After acquiring the SSL Certificate through some 'trusted' method, generate the fingerprint for the 'trusted' certificate as follows.

openssl x509 -noout -fingerprint -md5 -in /etc/ssl/certs/example.pem
MD5 Fingerprint=fingerprint-digest-is-here

After comparing the above fingerprint-digest-is-here with what I have in the tls_policy file, it is obvious they look nothing alike.
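Eyeballing two long digests is error prone, so the comparison can be scripted. A minimal sketch, assuming the openssl output line looks like 'MD5 Fingerprint=AA:BB:...' as shown above; the helper names fingerprint_value and same_fingerprint are made up here, and ignoring upper/lower case is only a convenience for the comparison.

```shell
# Pull the digest out of openssl's "MD5 Fingerprint=AA:BB:..." output line.
fingerprint_value() {
    sed -n 's/^.*Fingerprint=//p'
}

# Compare two fingerprints, ignoring upper/lower case of the hex digits.
same_fingerprint() {
    a=$(printf '%s' "$1" | tr 'abcdef' 'ABCDEF')
    b=$(printf '%s' "$2" | tr 'abcdef' 'ABCDEF')
    [ "$a" = "$b" ]
}

# Usage (certificate path as in the example above):
# new=$(openssl x509 -noout -fingerprint -md5 -in /etc/ssl/certs/example.pem | fingerprint_value)
# same_fingerprint "$new" "value-from-tls_policy" || echo "fingerprints differ"
```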

Policy File

With the above fingerprint, and digest we can fix the TLS Policy table such as the below:

example.com    fingerprint
    fingerprint-digest-is-here

Remap the file to make sure the correct hashed version is active:

# postmap /etc/postfix/tls_policy

Restart the server and things are coool.

postfix reload

But isn't that what the Postfix documentation says you have to do?

I guess it does, but for some reason the steps I took those days weren't the correct steps. And now that I've rehashed the already hashed, I hopefully will not mis-read the documentation the next time through.

TLS and Postfix

30 June 2011

Upgrading some of our Mail Servers to support TLS (Transport Layer Security) in Postfix and, apart from learning how to do it, also learned a key maxim of programmers (readily applicable to system administrators):

DO NOT PRE-OPTIMISE

Wasted two days of my life, with increased anxiety during the install and configuration process, because I was trying to be too smart too early.

After a Duhhh moment, I went back to the very beginning of the install process, and did everything as per the known guides (without that little tweak I had preconceived), and the install worked in less than 1 hour.

My failure? I got too far ahead of myself with bright, untested ideas of how I wanted things to work, and started modifying my plans (and solidifying assumptions about how things would work) before collecting evidence that the assumptions for each stage were valid.

My idea was that the TLS roll-out on 5 different servers (all requiring SSL certificates) could all use one Certificate Authority. I'd made self-signed certificates before, so presumed/guessed at an approach for one centralised Certificate Authority. Unfortunately, instead of verifying my assumptions of how that can be done, I steam-rolled ahead ass-uming some minor modifications to the process would just work.

  1. Create Certificate Authority (CA) key
  2. Create Certificate Signing Request (CSR) for the host
  3. Create a Certificate (CRT) from the CSR, signed by my new CA key

The install failed, but gave error messages hinting at problems with the key created in my step #2, or the certificate created in step #3. After agonising through different diagnostic processes from the various error messages, it took 2 whole days to throw away the assumption that caused the error: my change in how I was generating (or using) a Certificate Authority. Arggghhhh!!!

I had been blindly looking at various avenues for why Step #2 or Step #3 were not working correctly, including trying stupid hints from random websites.

The error that Postfix was throwing up said that:

File extract: /var/log/maillog

warning: cannot get RSA private key from file /etc/ssl/private/server.key.pem: disabling TLS support
warning: TLS library problem: xxxxxx certificates routine xxxx key values mismatch xxxxx src/crypto/x509/x509_cmp.c:318:

  1. Can't read the Key
  2. There is no match between the key and certificate

OK, the key file is there, I can see it in the file system. I can open it up with openssl and verify that it is a valid key file by using:

sudo openssl rsa -noout -text -in /path-to/private/server.key.pem

I could even validate that the signed certificate is a valid certificate, likewise the Certificate Authority certificate (so far as our current understanding tells us.)

sudo openssl x509 -noout -text -in /path-to/server.crt.pem
sudo openssl x509 -noout -text -in /path-to/private/ca.crt.pem

I blissfully ignored the 2nd error message until I could resolve why my Postfix server was complaining about the Server Key. The assumption: it's probably an 'artifact', an error caused by the previous error (can't open the key). We find all sorts of "solutions" on the web, which may work on other OS's (most related to using 'openssl rsa -in server.key.pem -out server.key.rsa.pem' to make sure that the key file is not password protected?) but are not relevant for our OpenBSD install.

It was well into the third day before I found references to verifying that a certificate is created from a key.

$ sudo openssl rsa -noout -text -in /path-to/private/server.key.pem -modulus \
    | grep ^Modulus | openssl md5
$ sudo openssl x509 -noout -text -in /path-to/server.crt.pem -modulus \
    | grep ^Modulus | openssl md5

The use of "| openssl md5" just simplifies the comparison of the Modulus values, which are supposed to be the same if they are paired (i.e. the certificate was generated from the key). There's also the requirement that both "public exponent" values are equal, but the above Modulus comparison is a quick verification process.
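The same check can be wrapped up so it gives a plain yes/no. This is a minimal sketch, assuming an RSA key and a PEM certificate; the function name key_matches_cert is made up, and it compares the 'Modulus=...' lines directly rather than their md5 sums, so that a failed openssl call cannot masquerade as a match.

```shell
# Succeeds when an RSA private key and a certificate share the same modulus.
# $1: private key file, $2: certificate file (both PEM).
key_matches_cert() {
    # Bail out with status 2 if either file cannot be read by openssl
    key_mod=$(openssl rsa -noout -modulus -in "$1" 2>/dev/null) || return 2
    crt_mod=$(openssl x509 -noout -modulus -in "$2" 2>/dev/null) || return 2
    [ "$key_mod" = "$crt_mod" ]
}

# Usage (paths as in the commands above):
# key_matches_cert /path-to/private/server.key.pem /path-to/server.crt.pem \
#     && echo "paired" || echo "NOT paired"
```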

OK, I'm running the above command line on my self-signed certificate, and server key. The Modulus DO NOT MATCH.

What?? That doesn't make sense?

I wander through comparisons of all the key & certificate pairs, to find out that the Modulus for my designated CA Key, matches with the Self-Signed Certificate.

What?? That doesn't make sense?

Obviously (duhh) there must be something wrong with my signing process. We trace back our implementation steps and re-do, re-test.

  • Step #3. No that didn't work. No, don't repeat it again. Go back to
  • Step #2 then #3. No that didn't work. No, don't repeat it again. Go back to
  • Step #1 then #2, then #3. No that didn't work.

OK, something is seriously wrong!!!

The 2nd error (and a quick perusal of the source code) definitely indicates that the key file is not related to the certificate. Our Modulus investigations above show that the key/certificate pairs are not created correctly. Could my CA ideas be the cause of my install failures?

Throw that assumption away and create certificates how you've always done it.

  • Step #2 Sign the CSR using the Server Key.

Normal self-signed instructions always use the same key for the CA as well as the Server.

5 minutes later, we have Postfix TLS working as expected, and our documentation is complete. Postfix TLS without dovecot, without cyrus-sasl, woohoo, too easy.
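For what it's worth, the one-CA idea can be made to work. The sketch below is not the process used above (those exact commands aren't recorded here); it is one way, under assumptions, to sign a host's CSR with a separate CA so that the certificate still pairs with the server key. The file names, subject names, 2048-bit keys and 365 days are all hypothetical; the openssl options used (-CA, -CAkey, -CAcreateserial) are standard ones.

```shell
# Create a throw-away CA and a server certificate signed by that CA.
# $1: an existing, writable directory for the generated files.
make_ca_signed_cert() {
    dir=$1
    # 1. CA key and self-signed CA certificate
    openssl genrsa -out "$dir/ca.key.pem" 2048 || return 1
    openssl req -new -x509 -key "$dir/ca.key.pem" -subj "/CN=Example CA" \
        -days 365 -out "$dir/ca.crt.pem" || return 1
    # 2. Server key and CSR (the CSR carries the server's public key)
    openssl genrsa -out "$dir/server.key.pem" 2048 || return 1
    openssl req -new -key "$dir/server.key.pem" -subj "/CN=mail.example.com" \
        -out "$dir/server.csr.pem" || return 1
    # 3. Sign the CSR with the CA certificate and CA key
    openssl x509 -req -in "$dir/server.csr.pem" \
        -CA "$dir/ca.crt.pem" -CAkey "$dir/ca.key.pem" -CAcreateserial \
        -days 365 -out "$dir/server.crt.pem"
}
```

The resulting server.crt.pem should have the same Modulus as server.key.pem, not as the CA key, which is exactly the pairing Postfix complained about.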

Now to verify that TLS actually encrypts ?

As networks continue to grow, sometimes against our wishes, sometimes with our full support, it becomes more important to get some overview of how and what is moving across your network(s.)

In the beginning, in a land far away, we only had a few machines wired up and life was simple.

Now, most of us have too many machines with an unknown quantity of malware pounding on them (and subsequently on your network.) That's before we even get to our beloved users.

If you get blamed when things go bad on your network, it's time you started taking charge of knowing what's going across your network. Michael W. Lucas published an insightful book to help us with that: Network Flow Analysis. More importantly, for us, he chose to describe the solution using tools accessible to everyone (aka Open Source). We've finally cleaned up some internal notes for getting the software to work well in our favourite os (tm) OpenBSD.

These notes augment the installation instructions from that book. Where the human factor is important, in customisation/localisation, interpretation, we don't do any of that here.

Buy the book.

Now you're back, follow through to find out how we put it together for Netflow with flow-tools

It's saved our bacon a number of times: we know whose packets are causing congestion, what times congestion occurs, and why things occur. AND, we can print out those meaningless charts that senior dweebs nod their heads at and just love.

Michael W. Lucas has some war stories where traffic flow monitoring has helped him out, and we can attest to its daily, weekly value.

Our notes on Netflow with flow-tools

Every now and then people ask how they should partition their hard disk. This doesn't answer that question, but gives some view on how much disk space is used up on a bare system built for compiling OpenBSD from source.

Reference OpenBSD 4.9 i386, FAQ 5

The following is a summary of disk space used on a bare install built for and after compiling OpenBSD 4.9 i386. No packages installed.

path            Used    More Info
/etc            60M     Bare install, no modifications
/usr            6.0G    STABLE source extracted to src, xenocara, and compiled using ./obj, ./xobj, as well as ./rel for release files, and ./dest for pre-release files
/usr/src        851M    includes compiled kernel GENERIC.MP
/usr/ports      332M    No compiled packages, no distfiles
/usr/obj        1G
/usr/xenocara   540M
/usr/xobj       420M
$DESTDIR        2G      /usr/dest Includes cvs export for src, xenocara, and ports
$RELEASEDIR     500M    /usr/rel Includes tgz source for src, xenocara, ports
$CVSROOT        4.4G    CVS Tree scp'd from another server/workstation.
$CDBUILD        1.1G    Contains pre-build CD directory and install.iso created with mkhybrid(8) (no packages) approximately 490MB each

One of those days, when the disaster you didn't want, barges through the door, but forward planning, preparations, testing gets you through the day. Also known as, we and our gweeky friends say "Ku-oool," while the rest of the family say, "uhhh, ok, we're happy for you."

We could have had a major disaster (i.e. my day ruined, as opposed to things melting down) which was nicely averted because of (as said before):

  • forward planning
  • preparations
  • tests to verify the preparation.
  • activate on live system
  • what have we learned

The Disaster

Our PRIMARY data link provider suddenly went off the air. More of our workers are at remote sites, than are at the central office (where I'm sitting.) The WAN going down means that a lot of people are not able to do their work (or are impaired from using IT services they are normally reliant on.)

The diagram indicates the level of dependence those satellite sites have on this primary data center. Site A has a completely independent data service, so loss of the link limits a few operational issues for IT, but no loss of service to the business.

Sites B and C are independent for the majority of their business needs, but in the current situation are dependent on our Primary Data Center for shared services such as e-mail. Other than that, they can operate without the WAN link.

Sites D, E, and F can't work while the Primary Data Center is OFFLINE.

We couldn't connect to the provider's next hop link, and we definitely couldn't get any traffic, let alone BGP routing information.

All those nice tricks for verifying that your BGPD server is up and running are nice, but they don't do you any good when your 5 other sites confirm that the primary vendor's BGP Server is definitely not online

Forward Planning ?

After years of cajoling, the powers above folded and added a SECONDARY WAN service instead of the previous dependence we had on tunneling VPN through an Internet ISP connection.

Unfortunately, since there were budget constraints and the original WAN Data Link service was commissioned without regard for a secondary, we had to come up with some mechanisms for getting the SECONDARY connected.

After balancing different options with what the business operations required and our limited resources, we decided to configure the two systems as ACTIVE-STANDBY. One Link was ACTIVE (the Primary link) and the other configured as a STANDBY service. We could automate the switch, but given the reality of the infrastructure, we would meet a requirement of X hours to switch the data between the services(i.e. go from ACTIVE-STANDBY to OFF-ACTIVE)

Preparations

We gradually rolled out the secondary, backup, data link using off-the-shelf desktops as the routing/gateways. The routing, access policies were updated to include the potential for routing through the secondary link.

For some sites, and services, we load balanced traffic along both data links.

TEST

All the preparations were nice and dandy, but what would we actually have to do to make sure things were flipped from one service to the other? We needed to do a partial test on the actual network instead of our test network.

After some time, we just pushed through that downtime was required, with a full service test taking everything OFFLINE while we made routing changes and ran tests (of course we had to do it during organisation down-time, which inevitably means that IT are up at odd hours or working during everyone else's downtime/bedtime).

Going through the preparations and controlled tests forced us to look at ways to minimise operator error during the process (controlled automation in as many bits of the process as possible.)

We successfully completed the tests on a subset of the full WAN network (site B, and D with the Primary Data Center,) found some further points in the operation that we wanted to improve and went through evolving those bits of the operation.

Suffice it to say, after that test, we were confident that we could switch over from FAILED-STANDBY to FAILED-ACTIVE well within the 2 ~ 4 hour window that was part of our agreement with business.

Activating on LIVE System

Doing my bit sleeping during one of those interminable meetings where you watch paint drying on the wall, or the back of your eye-lids (depending on how lucky you are.) One of the IT team woke me up, seriously disturbing the meeting, to say that all hell has broken loose. All sites were down, the WAN Link has disappeared. People were running trying to figure what to do next.

What do I tell XYZ at Site-A?
What do I tell everyone here at main office ?
What, when, where, who ?

I walk calmly to my desk, to find that my offsider (partner in these things) wasn't at his desk.

That's odd ?

Sit myself down at the desk. OK, look through some of the charts generated by Smokeping, yup the primary link looks like it disappears about *here (pointing at the screen). The charts also show that the secondary link is humming along just fine, although latency to Site B is off the charts (200 ms, is that even possible?)

My boss sees me working and goes to get a cup of coffee.

Log onto our WAN Gateway box, and yup our BGP Server is humming along just fine, we're advertising our LAN routes through BGP but that's all I can see (as mentioned earlier, the Primary link next hop is not responding to pings so we can't get to it and there's no hope of trying to get BGP traffic from/through there).

Switching from the Primary Link to the Backup Link

ACTIVE-STANDBY to FAILED-ACTIVE

Using the shortcuts I've got, log onto 3 of the 6 remote sites through the secondary data link. Site D, E, and F. Site B is not connecting on either of its redundant active-passive gateways. Yep, BGPD is running fine on those sites, and showing advertising but no other routing information on those servers.

Run a script on each active gateway and we are now flipped over to the secondary link.

Total time to flip the link between 4 sites ? About 3 ~ 4 minutes after sitting down at the desk.

What happened to the other 3 sites?

Sites A and C haven't had the secondary links rolled out. Site A is wired but we haven't had anyone available to go down and plug things in. It's also a low priority. Site C is only a month old and just hasn't had reason for the secondary link; if the link failure is prolonged then users can work through the User VPN or we can set up a slow tunnel through the Internet.

Site B had the 200ms latency problem. My admin-buddy had to walk across to that office.

Testing the Service

Spent another 30~40 minutes going through the routing validation process, and refining the routing et. al. (yeah, you've really got to get a document together of these things, largely so you've actually gone through the exercise and have a clearer experience with what needs to be done.)

Fortunately, because we have QOS Queues on our gateways, specific for each Data Link Service, it is easy to confirm whether data is still routed through the Failed Primary Service, or if they are all going through the Active Secondary/Backup Service.

systat queue

We made some corrections to our queueing, which was showing some traffic still turning up on the FAILED link. Adjusted a few things here and there that would simplify the whole process in the future.

Switch from STANDBY-ACTIVE to ACTIVE-STANDBY

Another 30 minutes passes, and the Primary Service comes back online. Since the Primary Service provides a much much bigger Data Link than our Secondary link, we are definitely very keen to put everything back onto it.

In two minutes, we were able to re-route all remote WAN sites to talk to each other through the Primary Link (to ease some of the traffic from the Secondary link), especially since this is a very minimal part of the traffic, but lets us look at the routing issue as well as whether the service can at least stay up for more than a few seconds.

After another while, we re-route all traffic back to the Primary link. That took another two minutes (at most.)

The last switch, no-one knew about.

What have we learned

Even with the knowledge we gained from the controlled TEST, we gained a whole lot more knowledge when having to perform the same process on the WHOLE network.

We've identified a few more areas that we can better administer, automate, and are in the process of updating those.

Putting the effort down up front sure saved my bacon, more important for the business, it meant that after jumping up and down that their network connection was down, the users could sit down and get on with work (making money for the company, serving customers et. al.)

Active - Active ?

Why aren't the Data Links on Active-Active ?

Not really worth the effort at this point (not our call)

  • The Data Links are not equivalent, they have their different benefits but are not equal to make it an easy load balancing equation
  • Doable, but with a lot of 'moving parts' that will be difficult to maintain within our current resource constraints.
    • Remember that whatever knobs are tuned to get ACTIVE-ACTIVE has to be easy and quick to switch back when one of the services fail and we have ACTIVE-FAIL or FAIL-ACTIVE.

Where was my admin-buddy ?

Sometimes the call of nature is of even higher priority than your IT needs.

Summary

Smiling on the train home, 'cause I'm not working overtime tonight (you do get overtime don't you ? (smiling because we know we don't.))

Oh yeah, those six sites? They're connected using OpenBSD 4.8 redundant ACTIVE-PASSIVE gateways. Connecting to them, monitoring, managing during uptime and downtime are just a blast!!

Aka: Googling during a phone interview

This is tangentially relevant to OpenBSD, you can safely ignore it and your life will not have missed anything. Take the road less travelled.

  • Ethics and IT
  • An example Ethical Dilemma
  • How many bits in a mac address
  • In Linux, what is the default signal sent by kill
  • Of the ps output what is the label D for
  • Summary

Ethics and IT

We continue to have some interesting discussions at work about the ethics of a lot of things we get around to in IT. For example, we're the guys that are brought in by various departments and HR to assist them in forensic type stuff, which sometimes goes into trawling through people's archives on our backup tapes (email, documents, etc.)

The generalised 'ethos' statement in the workplace seems to be:

if it's legal, then you do it.

But we have too abundant a list of recent and current Global Events of totally unethical behaviour dressed up as 'legal' (as defined by the conqueror) to be enthralled by such simplistic misdirections.

An example Ethical Dilemma

Our ethical dilemma, within IT, for today was a phone interview I went through where purposeful trip-up questions were raised. Given time, some of the questions could possibly have been deduced, but why bother when you can easily Google/Bing to get your answer ?

Note: The field with a huge library of answers freely published online is IT (and fields where the IT crowd are fixated with, such as music, science fiction, and fantasy.)

The questions seem to have been good questions, in some manner, and definitely tripped me up because I didn't know, but do the questions reveal comparability of skills, or abilities to search the web?

One of my university courses, an Accounting course, had an open book final course exam (the only one I've ever been in) and this was largely so students didn't have to memorise any of the material, but if you didn't understand the material, there wasn't enough time to find answers and have them relevant to the problems in the exam.

Was this one of those problems ? Was my error in not asking / clarifying whether I could use [choice of favourite search engine]?

Hopefully you find the material educational in what it may be asking and how easy it is for IT personnel to find answers on the internet without having to memorise things. You still have to know your stuff to make use of the answers, but it is soo easy to find answers to IT things on the Internet these days.

Were these questions good IT questions ?

How many bits in a mac address

"Urgghhh, I don't know. I recall when I read them in places, that they're separated with colons, and there's something like four or more of them."

What races through my mind: "How could I figure this out with-out Googling?"

I'm telling the interviewer that I'm trying to figure out the answer

  • I knew the address numbers were in hex (0,1,2...,d,e,f) but for the life of me I couldn't remember how many pairs there were.
  • I flipped over my laptop to look for a mac address, (you know some of the devices these days have it on a sticker) Nope, one of those stupid devices that has that sticker buried inside on top of the physical device.
  • The laptop was on, so I got a command-prompt and tried to look for the mac address. No go, Windows doesn't show it if the device isn't active (Grrrr, should have used ipconfig /all and that may have had the answer, and I just knew there was a reason I should have installed a Unix thing on this device, oh wait, I did and it didn't work for what I was using this laptop for: manpage: ifconfig )
  • OK, I've got a phone and these things have MAC address for their wifi. Can't use this, 'cause I'm on the phone.

Wait, ..., What's the difference between using Google/Bing and dissecting the answer from getting an example MAC and manually calculating the # of bits ?
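For the record, the manual calculation is short once you have an example in front of you: a standard MAC address is six hex pairs, each pair is one octet, so 6 x 8 = 48 bits. A minimal sketch; the function name mac_bits and the sample address are made up for illustration.

```shell
# Count the bits in a colon-separated MAC address: each hex pair is one octet.
mac_bits() {
    # Put each pair on its own line, then count the non-empty lines
    octets=$(printf '%s\n' "$1" | tr ':' '\n' | grep -c .)
    echo $((octets * 8))
}

mac_bits 00:1a:2b:3c:4d:5e   # prints 48
```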

What does it reveal

  • Have you had enough exposure in networking, especially at the command-prompt or configuration files, where this knowledge has become ingrained.

Why don't I know this ?

  • What was it again that makes it useful to have this knowledge? I recall some of those digits represent a unique id for the device vendor, and then the rest is used by the vendor to 'create' a unique ID for each physical device.
  • Tech Trivia: Microsoft published their standard where it used the MAC address with other items to create a GUID for each Word document (wow, that's even more useful knowledge) so they can track the origin of any Word document around the globe.
  • We're the l33t of computer nerds, we are a fount of knowledge of the most trivial and irrelevant knowledge. This is just one of those that I now know, but had not come across it in any meaningful way beforehand.
  • Where have I had actual reason to record them? MAC Address ACLs for squid-cache and dhcp, but obviously wasn't taking enough interest to even remember how many digits were involved, let alone the number of bits.
  • MAC addresses show up on ARP, but I haven't bothered to worry about them unless there was some conflict requiring further investigation.

And that was only the first question!!! Things are definitely not looking up for my interview.

We're in trouble and we haven't even passed the first step.

In Linux, what is the default signal sent by kill

Urggggh, never thought of that before. I may have read it somewhere but definitely haven't used it 'without an explicit' signal to 'know' what to expect as a default behaviour.

This one is simple enough to find from the manpage: kill(1) Straight there in the 1st Paragraph of the Description.
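The answer is SIGTERM (signal 15), and it can be confirmed without the manpage. A minimal sketch: start a throw-away process, send it a bare 'kill', and look at how it ended. The function name default_kill_status is made up; the '128 + signal number' exit status is what common shells such as bash and dash report.

```shell
# Start a throw-away background process, send it a plain "kill" (no signal
# given), and print the status that "wait" reports for it.
default_kill_status() {
    sleep 60 >/dev/null 2>&1 &
    pid=$!
    kill "$pid"
    status=0
    wait "$pid" || status=$?
    echo "$status"
}

default_kill_status   # prints 143 in bash and dash, i.e. 128 + 15 (SIGTERM)
```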

What does it reveal

  • Have you had enough exposure in Unix administration where this knowledge is ingrained.
  • NFI

Why don't I know this ?

  • Have to say, I've never used the kill command without an explicit signal. Didn't think it was the kind of command that was sane to be launching without explicitly telling it how to behave.
  • I guess the default is portable enough, since Linux and OpenBSD both agree on the default behaviour (basing this generalisation on a sample of two.)
  • I guessed at SIGHUP (-1), but that's just bias from what I try first before I resort to KILL (-9).
  • Now, here lies a powerful tool not meant for most mortals. Including me 8<
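
For the curious, kill(1) documents TERM as the default, and you can check it without reading anything. Below is a rough Python sketch (the function name is my own invention) that starts a throwaway child process, runs a plain kill against it through sh, and reports which signal ended the child. It assumes a POSIX system with sh available.

```python
import signal
import subprocess
import sys


def default_kill_signal():
    """Run a plain `kill PID` on a throwaway child and report the signal that ended it."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        # No -SIGNAL argument here, so kill falls back to its default.
        subprocess.run(["sh", "-c", f"kill {child.pid}"], check=True)
        status = child.wait(timeout=10)
    finally:
        # Clean up if something went wrong and the child is still alive.
        if child.poll() is None:
            child.kill()
            child.wait()
    # On POSIX, Popen reports death by signal N as return code -N.
    return signal.Signals(-status)


if __name__ == "__main__":
    print(default_kill_signal().name)
```

So my SIGHUP guess was wrong, but at least the default is the polite "please terminate" and not the -9 sledgehammer.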

Of the ps output, what is the label D for?

Urgghhh, OK, this interview is seriously becoming a disaster. Haven't really bothered with looking at the 'labels' except to see whether the service/app was a zombie or didn't even execute.

This one took a little longer to find (had to page through two screens to get at the answer), but it's right there in the ole manpage, ps(1): look for it under the 'state' column.

What does it reveal

  • Have you had enough exposure in Unix Administration where this information is ingrained?
  • NFI

Why don't I know this ?

  • Truthfully? I don't recall ever seeing this state 'D' before, so I never had reason to investigate it.
  • Obviously haven't worked on enough resource constrained systems where the state 'D' was common enough to be noticeable.
  • The last time I had to really worry about an under resourced machine was with RedHat 4.0 or 4.2 and the i386 was blazingly fast, and we had 4 x 9600Kbps zyxel fax/voice/modem hanging off the box doing wonders no-one had ever heard of.
  • Well, the hosts I monitor are more single purposed, over engineered for their purposes (because that's the only hardware you can get these days.)

After learning a little more about 'D' I'm a little more pleased with my work environment than I was previously. There are some poor bastards out there who either don't get enough resources, or are dealing with real cool problems that have these 'D' issues.
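
So I don't forget again, here is a small Python lookup of the leading state letters as I read them in the Linux ps(1) manpage. The descriptions are my own simplified wording, the names PS_STATES and describe_state are made up, and other systems (OpenBSD included) word some of these differently.

```python
# Simplified descriptions of the first letter in the ps STAT column (Linux wording).
PS_STATES = {
    "D": "uninterruptible sleep (usually waiting on disk or other I/O)",
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "T": "stopped",
    "Z": "zombie (terminated, not yet reaped by its parent)",
}


def describe_state(stat):
    """Describe a ps STAT value such as 'Ss' or 'D+' by its first letter."""
    if not stat:
        raise ValueError("empty STAT value")
    # Anything after the first letter is a modifier (session leader, foreground, ...).
    return PS_STATES.get(stat[0], "unknown state")


if __name__ == "__main__":
    for sample in ("D+", "Ss", "Z"):
        print(sample, "->", describe_state(sample))
```

A process stuck in 'D' is waiting on something like disk I/O and cannot be interrupted, which is why it tends to show up on starved or misbehaving hardware.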

Summary

If anything, I'm glad I've added to my glossary of commands, which leaves us with this lesson:

If you get a phone interview on a topic that is thoroughly covered by the Internet, clarify with the interviewer whether you're allowed to use the Internet as a resource, and if not, whether you're allowed to use other resources at your fingertips (and voice search on your phone doesn't count!!! because my phone runs an OS no one talks about.)

There may be no ethical dilemma, just the need to clarify.

8-)

Not that any of you would make such a disastrous error.

But, apparently you need to read documentation, and re-read it every once in a while, just in case you've forgotten why you previously made a decision.

Also known as: if you increase your management kung-fu, it may cost you your technical 'chops'.

FAQ: Packet Filter

Packet filtering is the selective passing or blocking of data packets 
as they pass through a network interface.

Somewhere along the line, I must have forgotten the above FAQ entry, as one copy / paste followed another as we progressed from one revision of the firewall rulesets to the next, to another OpenBSD upgrade, to another.

At some point a couple of years ago, I went through and replaced all these silly filter rulesets that looked like:

pass in on {carp0, em0}

to the more accurate

pass in on em0

So, I must have seen the 'correct' way to do it at some point, but all those dreams of pass in on carp0 kept floating around in my head. Eventually, I came across a new feature I wanted to try (i.e. Stateful Tracking Options), and the late night dreams became a nightmare when I put it into the live ruleset, and back in comes:

pass in on {carp0, em0}

Not totally fixated with the current flavour of the month science-fiction novel, I look at that outrage and say to myself "that can't be!!!" Promptly I delete the offending eye-sore, and we have the beautiful

pass in on carp0

Wooohooo, reset the firewall, totally ignore the test-suites I've enacted for everyone else to perform whenever making any firewall ruleset changes. And, go to lunch.

If you haven't figured out what happened (more to the point, what didn't happen,) let's just say I had a lot of cleaning up to do, not just with the firewall rulesets, but also with the services that weren't getting any traffic during that 'lunch break.'

But, the OpenBSD project isn't usually dependent on the FAQ for definitive statements on how things should be done. So, where does it actually say that you can filter in on one thing and not on another?

The pfctl(8) documentation has this at the beginning.

Packet filtering restricts the types of packets that pass through network
interfaces entering or leaving the host based on filter rules as
described in pf.conf(5).

The em(4) device driver, for a range of Intel NICs, leads off with:

NAME
     em - Intel PRO/1000 10/100/Gigabit Ethernet device
SYNOPSIS
     em* at pci?

whereas the carp(4) manpage says

NAME
     carp - Common Address Redundancy Protocol
SYNOPSIS
     pseudo-device carp

For my own edification, I record these notes, because apparently the reading is that a device driver attached to a physical device is a network device, whereas the carp interface is most definitely a pseudo-device (and as such is not a real network device), so the filtering belongs on the physical interface.

Summary

In short, note to self: remember the following.

Life is organic, I make a lot of mistakes, and memory cells fade, confuse, and outright lie about what you remembered to have happened.

  • Read the manual pages
  • Read the FAQ
  • When you're confident about your invulnerability drink some kryptonite, and read the documentation again.
  • Set a suite of tests to verify changes you've made to any of your systems (make sure that current behaviour is not negatively affected)
  • Perform these tests, whenever you make changes.
  • Don't make changes before lunch (or going home, unless you've got remote access and can work on it while at home or having dinner with the family.)
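
A suite of tests does not need to be grand. Here is a minimal Python sketch of the kind of check I should have run before lunch: try to open a TCP connection to each service behind the firewall and compare the result with what you expect. The function names are my own, and any host names you feed it are yours to supply.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered (timed out), unreachable, or unresolvable.
        return False


def failed_checks(checks, timeout=3.0):
    """Return the (host, port, expected) entries whose observed state differs."""
    return [
        (host, port, expected)
        for host, port, expected in checks
        if port_open(host, port, timeout) != expected
    ]
```

Feed failed_checks a list such as [("www.example.org", 443, True), ("www.example.org", 23, False)] (hypothetical hosts) from a machine outside the firewall after every ruleset reload, and do not go to lunch unless it returns an empty list.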

Eventually we had to get to the point of explicitly looking at potential denial of service attacks on the firewall.

For now we've implemented the following stratagem.

  • meter traffic and define what is abusive behaviour
  • for traffic classified as abusive, put these IP addresses in a bucket/table
  • drop any existing states from <abusers>
  • block any further connections from those IP addresses
  • at a later time, re-open connections from those IP addresses

Your mileage may vary, but since it took almost an hour to figure out how these things work, I'm putting it up here as a pointer to read the manuals with some clearer understanding.

Fragments: /etc/pf.conf

table <abusers> {}

block drop in log quick on $external_if from <abusers> to any

pass out quick on $dmz_if tagged INTERNET_DMZ

pass in on $external_if from any to <webservers> \
    port https flags S/SA synproxy state \
    (max-src-nodes 50, max-src-conn 200, max-src-conn-rate 100/10, \
    overload <abusers> flush global) \
    tag INTERNET_DMZ

OpenBSD's Packet Filter supports a number of options to monitor and manage the 'state' of packets as they traverse the firewall: "Stateful Tracking Options".

Meter Traffic, Define Abusive Behaviour

The Stateful Tracking Options that let us meter the traffic appear in the sample rule shown above:

pass in ... \
    ... \
    (max-src-nodes XXX, max-src-conn XXXX, max-src-conn-rate XXX/XX, \
    ...) \
    ...

In our above example, we use the parameters max-src-nodes, max-src-conn, and max-src-conn-rate to specify the limits that we will allow before we classify the connections as behaving abusively.

Your best bet for what those settings mean is to look them up in the pf.conf manpage and the FAQ. Below is a simplified explanation.

  • max-src-nodes defines the maximum number of remote nodes (source addresses) that may simultaneously hold state through this rule. For our example we know this https service is on very limited bandwidth, and serves a small local client base.
  • max-src-conn defines the maximum number of simultaneous connections from a single remote node supported by this rule. For our example, we observe many connections to the web service from a single user.
  • max-src-conn-rate defines, for a single remote node, the maximum rate of new connections, written as number/seconds (so 100/10 is 100 connections in 10 seconds). For our example, the metered rate is sufficient for our users.

Any IP address that breaches the max-src-conn or max-src-conn-rate boundaries is categorised as abusive. The max-src-nodes limit works on the rule as a whole: for a site allowing 50 max-src-nodes, if it is under attack and your legitimate user is node #51, they will be blocked together with the members of the DDoS attack.

Solving that particular problem is left up to the user's ingenuity. But, there's enough flexibility in pf to let you deal with the above gracefully.
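
To get a feel for how a rate limit such as max-src-conn-rate 100/10 combines with an overload table, here is a toy Python model. It is a sketch for building intuition only: pf's own accounting is more involved than this simple sliding window, and the class and method names are invented for the illustration.

```python
from collections import defaultdict, deque


class RateMeter:
    """Toy model of a per-source connection rate limit with an overload table."""

    def __init__(self, limit, seconds):
        self.limit = limit                # e.g. 100 in "100/10"
        self.seconds = seconds            # e.g. 10 in "100/10"
        self.recent = defaultdict(deque)  # source address -> recent connection times
        self.abusers = set()              # stands in for the <abusers> table

    def connect(self, src, now):
        """Record a new connection at time `now`; return True if it is allowed."""
        if src in self.abusers:
            return False  # models the block rule for <abusers>
        window = self.recent[src]
        # Forget connections that have fallen out of the time window.
        while window and now - window[0] >= self.seconds:
            window.popleft()
        window.append(now)
        if len(window) > self.limit:
            self.abusers.add(src)  # models "overload <abusers>"
            del self.recent[src]   # loosely models the flush of tracked state
            return False
        return True
```

With RateMeter(3, 10), a source that connects at seconds 0, 1, 2 and 3 is refused on the fourth attempt and every attempt after that, while a source connecting once every 5 seconds is never touched.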

Abusive IP Addresses in a table

When network traffic for the above rule exceeds the set maximum boundaries, we categorise the IP addresses exceeding these boundaries by placing them in the PF table <abusers>.

pass in ... \
    ... \
    (... , \
    overload <abusers> ...) \
    tag INTERNET_DMZ

The table <abusers> will now contain IP Addresses of any remote node that has exceeded our set boundaries.

Drop Existing States

The presumption, in this sample, is that if a remote node is abusing connections to our site, then we need to drop all connections from that IP Address.

pass in ... \
    ... \
    (... , \
    ... flush global) \
    ...

Block Connections from that IP Address

block drop in log quick on $external_if from <abusers> to any

Un-Block Connections from that IP Address

We can use pfctl to remove IP Addresses that have been in the <abusers> table for a set amount of time.

At the command-line, we can use something like the below:

pfctl -t abusers -T expire 3600

This translates to a regular/scheduled check in your crontab, something like this:

*/5       *      *      *      *      pfctl -t abusers -T expire 3600
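
As I read pfctl(8), the expire command deletes table entries that were added (or had their statistics cleared) more than the given number of seconds ago. A small Python sketch of that bookkeeping, with a plain dict standing in for the table and a made-up function name:

```python
def expire(table, now, max_age):
    """Remove entries older than max_age seconds; return the removed addresses.

    `table` maps an address to the time (in seconds) it was added.
    """
    expired = [addr for addr, added in table.items() if now - added > max_age]
    for addr in expired:
        del table[addr]
    return expired


if __name__ == "__main__":
    abusers = {"192.0.2.10": 0, "198.51.100.7": 3000}  # made-up sample entries
    print(expire(abusers, now=3700, max_age=3600))
    print(abusers)
```

Because cron only runs the check every five minutes, an address stays blocked for somewhere between 3600 and roughly 3900 seconds, which is close enough for our purposes.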

We now have examples of some mechanisms for monitoring and mitigating against Denial of Service. The cool thing about the OpenBSD packet filter solution, is that you have enough hooks into the system that you can build upon it using pfctl for a more complicated solution when your environment requires it.

We also leave as an exercise, reading up on further PF filtering:

pass in ... \
    ... flags S/SA synproxy state \
    ... \
    ... \
    ...

What else do we have in OpenBSD's Packet Filter to mitigate against Denial of Service attacks ? There's more you can look up together with what's been alluded to above:

  • Access Controls
  • Rate Limiting
  • Traffic Shaping
  • Quality of Service
  • Packet Re-assembly
  • SYN Proxy

And then, there is your ingenuity to mould the above tools to your directions. I'm sure you all got that, and more, from reading through the man pages (and if you didn't, please share your discovery with us?)

Stay Safe.

Another case of trying to avoid the inevitable.

spamhaus.org and rfc-ignorant.org are an important part of your overall antispam arsenal. The only problem is that although many of these services are free, you do need to at least:

  • confirm you are working within their terms of use
  • confirm the services are still valid

MX Proxy Extended, using Multiple Instances has been updated to something that works for me. I'm sure there are better solutions out there. But it doesn't hurt to try, or be exposed to other methods?