yapcna-2013 - alligator

Name: Shadowcat Systems Limited
Address: The Barracks, White Cross, South Road, Lancaster, Lancashire, LA1 4XQ, UK
Telephone: +44(0)1524 64544

Sat Dec 22 00:30:00 2012

Slides for the talk alligator at yapcna-2013

Architecture
Automation:
One Alligator
At A Time

Alligators?

"When you're up to your
ass in alligators it's
hard to remember you
were supposed to be
draining the swamp"

Or: When you're constantly
fighting fires it's hard
to find time to fix the
underlying problems

Alligators
are a better
metaphor

Shadowcat
consults

Early stage
startups

Technical
debt

... beats
running out
of money

Product
market
fit

"oh shit
this is
a mess"

... a mess
with happy
customers

How do you
fix it?

How do you
fix it while
still adding
features?

Easy!
(sort of)

Step 0

What servers
do we
even have?

Seriously.

Your datacentre
should be able
to list them.

Your cloud/VPS
provider will
have an API.

Start from
there.

There's always
one you forgot
about.

... and it's
probably a
SPOF!

Step 1

What the
hell are
we running?

(bonus points
if the guy
who knew left)

Documentation!

Do you trust the
documentation?

Don't.

Your systems
know what
they're running

What's
installed?

dpkg can
tell you

rpm can
tell you

Your OS can
tell you

Ask it.
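
A minimal sketch of asking the OS, assuming the host has either dpkg or rpm. The function name list_packages and the inventory path in the comment are made up for illustration.

```shell
# list_packages: print "name<TAB>version" for every installed package,
# using whichever of the two package managers this host has.
list_packages() {
    if command -v dpkg-query >/dev/null 2>&1; then
        dpkg-query -W -f='${Package}\t${Version}\n'
    elif command -v rpm >/dev/null 2>&1; then
        rpm -qa --queryformat '%{NAME}\t%{VERSION}-%{RELEASE}\n'
    else
        echo "no dpkg or rpm found; use your OS's own package tool" >&2
        return 1
    fi
}

# Example: one sorted inventory file per host (hypothetical path).
# list_packages | LC_ALL=C sort > "inventory/$(hostname).packages"
```

Sorting the output keeps two runs on an unchanged host identical, which makes later comparison easier.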

What about
custom code?

Repositories
local::lib dirs

Repositories

locate .git

Ok, so where
did it come
from?

.git/config
git remote -v

Is it the
real thing?

git status
git log
git diff
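
The repository checks above could be strung together roughly like this. It is a sketch: find stands in for locate (so it does not depend on an up-to-date locate database), and the names find_repos and audit_repo and the /srv root are assumptions.

```shell
# find_repos ROOT: print every directory under ROOT that contains a .git dir.
find_repos() {
    find "$1" -type d -name .git -prune -print 2>/dev/null |
        sed 's|/\.git$||' | LC_ALL=C sort
}

# audit_repo DIR: where did it come from, and is it the real thing?
audit_repo() {
    echo "== $1"
    git -C "$1" remote -v                                   # where it came from
    git -C "$1" status --short                              # uncommitted edits
    git -C "$1" log --oneline --branches --not --remotes    # commits on no remote
}

# Example (the root path is an assumption):
# find_repos /srv | while read -r repo; do audit_repo "$repo"; done
```

Any output from the last two git commands means the deployed copy differs from what the remote holds.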

local::lib

... ah.

perllocal.pod

... Module::Build
doesn't write it

Uh ...

Tim Bunce to
the rescue!

Dist::Surveyor

Warning: does
clever things.
This takes
a while ...

So now you
know where
your code is.

... what
talks to what?

*argh*

Step 2

Enumerate
running services

/etc/init.d

Well, yeah
but ...

daemontools
runit
ubic
...

ps ax

No,
really.

ps ax
lsof
netstat
(also /proc)

All daemons
All files
All connections
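
One way to grab all three in one go, as a sketch. The helper names and file names are invented here, and the netstat flags assume Linux net-tools; adjust for your platform.

```shell
# capture FILE CMD [ARGS...]: save CMD's output to FILE, or record that the
# tool is missing so the gap is visible in the snapshot.
capture() {
    out=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$out" 2>/dev/null
    else
        echo "MISSING TOOL: $1" > "$out"
    fi
}

# snapshot_host DIR: all daemons, all files, all connections.
snapshot_host() {
    mkdir -p "$1" || return 1
    capture "$1/processes.txt"   ps ax
    capture "$1/open-files.txt"  lsof -n -P
    capture "$1/connections.txt" netstat -tunap   # net-tools flags; an assumption
}

# Example (run as root to see every process's files and sockets):
# snapshot_host "/var/tmp/snapshot-$(hostname)"
```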

Now you can
cross
reference

Dump the
output into
a wiki page

Easy viewing
Free history

MediaWiki API
works fine

Dump the output
into a git repo

JSON::Diffable

diff
log
blame
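
A sketch of the git repo variant, assuming the snapshot repo already exists. The function names and the commit identity are placeholders, not anything from the talk.

```shell
# stable_lines: sort and de-duplicate stdin so that two snapshots of an
# unchanged system produce identical files, and therefore an empty diff.
stable_lines() {
    LC_ALL=C sort -u
}

# record_snapshot REPO: commit whatever changed in an existing git repo.
record_snapshot() {
    git -C "$1" add -A || return 1
    if ! git -C "$1" diff --cached --quiet; then
        # user.name / user.email are placeholder values
        git -C "$1" -c user.name=inventory -c user.email=inventory@localhost \
            commit -q -m "snapshot $(hostname) $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    fi
}

# Example (hypothetical repo path):
# dpkg-query -W | stable_lines > /srv/inventory/packages.txt
# record_snapshot /srv/inventory
```

Stable ordering suits package lists. Raw ps output will still churn on PIDs, so strip those first if the diffs get noisy.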

grep out
things you
recognise

Work out
what the
rest is

Repeat.

Repeat.
Repeatedly.

So now we know
what talks to
what, and why.

One more
thing.

grep. everything.
for IP addresses.

There will
be one
somewhere.

No, really.
There will.
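
A crude sketch of that grep. The function name and example paths are assumptions, and the pattern only covers dotted-quad IPv4.

```shell
# find_ipv4 PATH...: print file:line:text for anything that looks like a
# dotted-quad IPv4 address. Crude on purpose: it also matches things like
# four-part version numbers, so expect to skim the results.
find_ipv4() {
    grep -rnE '(^|[^0-9.])([0-9]{1,3}\.){3}[0-9]{1,3}($|[^0-9.])' "$@"
}

# Example (paths are assumptions):
# find_ipv4 /etc /srv 2>/dev/null
```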

Step 3

Go find a
beer to
cry into.

"When you're up to your
ass in alligators it's
hard to remember you
were supposed to be
draining the swamp"

Or: When you're constantly
fighting fires it's hard
to find time to fix the
underlying problems

Alligators
are a better
metaphor

Wild Bill
Walton

Mad
Texan.

Master of
the folksy
metaphor

(on sales guys
managing techs)

"When it comes to
technical management,
that man couldn't find
his ass with both hands
and a hunting dog"

(reminding me that
he -is- a technical
manager and I don't
need to use small words)

"I get it, Matt,
this ain't my
first rodeo"

Why am I talking
about this?

Because these
metaphors worked
for techs and
managers

"This is a swamp"
versus
"We have some
technical debt"

Guess which one
sticks in the
listener's mind?

"When you're up to your
ass in alligators it's
hard to remember you
were supposed to be
draining the swamp"

So, thanks
to Wild Bill.

Wild Bill
Walton
R.I.P.

This talk's
for you.

So, what
do we
know?

Systems
Packages
Code
Services
Dependencies

Now we
can plan.

First
thing.

If you can,
use fresh
machines.

Your existing
systems -will-
be missing
security fixes.

Assume
the worst.

Fresh installs
are controlled,
known installs.

One alligator
at a time

One service
at a time

Firewalls
aren't just
for security

Firewalls
keep your
dependencies
honest
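
To illustrate: allow only the flows you have written down, and log the rest so forgotten dependencies announce themselves. This sketch only prints iptables commands; the function name, the "PORT SOURCE" input format and the hostnames in the example are invented.

```shell
# emit_rules: read "PORT SOURCE" pairs on stdin, print iptables commands
# that allow exactly those flows, then log and drop everything else.
# It only prints; review the output before running any of it.
emit_rules() {
    echo 'iptables -A INPUT -i lo -j ACCEPT'
    echo 'iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT'
    while read -r port src; do
        case $port in ''|'#'*) continue ;; esac   # skip blanks and comments
        echo "iptables -A INPUT -p tcp -s $src --dport $port -j ACCEPT"
    done
    echo 'iptables -A INPUT -j LOG --log-prefix "undeclared: "'
    echo 'iptables -A INPUT -j DROP'
}

# Example (hypothetical hosts; remember to list your own SSH access):
# printf '5432 app1.internal\n22 admin.internal\n' | emit_rules
```

Anything that shows up in the log with the "undeclared: " prefix is a dependency nobody documented.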

Automation
approaches

Pick something
pull based.

I don't really
care what.

Sysadmins seem
to prefer Puppet

Developers seem
to prefer Chef

For the basics,
they're largely
equivalent.

Just pick one!

Pull based.

Why?

Pull based
systems
converge

System down
when an update
goes out.

Network blip
when an update
goes out.

System overloaded
when an update
goes out.

New system
added to a
cluster

All these matter
when you push

None of these
matter when
you pull.
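
A toy illustration of why pulling converges, not how Puppet or Chef do it. Each host compares the wanted state with what it last applied and fixes the difference, so a host that missed a run catches up on the next one. The function name and paths are hypothetical.

```shell
# converge_once WANTED APPLIED: make APPLIED match WANTED, doing nothing
# if it already does. Running it again is always safe.
converge_once() {
    if [ -f "$2" ] && cmp -s "$1" "$2"; then
        echo "already converged"
    else
        cp "$1" "$2" && echo "applied"
    fi
}

# A host would run something like this from cron every few minutes,
# after fetching the wanted file from a central place (fetch not shown):
# converge_once /var/cache/wanted/motd /etc/motd
```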

Pick something
pull based!

Config
generation

Your tool
can probably
template things.

Your tool
can also
call scripts.

If you already
know TT ...
just use TT.

Rule of
thumb.

Don't be
clever.

"This is systems.
You are trying to
be clever. Stop."

Step 0

Eliminate any
IP based
configuration

I don't care
if you do it
manually.

Just make
sure you
do it.

DNS is a
mess?

Fine.

rsync
/etc/hosts

Really. It's
not clever
but it works.
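
A sketch of the not-clever version, assuming one master copy of the hosts file and SSH access to every machine. The function name, the RSYNC override (there so it can be dry-run) and the host names are invented.

```shell
# sync_hosts SRC HOST...: copy one master hosts file to every listed host.
# Set RSYNC to another command to dry-run or test without touching anything.
sync_hosts() {
    src=$1; shift
    for host in "$@"; do
        ${RSYNC:-rsync} -a "$src" "$host:/etc/hosts" ||
            echo "FAILED: $host" >&2
    done
}

# Example (hypothetical master file and hosts):
# sync_hosts /srv/admin/hosts web1 web2 db1
```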

Step 1

Backup
everything

"But everything
must already be
backed up"

HAHAHAHAHAHA

Check.
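
A first-pass check could look like this sketch. The function name and the example path are assumptions, and a fresh file only shows that something ran recently, not that it restores.

```shell
# backup_is_fresh DIR [DAYS]: succeed only if DIR holds at least one
# non-empty file modified within the last DAYS days (default 1).
backup_is_fresh() {
    [ -d "$1" ] || return 1
    [ -n "$(find "$1" -type f -size +0 -mtime "-${2:-1}" -print 2>/dev/null | head -n 1)" ]
}

# Example (hypothetical backup location):
# backup_is_fresh /backups/db 1 || echo "db backup is stale or empty" >&2
```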

Step 2

Build new
machines and
restore backup
data onto them

(now you've
tested your
backups :)

Point a development
machine at the
new systems

Change something.
Check the slaves.

Concept
proven.

Step 3

Migration
strategy

Customer
facing
service?

Probably
HTTP then?

Don't trust
DNS timeouts.

www2

www.myservice.com
www2.myservice.com

Redirect
www2 -> www

Wait a day
or two.

Redirect
www -> www2

If it catches
fire, back it
out!

Wait a day
or two.

Still not
on fire?

Change www
DNS entry

Wait a day
or two.

Redirect
www2 -> www

Guess what?

... wait a
day or two.

Done!
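
At each redirect step it helps to confirm what the servers actually answer. A small sketch, assuming curl is available; the function name is made up.

```shell
# redirect_target: read HTTP response headers on stdin and print the
# Location header's value (prints nothing if there is no redirect).
redirect_target() {
    awk 'tolower($1) == "location:" { sub(/\r$/, "", $2); print $2; exit }'
}

# Example, using the hostnames from the slides:
# curl -sI http://www2.myservice.com/ | redirect_target
```

Run it against both names after every change, and again after each wait.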

Yes this
is boring.

This is systems.
Boring is GOOD.

Internal
services

Now you can
trust DNS

... but it's
stateful.

Most of them
do master/slave

Some of them
do clusters

Sometimes
this is fine.

Sometimes
this is too
clever.

Here's the
stupid way.

rsync

rsync
rsync
rsync

rsync
rsync
rsync
(halt?)

Actually ...

Once the
rsync is
under 5s

Stop services.
Stop dependencies.
Change DNS.
rsync once more.
Start services.
Start dependencies.

Done!
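
The "keep rsyncing until it is quick" part can be sketched as a loop. The function name, paths and host below are hypothetical, and date +%s is a common extension rather than strict POSIX.

```shell
# sync_until_fast SECONDS CMD [ARGS...]: rerun CMD until one pass finishes
# in under SECONDS seconds. Gives up if CMD itself fails.
sync_until_fast() {
    threshold=$1; shift
    passes=0
    while :; do
        start=$(date +%s)
        "$@" || return 1
        elapsed=$(( $(date +%s) - start ))
        passes=$((passes + 1))
        echo "pass $passes took ${elapsed}s"
        if [ "$elapsed" -lt "$threshold" ]; then
            break
        fi
    done
}

# Example with hypothetical paths and host:
# sync_until_fast 5 rsync -a --delete /srv/data/ newhost:/srv/data/
# Then: stop services and dependencies, change DNS, run the same rsync
# once more, start services and dependencies.
```

The final rsync runs with everything stopped, so it should take about as long as the last pass of the loop.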

Sound kinda
horrible?

It's entirely
brute force.

It's entirely
PREDICTABLE.

And your outage
window is short.

Clever cluster
and slave
trickery

Can be zero
outage

... can go
horribly
wrong.

Pick your
poison.

Step 4

Go find a beer
to not cry into

Decide which
service will
be next.

Repeat.

This is not
rocket
surgery.

Keep it simple.
Keep it stupid.
One alligator
at a time.
---- 
Thank You
IRC:mst
mst@shadowcat.co.uk
@shadowcat_mst