Slides for the talk "Architecture Automation: One Alligator At A Time" at YAPC::NA 2013
-
Architecture
Automation:
One Alligator
At A Time
-
Alligators?
-
"When you're up to your
ass in alligators it's
hard to remember you
were supposed to be
draining the swamp"
-
Or: When you're constantly
fighting fires it's hard
to find time to fix the
underlying problems
-
Alligators
are a better
metaphor
-
Shadowcat
consults
-
Early stage
startups
-
Technical
debt
-
... beats
running out
of money
-
Product
market
fit
-
"oh shit
this is
a mess"
-
... a mess
with happy
customers
-
How do you
fix it?
-
How do you
fix it while
still adding
features?
-
Easy!
(sort of)
-
Step 0
-
What servers
do we
even have?
-
Seriously.
-
Your datacentre
should be able
to list them.
-
Your cloud/VPS
provider will
have an API.
-
Start from
there.
-
There's always
one you forgot
about.
-
... and it's
probably a
SPOF!
-
Step 1
-
What the
hell are
we running?
-
(bonus points
if the guy
who knew left)
-
Documentation!
-
Do you trust the
documentation?
-
Don't.
-
Your systems
know what
they're running
-
What's
installed?
-
dpkg can
tell you
-
rpm can
tell you
-
Your OS can
tell you
-
Ask it.
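
A minimal sketch of asking, assuming a Debian-style box where dpkg-query is on the PATH. The function names are made up for illustration; the parsing is kept separate from the subprocess call so it can be checked against canned output.

```python
import subprocess

def parse_package_list(text):
    """Turn 'name<TAB>version' lines into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip blank or unexpected lines
        name, version = line.split("\t", 1)
        packages[name.strip()] = version.strip()
    return packages

def installed_packages():
    """Ask dpkg what is installed (Debian/Ubuntu only; an assumption)."""
    out = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Package}\t${Version}\n"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_package_list(out)
```

On an rpm box, output from something like rpm -qa --queryformat '%{NAME}\t%{VERSION}-%{RELEASE}\n' should go through the same parser.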
-
What about
custom code?
-
Repositories
local::lib dirs
-
Repositories
-
locate .git
-
Ok, so where
did it come
from?
-
.git/config
-
.git/config
git remote -v
-
Is it the
real thing?
-
git status
git log
git diff
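
Roughly what that looks like scripted. This is a sketch, not a finished tool: it walks the filesystem instead of using locate, the function names are hypothetical, and it only checks for uncommitted changes (unpushed commits would need a further git log comparison against the remote).

```python
import os
import subprocess

def find_git_repos(root):
    """Yield directories under root that contain a .git entry."""
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames or ".git" in filenames:
            yield dirpath
            dirnames[:] = []  # don't descend into the checkout itself

def has_local_changes(porcelain_output):
    """True if `git status --porcelain` printed anything at all."""
    return bool(porcelain_output.strip())

def audit_repo(path):
    """Collect where a checkout came from and whether it was edited in place."""
    def git(*args):
        result = subprocess.run(
            ["git", "-C", path, *args],
            check=True, capture_output=True, text=True,
        )
        return result.stdout
    return {
        "path": path,
        "remotes": git("remote", "-v").splitlines(),
        "dirty": has_local_changes(git("status", "--porcelain")),
    }
```

Anything that comes back dirty was probably hotfixed on the box. Go and look before you trust the remote.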
-
local::lib
-
... ah.
-
perllocal.pod
-
... Module::Build
doesn't write it
-
Uh ...
-
Tim Bunce to
the rescue!
-
Dist::Surveyor
-
Warning: does
clever things.
-
Warning: does
clever things.
This takes
a while ...
-
So now you
know where
your code is.
-
... what
talks to what?
-
*argh*
-
Step 2
-
Enumerate
running services
-
/etc/init.d
-
Well, yeah
but ...
-
daemontools
runit
ubic
...
-
ps ax
-
No,
really.
-
ps ax
lsof
-
ps ax
lsof
netstat
-
ps ax
lsof
netstat
(also /proc)
-
All daemons
All files
All connections
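
For the /proc route, a hedged sketch of reading listening sockets straight out of /proc/net/tcp (IPv4 only). It assumes a little-endian Linux machine, which is how the hex addresses are decoded here; the function names are illustrative.

```python
import socket
import struct

def decode_addr(field):
    """Decode '0100007F:0016' into ('127.0.0.1', 22)."""
    hex_ip, hex_port = field.split(":")
    # Assumption: little-endian host, so the address bytes are reversed.
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)

def listening_sockets(proc_net_tcp_text):
    """Return (ip, port) for every socket in the LISTEN state."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4 or fields[3] != "0A":  # 0A is LISTEN
            continue
        found.append(decode_addr(fields[1]))
    return found

def read_listening_sockets(path="/proc/net/tcp"):
    with open(path) as handle:
        return listening_sockets(handle.read())
```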
-
Now you can
cross
reference
-
Dump the
output into
a wiki page
-
Easy viewing
Free history
-
MediaWiki API
works fine
-
Dump the output
into a git repo
-
JSON::Diffable
-
diff
log
blame
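
The trick that makes diff, log and blame useful is stable output: same data, same bytes, one value per line. A sketch of that idea in Python, in the same spirit as JSON::Diffable but not a port of it; the function name is hypothetical.

```python
import json

def diffable_json(data):
    """Serialise with sorted keys and one value per line, so that a
    one-value change in the data is a one-line change in the diff."""
    return json.dumps(data, sort_keys=True, indent=2) + "\n"
```

Write the result to a file per host, commit after every run, and the history comes for free.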
-
grep out
things you
recognise
-
Work out
what the
rest is
-
Repeat.
-
Repeat.
Repeatedly.
-
So now we know
what talks to
what, and why.
-
One more
thing.
-
grep. everything.
for IP addresses.
-
There will
be one
somewhere.
-
No, really.
There will.
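
A rough sketch of the hunt, assuming IPv4 only. The regex finds candidates and the ipaddress module throws out things like 999.1.1.1. It will still flag the odd four-part version number, which is fine: a false positive costs a glance, a missed address costs an outage.

```python
import ipaddress
import re

# Four dotted groups of 1-3 digits, not embedded in a longer dotted number.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")

def find_ipv4(text):
    """Return (line_number, address) for every plausible IPv4 literal."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group())
            except ValueError:
                continue  # out-of-range octets and similar junk
            hits.append((lineno, match.group()))
    return hits
```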
-
Step 3
-
Go find a
beer to
cry into.
-
-
"When you're up to your
ass in alligators it's
hard to remember you
were supposed to be
draining the swamp"
-
Or: When you're constantly
fighting fires it's hard
to find time to fix the
underlying problems
-
Alligators
are a better
metaphor
-
Wild Bill
Walton
-
Mad
Texan.
-
Master of
the folksy
metaphor
-
(on sales guys
managing techs)
-
"When it comes to
technical management,
that man couldn't find
his ass with both hands
and a hunting dog"
-
(reminding me that
he -is- a technical
manager and I don't
need to use small words)
-
"I get it, Matt,
this ain't my
first rodeo"
-
Why am I talking
about this?
-
Because these
metaphors worked
for techs and
managers
-
"This is a swamp"
versus
"We have some
technical debt"
-
Guess which one
sticks in the
listener's mind?
-
"When you're up to your
ass in alligators it's
hard to remember you
were supposed to be
draining the swamp"
-
So, thanks
to Wild Bill.
-
Wild Bill
Walton
R.I.P.
-
This talk's
for you.
-
-
So, what
do we
know?
-
Systems
Packages
Code
Services
Dependencies
-
Now we
can plan.
-
First
thing.
-
If you can,
use fresh
machines.
-
Your existing
systems -will-
be missing
security fixes.
-
Assume
the worst.
-
Fresh installs
are controlled,
known installs.
-
One alligator
at a time
-
One service
at a time
-
Firewalls
aren't just
for security
-
Firewalls
keep your
dependencies
honest
-
Automation
approaches
-
Pick something
pull based.
-
I don't really
care what.
-
Sysadmins seem
to prefer Puppet
-
Developers seem
to prefer Chef
-
For the basics,
they're largely
equivalent.
-
Just pick one!
-
Pull based.
-
Why?
-
Pull based
systems
converge
-
System down
when an update
goes out.
-
Network blip
when an update
goes out.
-
System overloaded
when an update
goes out.
-
New system
added to a
cluster
-
All these matter
when you push
-
None of these
matter when
you pull.
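
A toy model of the difference, not how any real tool is implemented. Node, push_once and pull are invented names. The point is only that a push happens once, while a pull happens again on the node's next run.

```python
class Node:
    """A machine that is either reachable or not, with one applied version."""

    def __init__(self, name, applied=None):
        self.name = name
        self.up = True
        self.applied = applied

    def pull(self, wanted):
        """Run by the node itself on a timer; a no-op while it is down."""
        if self.up:
            self.applied = wanted

def push_once(nodes, version):
    """One-shot push: only nodes reachable right now get the update."""
    for node in nodes:
        if node.up:
            node.applied = version
```

A node that was down during push_once stays stale until somebody notices. With pull, it fixes itself on the next cycle after it comes back, and a brand new node gets the current state on its first run.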
-
Pick something
pull based!
-
Config
generation
-
Your tool
can probably
template things.
-
Your tool
can also
call scripts.
-
If you already
know TT ...
just use TT.
-
Rule of
thumb.
-
Don't be
clever.
-
"This is systems.
You are trying to
be clever. Stop."
-
Step 0
-
Eliminate any
IP based
configuration
-
I don't care
if you do it
manually.
-
Just make
sure you
do it.
-
DNS is a
mess?
-
Fine.
-
rsync
/etc/hosts
-
Really. It's
not clever
but it works.
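
A sketch of the not-clever version: keep one name-to-address map, render a hosts file from it, rsync that file everywhere. The function name, the example hostnames and the addresses are all made up.

```python
def render_hosts(entries):
    """Render {ip: [names]} as /etc/hosts text, in a stable order."""
    lines = ["127.0.0.1\tlocalhost"]
    for ip, names in sorted(entries.items()):
        lines.append(ip + "\t" + " ".join(names))
    return "\n".join(lines) + "\n"
```

For example, render_hosts({"10.0.0.12": ["db1", "db1.internal"]}) gives a localhost line followed by one line for db1. Once every config file says db1 instead of 10.0.0.12, moving db1 is a one-line change.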
-
Step 1
-
Backup
everything
-
"But everything
must already be
backed up"
-
HAHAHAHAHAHA
-
Check.
-
Step 2
-
Build new
machines and
restore backup
data onto them
-
(now you've
tested your
backups :)
-
Point a development
machine at the
new systems
-
Change something.
Check the slaves.
-
Concept
proven.
-
Step 3
-
Migration
strategy
-
Customer
facing
service?
-
Probably
HTTP then?
-
Don't trust
DNS timeouts.
-
www2
-
www.myservice.com
www2.myservice.com
-
Redirect
www2 -> www
-
Wait a day
or two.
-
Redirect
www -> www2
-
If it catches
fire, back it
back out!
-
Wait a day
or two.
-
Still not
on fire?
-
Change www
DNS entry
-
Wait a day
or two.
-
Redirect
www2 -> www
-
Guess what?
-
... wait a
day or two.
-
Done!
-
Yes this
is boring.
-
This is systems.
Boring is GOOD.
-
Internal
services
-
Now you can
trust DNS
-
... but it's
stateful.
-
Most of them
do master/slave
-
Some of them
do clusters
-
Sometimes
this is fine.
-
Sometimes
this is too
clever.
-
Here's the
stupid way.
-
rsync
-
rsync
rsync
rsync
-
rsync
rsync
rsync
(halt?)
-
Actually ...
-
Once the
rsync is
under 5s
-
Stop services.
-
Stop services.
Stop dependencies.
-
Stop services.
Stop dependencies.
Change DNS.
rsync once more.
-
Stop services.
Stop dependencies.
Change DNS.
rsync once more.
Start services.
Start dependencies.
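
The same sequence as a sketch, with every real action passed in as a callable so the ordering can be tested without touching a server. The function name, the parameter names and the max_passes limit are assumptions; the 5 second threshold is the one above.

```python
import time

def cutover(sync, stop, switch_dns, start,
            threshold=5.0, max_passes=20, clock=time.monotonic):
    """Brute-force move: rsync until a pass is quick, then stop,
    switch DNS, rsync once more, and start again."""
    for _ in range(max_passes):
        began = clock()
        sync()
        if clock() - began < threshold:
            break
    else:
        # Never got fast enough: leave the old system running.
        raise RuntimeError("rsync never got under the threshold")
    stop()        # services, then their dependencies
    switch_dns()
    sync()        # final pass while nothing is writing
    start()
```

In real use sync would wrap something like subprocess.run(["rsync", "-a", "--delete", src, dest], check=True), and stop and start would call whatever manages your services.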
-
Done!
-
Sound kinda
horrible?
-
It's entirely
brute force.
-
It's entirely
PREDICTABLE.
-
And your outage
window is short.
-
Clever cluster
and slave
trickery
-
Can be zero
outage
-
... can go
horribly
wrong.
-
Pick your
poison.
-
Step 4
-
Go find a beer
to not cry into
-
Decide which
service will
be next.
-
Repeat.
-
This is not
rocket
surgery.
-
Keep it simple.
-
Keep it simple.
Keep it stupid.
-
Keep it simple.
Keep it stupid.
One alligator
at a time.
----
Thank You
IRC:mst
mst@shadowcat.co.uk
@shadowcat_mst