UKUUG Spring 2010 - day 1

Notes from the audience

Thu Mar 25 08:00:00 2010

Herein, a collection of random notes from being sat in the audience of day 1 of the UKUUG Spring 2010 conference. Links inwards to the notes from specific talks follow:

Simon Wardley, mad as a bus as always

Virtualisation Security, except I got lost.

Apparently I gave a talk as well.

OpenAFS and Gerrit, in which lemon yellow is perpetrated but code review is rather cool

Ubuntu Enterprise Cloud, in which we conclude Eucalyptus should probably have stuck to being a plant

FreeBSD network virtualisation, in which your author is somewhat out of his depth but extremely impressed

lcfg-xen, in which the syntax is on crack but the functionality is actually quite neat

Google's MySQL clusters, in which a group hug occurs for those afflicted with MySQL

And now, onwards to the notes themselves:

Simon Wardley

Mad as a sack of badgers. Below is a little stream of consciousness as I slurp coffee and try and work out what the hell he's talking about.

67 definitions of cloud computing
Taxi driver
Slides and audience pain
"Trains have wheels"
Disruptive. Staggered.
Kitten with a sniper rifle ...
taxi driver. innit.
Giant overlap graph thing of industrial revolution ... looked like buttocks.
"What are the benefits of cloud?"
25 minutes in ... "you can provision faster"
... and he's off onto mad science again
"Imagine if every time you wanted an application you designed a chip, everything
would take orders of magnitude longer to get done"
... now I'm pondering a Chuck Moore / Chuck Norris joke ...
Recap two. The graphs take more acid every time.
Could mean less sysadmins ... except it could mean more sysadmins.
Uncertain. Deviation. Differential.
Lots of repetition but as yet surprisingly little hesitation ...
"so long as we all really suck at project management we're ok"
ubuntu.com/cloud
no exact explanation for why I need a giant ball of java to script JVM
"summarise"
graphs on even more acid
"commoditisation"
"demand"
"improvement"
"that is what is causing cloud" ... english, boy, do you speak it?
"more about management than technology"
"benefits vs risks"

More comprehensible comments from the Q&A:

"Cloud is a big change but the term itself is a complete load of utter drivel. There is a real academic background to what is going on that's been destroyed by marketing, and so people have a lot of trouble understanding what cloud is."

"Forget the what, understand the why."

"You as consumers need to gang up before the cloud gets you"

----

Virtualisation and security.

I missed this entirely because I couldn't find the room. Best question (which I did arrive for) - "is it just me or is Cisco trying to charge me three grand for what's basically a VMware image?". Apparently it wasn't quite that but I was too busy trying not to laugh to really take in the answer.

----

Somewhere about here I gave a talk. Highlight being the organiser nearly giving me a heart attack by getting the timing confused and telling me I had five minutes left instead of fifteen. DAMMIT DON'T SCARE ME LIKE THAT!

----

Simon Wilkinson talking about community mangling for OpenAFS

Code review is hard in an open source environment - if you don't have enough volunteers you can't review, if you can't review you either lose changes or push untested changes.

Distributed version control as a means to permit companies to maintain their own history so large change drops come with history and a commit log.

Only want to change VCS once; git on win32 still not mature enough. Mozilla found git orders of magnitude faster than mercurial, with repos half the size, but still chose mercurial; OpenAFS has fewer win32 devs.

CVS and win32 - take a file, copy it to unix, fiddle, commit. win32 devs small enough in number and happy to continue this workflow so OpenAFS went git.

CVS to git translation - repo conversion is always hard and the off the shelf stuff doesn't work for one of their complexity (especially with CVS branches).

Linux just dumped their current state into git; OpenAFS wanted to preserve history (huge code drops -> blame really important)

"CVS is like going into a pottery shop and smashing all the jars onto the floor" - CVS is file revision based rather than changeset based.

Reassembling CVS change logs into changes is based on heuristics - "if files have the same commit message and were made at about the same time (allowing for clock skew)" ... I'm trying not to flash back to CVS administration and scream in the middle of the talk but it isn't easy right now.
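To make the heuristic concrete, here's a minimal sketch of the idea in Python. This is not what cvsps or the OpenAFS conversion actually does; the FileRev record, the 300 second fuzz window and the "same file twice means a new changeset" rule are all assumptions of mine for illustration.

```python
from collections import namedtuple

# One CVS file revision: CVS records these per file, not per change.
# (Hypothetical record layout, purely for this sketch.)
FileRev = namedtuple("FileRev", "path rev author message time")


def group_changesets(revs, fuzz=300):
    """Group file revisions into guessed changesets.

    A revision joins the open changeset for its (author, message) pair when
    it is within `fuzz` seconds of the previous revision in that set and the
    file is not already part of it. Otherwise it starts a new changeset.
    """
    changesets = []
    open_sets = {}  # (author, message) -> changeset currently being built
    for r in sorted(revs, key=lambda r: r.time):
        key = (r.author, r.message)
        cs = open_sets.get(key)
        same_file = cs is not None and any(x.path == r.path for x in cs)
        if cs is None or r.time - cs[-1].time > fuzz or same_file:
            # Too late, or the file is already in this set: start a new one.
            cs = []
            changesets.append(cs)
            open_sets[key] = cs
        cs.append(r)
    return changesets
```

Note that the fuzz window is exactly where clock skew bites: too small and one change splits in two, too large and two changes merge.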

Second mention of cvsps which apparently handles this great until it meets a branch, at which point it curls up in a foetal ball in a corner and sobs.

Associating changesets to tags is simple in theory ... except that the changesets were generated by heuristics so if they didn't get it right now the tags don't match up - time to re-heuristic based on the tags.

"What you end up with is a very complicated constraint solution".

... oh my god that's a lego Doctor Who and TARDIS, he's talking about time travel but that's definitely my favourite image so far LEGO TARDIS SQUEE.

People's clocks change. So even allowing heuristics you have to remember that the times may not make any sense at all.

Final solution: start with cvsps, throw the output into perl and re-order the changesets based on the known constraints until you get something mostly consistent. Emphasis on mostly - after TWO AND A HALF YEARS they decided to set a time based deadline rather than keeping aiming for perfection.
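For my own benefit, a rough sketch of what "re-order based on the known constraints" could look like. The real thing was perl over cvsps output and I haven't seen it; this is just a timestamp-ordered topological sort in Python, with the changeset ids and the (earlier, later) constraint pairs being my own assumed representation.

```python
import heapq


def order_changesets(times, constraints):
    """Order changesets by timestamp, except where a constraint overrides.

    times:       {changeset_id: recorded timestamp (possibly wrong)}
    constraints: iterable of (earlier_id, later_id) pairs that must hold,
                 e.g. revision 1.2 of a file precedes 1.3, or everything in
                 a tag precedes the changesets made after that tag.
    """
    successors = {cs: set() for cs in times}
    blockers = {cs: 0 for cs in times}
    for before, after in constraints:
        if after not in successors[before]:
            successors[before].add(after)
            blockers[after] += 1

    # Kahn's algorithm; the heap picks the earliest timestamp among the
    # changesets that have nothing left that must come before them.
    ready = [(t, cs) for cs, t in times.items() if blockers[cs] == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        _, cs = heapq.heappop(ready)
        ordered.append(cs)
        for nxt in successors[cs]:
            blockers[nxt] -= 1
            if blockers[nxt] == 0:
                heapq.heappush(ready, (times[nxt], nxt))
    if len(ordered) != len(times):
        raise ValueError("constraints contradict each other (cycle)")
    return ordered
```

The ValueError case is presumably the "mostly consistent" part: when the heuristically built changesets contradict the tags, something has to be split or regrouped before an ordering exists at all.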

Code review - "here is this 10,000 line patch, can you try and find time to read it?" works about as well as you might imagine. This is bad when a bug can eat your users' files. This is worse when testing is hard (and it's a filesystem, of course testing is hard ...)

GSoC to the rescue!

They went out to the googleplex and talked to people, met Shawn Pearce, a serious git hacker who'd been working on an internal code review tool.

Guido's Mondrian apparently was a game changer inside google but was oriented around their internal stack so impossible to use outside (also - subversion based).

Enter Shawn, who wanted to make android code review possible - produced Gerrit, a JVM based system that uses git as a backend.

... he's doing a live demo of it. Fear! Fear! Run, man, while you still can!

Links into their bugtracker. Side by side diff. (much though I hate trac I'm so glad it made people fall in love with highlighted side by side diff).

Aha. Side by side AND YOU CAN ANNOTATE specific lines of the change. Neat.

I wonder who thought white and lemon yellow was a really good colour scheme? My brain keeps saying "a suffusion of yellow".

Huh. The terminal he just borrowed included him failing to get the spelling and capitalisation right to pull up the MooseX::Getopt POD. Nice to see the outside world is noticing and trying to adopt these things (and now I want a patched perldoc that can heuristic for spelling errors against installed stuff and maybe say "er, yeah, it's $this_spelling but you didn't install it yet")

... and there goes the network. Terminal hung but the web interface shows stuff! Live demo 1, conference wireless 0, full marks that presenter!

Continuous integration, automated, using buildbot to trap black smoke before a unit test breaking change makes it to the repo - but limit this to trusted users since Gerrit allows anybody to submit a patch.

They're using RT. Gerrit integrates to RT. OOOOH.

Pretty graph of commits per month shows that post migration their commit rates have gone up - though I wonder how much of this is review getting more changes in and how much is review and git meaning patches are smaller.

(I asked at the end - gut feeling is half and half - still good then)

Number of developers submitting more than one patch is up - promising.

Shout out to informatics for letting Simon spend work time on this - and apparently they've adopted some of this in-house as well: on LCFG and Prometheus, so far (LCFG being a systems automation tool and Prometheus an account management engine).

"git encourages branching, this is good" ... I'm reminded again that one of git's great achievements is to make branching natural - while things like svk made branching *work* it never took off as part of the workflow whereas git's forcing everything to effectively be a branch has really pushed it into standard practices.

Gerrit provides an sshd that answers to git, but it's easy enough to sync changes from there out to github or wherever else. Does mean that that becomes your "main" repository though - I'm ambivalent as to whether that's ideal, but I think the integration makes it worth a try at least (and Gerrit like gitolite can handle per-branch permissions).

Gerrit is pure java but calls external hooks - so their buildbot and RT integration is done through perl script calls.

Oh, and I went and annoyed the presenter afterwards and he says that everything the web interface can do is also available through the ssh server. Win. We shall have to play with this ...

----

Next up: The Ubuntu Enterprise cloud stuff, a talk about the actual thing rather than Simon's insane rant.

Eucalyptus as "EC2/S3 inside the firewall" - compatible enough you can use the same client tools.

"Hybrid cloud enabler" ... this sounds like it should be illegal and involve narcotics but never mind.

Needs two machines to get going (at least for test).

Node controller 1 needs the virtualisation extensions on the CPU - dual core minimum, quad for preference if you're doing anything real.

There's a very pretty diagram up but none of the text is legible from the back. It's definitely pretty though.

It's recommended that you have a separate network for the node controllers versus the storage controllers. There's also a third thing called a cluster controller but it's unclear exactly what this is. I think "node controller" means "VM host" and cluster controller means "where the admin stuff runs".

On install and configuration, his slides say "Easy and quick (and I will prove it)" - I foresee another live demo ahead.

Components publish their existence via avahi but I have no wireless and nothing says what that is. Oh, but you -can- turn the automatic stuff off so your network isn't yelling "violate me!" at the world at least. Probably still best to use a private network for that.

Aaand here's the start of the demo, "10 minutes to install but we'll have to wait a bit because it'll have to time out when it doesn't find the network".

It looks for a controller server and uses that if there, in absence appears to be defaulting to being the controller. Presenter goes back to slides while waiting for a search for NTP to time out.

Stuff goes to sleep if it isn't doing anything and if they're needed later the controller wakes them back up again - should be good for power usage.

Distribution of images is built in and there's a point and click web UI to fire up one of your pre-built images.

Eucalyptus came out of a university project hence some weirdness. The company was created just after and is trying to be a product company except still open source; not ideal since it means code drops are rare. Plus it's a mix of C and Java and of course Java doesn't always integrate amazingly well with linux distros. Apparently they had some fun getting code before time and only had a month of integration for the first 09. Ubuntu to ship it.

Meanwhile, the cloud controller appears to have installed postfix on itself so it can complain when things break and is asking for a cluster name and choice of IPs. Basically to me "the installer is sanely integrated with debconf".

wrt upstream: "It's been difficult to beat some open source sense into them". This might explain why the Eucalyptus main site is such complete bullshit - the Ubuntu guy seems pretty clear that these guys don't "get" open source.

Ubuntu is trying to convince them to depend on things that everybody else already does, but the presenter's tone of voice suggests the emphasis is on "trying" so far.

Some complaining about Java people not testing with newer versions of dependencies since they're used to a bundle-everything deployment model (insert snarky comment here about .deb failing to handle multi-version but I really do feel for the guy)

Bullet point of the presentation so far:

* Lots of dependencies of dependencies (and some hacks)

"You'd never see Eucalyptus if we had to do everything properly" ... well that bodes well ... some muttering that in six months they might figure out wtf they're doing but it doesn't sound amazingly hopeful to me.

It's installed itself! Time to try and start the node!

"Small upstream team, young code, lots of QA issues" ... most of 10.04 release cycle was spent stabilising the code drop from 09.10. And of course virtualisation in virtualisation to test the installers becomes a recursive nightmare very quickly. Needs 5 machines to do useful testing, but at least they can use laptops so you just need a big desk rather than a personal datacentre (most of the Ubuntu server devs work from home).

The Eucalyptus guys have apparently figured out that writing tests would be a good idea, and Ubuntu are building out a test farm to run them.

In spite of Eucalyptus being libvirt I'm getting the strong impression only kvm works as the underlying virtualisation engine but that's planned to be fixed.

Hmm. "This is something we want them to add and we'll push them to add" about directory integration suggests the Eucalyptus guys haven't really got the hang of taking contributions yet. Also, some hate for EC2 as an API since it's not a standard and, well, it's a giant blob of SOAP horror. Sigh.

Maybe somebody will create a standard but he's not mentioning anything on the horizon yet so don't hold your breath.

We're onto Q&A and the progress bar for the node build is still creeping across ... half an hour may have been slightly optimistic. Also, my foot has gone to sleep. Waving your foot about to try and restore blood flow without further annoying the nearby people who've already been driven insane by my incessant typing is not an easy task. Also, the pins and needles HURT.

...

IT BOOTS!

But we can't read shit on the console at this distance and apparently now it's stopped again for a bit while the JVM warms up. The muttering suggests the engineer really loves Java for this and rather wishes somebody would do Eucalyptus but sane in another language (any takers? :).

"How many single points of failure are there?" "A lot. They've been working on it."

And there's no node migration yet.

And it's full of NIH, and Canonical would probably prefer to rewrite than fork.

"The Ubuntu server team is like 5 people, we don't have the manpower"

Apparently the Eucalyptus guys just hired the old MySQL CEO. Because that's going to make their open source contribution handling sooo much better ...

I think somebody just asked how amazon are involved.

"They ignore us, I guess."

Sigh.

"Don't try and run an EC2 competitor on this yet" ... no shit.

I don't feel like the presenter got the applause he deserved for the level of honesty displayed and not dodging hard questions - also full marks to Canonical for sending him out with slides that don't try and hide the scary bits even if they do downplay them a little.

But still, Eucalyptus is definitely Not Yet Made Of Win.

----

FreeBSD: The new World.

"This is a sort of work in progress talk, but it's been in progress for over 10 years now so we're probably reaching the end of being able to claim that."

Talking about jails and virtualisation

"Illusion of multiple virtual X on one real X"

"... you can solve any problem with another level of indirection ..."

Interesting point that per-process virtual memory address spaces are another form of virtualisation - but one so ubiquitous nobody even sees it as such anymore. Same for VLANs, VPNs, storage volume management ...

Reasons for virtualization: sharing with the illusion of exclusive use - "managed overcommit". ISPs driven by space, power, cooling.

Talking about virtualisation as a spectrum from single-OS-image access control (e.g. SELinux) to virtualisation within the OS like Jails and Solaris zones to hypervisors to physical hardware.

Interesting note that running multiple operating systems on a single physical host can perform far worse than you might expect due to the differing expectations and strategies of the schedulers.

"If you want to host a thousand virtual machines on a single piece of hardware Xen probably isn't the way to go" - ISPs are apparently hosting tens of thousands of jails on one system; presumably this explains how models like nearlyfreespeech.net are viable.

ZFS as "jail-friendly" filesystem because of the way sets can be built and controlled.

Virtualisation of the network is experimental in fBSD 8, likely production in 9.

Jail "subsets" rather than "virtualizes" - basically a chroot() on steroids. Which runs into problems with non-hierarchical things like Sys V IPC, loopback etc.

New kernel level concept - "virtual instance" to allow previously global objects to be replicated per instance. Hard engineering-wise since this stuff has to happen per-packet for networking so "watch the cycle count".

Not that this contrasts impressively to waiting for a JVM to settle, oh no ...

Changing the way the heap works to virtualise more easily. Neat.

Are TCP packet drops per-stack? global? Who can ask what?

If you're careful you can share a lot of resources between instances which makes for way more efficiency than something like Xen (but is also a large part of why jails will never migrate between host systems - that's not a target though, tradeoffs, tradeoffs).

Per-CPU storage already provided a means to have C statics in the kernel do the right thing - just have to extend it to be per-network-stack and tag them as 'virtual' in the sources. Then these go into a different ELF section when linked, and each virtual network has its own copy of that with a pointer from the kernel thread to the vnet structure.

... but threads can change networks, so there's push/pop stack of network concepts - and if you turn the option off then the entire thing compiles out entirely, and they'll be leaving that in place for instruction count sensitive environments like small embedded rather than the cache miss sensitive situation of servers.

  VNET_DEFINE(struct inpcbhead, ripcb);

  #define V_ripcb VNET(ripcb)

  LIST_INIT(&V_ripcb);

Neat. Reminds me a lot of the way threads and multiplicity are handled inside the perl5 VM. But doing this to an *OS*, that's cool!

Some boot events need to be virtualized so they basically create a new event set that's run once per vnet setup rather than only once globally during boot.

They want to extend this beyond networking - VIPC ... oh my. But VNET is a great starting point - "if it's not going to work for network stacks it's not going to work for anything else" - plus they have lots of network bits and 50 active developers in the network stack (many more than, say, filesystem) so there are human as well as technological reasons to start here.

"The overhead of creating a jail on FreeBSD is 5k of non-paged memory, a VNET 300k". Wow but that's lightweight.

Network interfaces (ifnet structure) are assigned to exactly one virtual network stack. But of course you can create virtual ones of these as well as the concrete per-physical-network-interface ones. This means letting packets float between stacks and still getting the free() stuff to work right, which he glossed straight over but I bet was "fun" to get right ...

Shutting down is stunningly hard. Because the network simply went away when the machine was powered off, so it didn't need to bother freeing anything (and shouldn't because you'd swap a shitton paging stuff back in just to free it for power off).

Terminating processes kills the sockets kills the networking but that doesn't mean the routing table was cleaned up. So right now things leak memory because all this stuff needs destructors retrofitting.

"Leaking kernel memory is bad because it's not paged memory so you can't just swap it out."

FreeBSD Foundation is hopefully throwing some money at this with a view to getting production quality in 9.1/9.2

... but people are already using the experimental stuff in 8.0 in production. Of course they are. People run everything in production.

"only at the end of FreeBSD 5 development (5 year cycle) did we discover that two appliance vendors had been shipping it for 3+ years already"

8-STABLE and 9-CURRENT have this, see http://wiki.freebsd.org/Image

PRESENTER'S WARNING: EXPERIMENTAL

MST'S WARNING: THIS LOOKS WAY TOO COOL, YOUR FAULT IF YOU DEPLOY IT AS A RESULT

Now he's talking about people experimenting with routing simulation by creating a thousand network stacks and sticking a BGP instance in each one.

One of their developers is running multiple dozens of VNETs on a single Soekris box - 500k memory overhead total for a jail plus a VNET.

"10 years is a pretty typical time to take something from a university research project and get it into production" ... the contrast with the previous talk amuses me. No prizes for guessing which I'm more impressed by.

Asked about DDoS resistance if one VNET gets nailed by too many packets. I don't think I was qualified to understand the answer except that "you'd probably find anything on the same ithread is backed up but things on other ithreads would be fine - and you can pin the base OS to one and keep everything else off it". Having seen a zone-based Solaris setup lock globally when one zone gets packeted, I think I'm impressed.

Also, another beautiful quote:

"10 Gigabit hardware is new and special and sometimes it comes with 'features'"

Mmm ... 'features' ...

To another question: if the kernel module was correctly written you could easily run something like virtualbox inside a jail, but this stuff introduces a whole new level of 'correctly' so the code would probably want an audit before you trusted that. Fascinating concept nonetheless.

----

LCFG

Talking about abstracting away physical machines versus virtual machines versus virtualised clusters in terms of configuration management.

lcfg-xen seems to be something that runs on a dom0 to manage guests in an integrated way with the rest of the LCFG system.

Handles out-of-band managed guests, LCFG managed guests, and templating for both the basic disk image and the LCFG profile.

Can't handle live migration yet but it's clearly doable to write an additional lcfg component to handle that - presumably the first person to need it will build it.

  #include <lcfg/hw/xen_vm.h>

There's a screen full of config here - basically seems to boil down to the unrolling into a real config of:

!xen.<resource>_<vm>

so for e.g.

  !xen.virtualmachines   mADD(login1)
  !xen.name_login1       mSET(login1)
  !xen.memory_login1     mSET(1024)

(I was confused as to why the name was last, but apparently it's because there weren't names originally so the hierarchy sort of goes right to left)

etc. etc.
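The naming scheme is regular enough that the unrolling can be sketched in a few lines of Python. This is just me illustrating the <resource>_<vm> pattern from the slides, not how LCFG itself generates anything; the xen_resources() helper and its dict input are hypothetical.

```python
def xen_resources(vms):
    """Render {vm_name: {resource: value}} as LCFG-style resource lines."""
    lines = []
    for vm, resources in vms.items():
        # Register the VM in the list, then set each per-VM resource.
        lines.append(f"!xen.virtualmachines mADD({vm})")
        for resource, value in resources.items():
            lines.append(f"!xen.{resource}_{vm} mSET({value})")
    return lines


if __name__ == "__main__":
    for line in xen_resources({"login1": {"name": "login1", "memory": 1024}}):
        print(line)
```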

  #define _MYSQL_USER mysql
  #include <lcfg/options/mysql-server.h>

This is all quite beautiful in a sort of Old School UNIX way. I somehow doubt LCFG will be winning any rails fanboi converts today but these slides are seriously config heavy and there's not been a single thing I can't immediately see what's doing so far.

I am a little confused as to why the vm name's a suffix rather than a prefix within the xen stuff though - will have to annoy him in the questions.

Now in development: lcfg-libvirt - generalising the principle out to all libvirt supported systems. KVM and Xen were easy, but OpenVZ would take more work because its conceptualisation of networking is significantly different.

Assuming the slides for this thing are online they'll be well worth a look - the diagrams are really rather enlightening.

Overall conclusion: LCFG clearly has its own share of hysterical raisins but is holding up pretty damn well on the whole.

... oh, and we finished 15 minutes early somehow, I think perhaps the presenter overestimated how much material he had. On the other hand, I don't think he's a veteran and he didn't read off the damn slides - so I'm voting for an A- just for not making that mistake.

Asking a bit more about libvirt - it appears it mostly abstracts the basic stuff but things like bridged networking are still pretty virtualisation style specific. Abstractions: leaky as always.

----

Herding a MySQL cluster at scale at google

The presenter appears to have ... loaded hamster dance.

I don't understand.

I'm not sure I *want* to understand.

"I'm not just here to tell you how we do it, I'm here to encourage anybody else who's been afflicted with MySQL to join in the group hug."

If they can't write to the db money starts to be lost very very quickly.

RDBMS *isn't* the default inside google - "they ask 'why aren't you using BigTable instead?' ... or anything else they just happen to have written".

Heavy heavy sharding to maintain write speeds.
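He didn't say how they pick a shard, so purely as a generic illustration of the idea (each write goes to exactly one of N independent masters), here's the usual hash-the-key sketch in Python. The shard_for() helper and the choice of SHA-256 are my assumptions, not anything from the talk.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index in a stable, evenly spread way."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and fold it into the shard range.
    return int.from_bytes(digest[:8], "big") % shard_count
```

The catch with plain modulo is that changing shard_count moves most keys, so anything real needs a resharding story on top.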

Multiple replicas within a single region, but also replicate out to other regions - if a meteorite takes out a datacentre that's no excuse for downtime.

Relational DB as a known solution - Google was much smaller when this was chosen so getting developers up to speed faster was much more important than now.

"Industry standard". Or at least RDBMS is; the presenter happily isn't claiming that for MySQL.

The disadvantage of using a well known easily accessible solution, of course, is that lots of people can then access it and write stuff that talks to it.

MySQL benched 3 times faster than Postgres for their workloads, though "there was probably less of that data integrity thing going on, which probably helped" - worth dealing with some integrity at the application layer (even on InnoDB) for that performance level.

"Oracle: slow, complicated, expensive, choose 3"

Also, way way way too pricey for an implementation at the scale they needed.

MySQL cluster was interesting but when they looked you couldn't change the schema on the fly; apparently BigTable is much harder to do schema changes on.

The screen just went blank. Apparently the presentation itself isn't high availability. "it's all in the cloud, don't blame me" quoth the presenter.

The slide says monitoring and availability. The presenter says "how do we keep this whole heap of junk in the air?". I like the verbal version better.

They have a daemon running alongside mysqld that does a SELECT 1 every so often and drops the server out of the replica list if something goes wrong - way too many machines to really care if one disappears, but they do want to page out if half a dozen disappear at close to the same time. New errors page people too - if it's an unknown failure mode it needs to be turned into a known one. Then ignore it.
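The policy is simple enough to sketch. What follows is my own Python guess at the logic, not Google's daemon: the ReplicaMonitor class, the injected probe/page callables, the burst of 6 and the 60 second window are all assumptions (the talk only said "half a dozen" and "close to the same time").

```python
import time


class ReplicaMonitor:
    """Drop failing replicas quietly; page on bursts and on new error kinds."""

    def __init__(self, replicas, probe, page, burst=6, window=60.0,
                 clock=time.monotonic):
        self.serving = set(replicas)
        self.probe = probe      # probe(replica): returns, or raises on failure
        self.page = page        # page(message): wakes a human
        self.burst = burst
        self.window = window
        self.clock = clock
        self.known_errors = set()
        self.recent_failures = []  # timestamps of recent removals

    def check_all(self):
        # sorted() copies the set, so removing entries while looping is safe.
        for replica in sorted(self.serving):
            try:
                self.probe(replica)  # e.g. run SELECT 1 against it
            except Exception as exc:
                self._handle_failure(replica, exc)

    def _handle_failure(self, replica, exc):
        now = self.clock()
        self.serving.discard(replica)  # out of the replica list it goes
        kind = type(exc).__name__
        if kind not in self.known_errors:
            # Unknown failure mode: a human should turn it into a known one.
            self.known_errors.add(kind)
            self.page(f"new failure mode {kind} on {replica}")
        self.recent_failures = [t for t in self.recent_failures
                                if now - t <= self.window]
        self.recent_failures.append(now)
        if len(self.recent_failures) == self.burst:
            self.page(f"{self.burst} replicas lost within {self.window}s")
```

A single dead replica after the first of its kind produces no page at all, which matches the "too many machines to care about one" attitude.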

On master failure they promote a nearby replica and rewire replication to talk to that.

Of course, you need to shoot the other node in the head. They do this by sshing in and removing the mysql binary altogether. If that doesn't work then it gets turned off at the switch, if *that* doesn't work then take it out via percussive maintenance (this slide contains a big picture of a hammer).
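That escalation ladder is basically "try each fencing method in order until one sticks". A small Python sketch of the pattern, with the fence() helper and the (name, action) step list being my own hypothetical framing rather than anything shown on the slides:

```python
def fence(node, steps):
    """Try each fencing step in order until one confirms the node is dead.

    steps: list of (name, action) pairs; action(node) returns True once the
    node can no longer accept writes. Returns the name of the step that
    worked, or raises if every step failed.
    """
    for name, action in steps:
        try:
            if action(node):
                return name
        except Exception:
            # A step blowing up just means we escalate to the next one.
            continue
    raise RuntimeError(f"could not fence {node}")
```

In their case the list would read something like: remove the mysql binary over ssh, turn the port off at the switch, fetch the hammer.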

Replication is semi-synchronous:

Two threads involved in replication - one that runs I/O slurping slave logs in over the network, second applies the SQL stuff.

They've tweaked it so that the slave tells the master as soon as it's got the data onto disk but before it's been applied, on grounds that that means if the master vanishes they've probably got the change *somewhere*.

"We can be pretty sure that a single master will have its data replicated by the time it's dead".
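A toy model of that acknowledgement ordering, in Python. It's a single-threaded simulation to show where the ack sits relative to the two replication threads, not how their patched MySQL works; the Replica and Master classes are invented for the sketch, and refusing the commit when nobody acks is my assumption.

```python
class Replica:
    """Replica with separate 'I/O thread' and 'SQL thread' steps."""

    def __init__(self):
        self.relay_log = []  # changes safely on disk
        self.applied = []    # changes actually executed

    def receive(self, change):
        """I/O thread: persist the change, then acknowledge immediately."""
        self.relay_log.append(change)
        return True  # ack goes back before the SQL thread applies anything

    def apply_pending(self):
        """SQL thread: apply whatever has been logged but not yet executed."""
        self.applied.extend(self.relay_log[len(self.applied):])


class Master:
    def __init__(self, replicas):
        self.replicas = replicas
        self.committed = []

    def commit(self, change):
        """Report success only once at least one replica has acked."""
        acks = 0
        for replica in self.replicas:
            try:
                if replica.receive(change):
                    acks += 1
            except ConnectionError:
                continue  # unreachable replica, try the next one
        if acks == 0:
            raise RuntimeError("no replica acknowledged the change")
        self.committed.append(change)
        return acks
```

After commit() returns, the change is in some relay log even though applied may still be empty, which is the "replicated by the time it's dead" guarantee in miniature.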

Except in case of meteorite, since the replica could've been in the same DC. And failing over to another DC means you have to take your latency sensitive write clients with you to the new location.

Replica failures happen so often that they take it out of DNS, fire up another one, bring that up from backups and let it get back up to speed and rotate that one back in. Then the old one is flattened and sent to the datacentre team to fix up.

Clients need to either re-resolve DNS a lot or recover by dropping all their connections and starting again.
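On the client side, the "re-resolve a lot" option might look roughly like this in Python. The connect_with_reresolve() helper and its injected resolve/connect callables are hypothetical, there to keep the sketch testable without a real DNS server or database driver.

```python
def connect_with_reresolve(alias, resolve, connect, attempts=3):
    """Resolve `alias` afresh on every attempt, so a replica that has been
    rotated out of DNS is not retried from a stale cached address."""
    last_error = None
    for _ in range(attempts):
        address = resolve(alias)  # deliberately no caching across attempts
        try:
            return connect(address)
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError(f"no replica reachable via {alias}") from last_error
```

This only helps if the resolver underneath isn't caching either, so short TTLs on the aliases are presumably part of the deal.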

... sometimes the failure system marks 150 replicas dead by mistake and the DC operations guys have rebuilt them all and rotated them back into service before the MySQL team has a chance to say "oh, sorry, we could have these back".

Reading clients are split into three sets:
- High-value latency-sensitive (frontends)
- High-value batch
- Everything else

All three get their own DNS aliases and generally each cluster is separated, especially since the "everything else" services get all sorts of weird stuff attached to them from loud but not business critical users.

They've deleted sproc and subquery support from MySQL entirely. Also triggers.

Backups don't exist for normal backup purposes - "we have enough servers it would have to be a really *interesting* sort of outage to need them for that".

Shut down, tar up the whole thing, throw it onto GFS - use that as a checkpoint to start from when provisioning a new replica.

Question: "What do you do to guard against UPDATE without WHERE?"
Answer: "Have you talked to HR about this?"

Non-scripted writes to primaries are code reviewed before being run.

They used to keep a replica an hour behind but didn't need it after two years so turned it off.

Transaction deadlock? Reboot it. It's probably simpler.

They're still using it because: lots of things rely on it and it's better to have a single (albeit annoying) source of truth than many potentially conflicting sources of lies.

On preventing proliferation: "We have vast rays of disapproval every time somebody mentions the word MySQL" - which of course works extremely well at google. Additionally it's approximately 3x more expensive to manage for a roughly equivalent deployment compared to the google internal stuff.

Q: "Why is that?"
A: "Because MySQL is a flaky lump of poo?"

On moving to something else with an SQL like language: "We've learned the lesson of letting analysts make their own queries." - i.e. probably not.

----

Aaaand we're done for the day. More tomorrow.

-- mst, out.