Managing our shared resources

Posted: Fri Jun 12, 2009 4:31 am
by rarified
It's an all too common situation.... I went on vacation with everything I'd set up as evaluation software running fine, and while I was gone something hiccuped. Anyone who tried to access that service had to wait until I returned or was able to find network access sufficient to work on the problem.

In our new Open environment, and even just within the OpenURU world, there are services that folks will start to depend upon. From the forum servers (and the different services that run on those servers) to the eventual shared game servers, we'll have to decide what kind of reliability people will expect from the servers and how to administer them.

In my situation, I PM'ed a couple of folks I've come to trust and offered to act as a backup administrator for them, if they could do the same for me. But maybe there is a better way to match skills, needs, and processes. We already have a discussion thread where people listed their experience and skills (here: viewtopic.php?f=17&t=35).


Re: Managing our shared resources

Posted: Fri Jun 12, 2009 5:02 am
by rarified
Setting up resources to tolerate multiple manglers (errr, managers ;) )

One thing to consider when setting up a server that may need to be administered by someone other than the original owner is how hard it is to restore the server or service from a catastrophe or a major administration mistake. I'll outline here how I set up the server and services on my foundry server (which is a separate server from the one that hosts the OpenURU forum), so that if I give another individual the ability to handle minor failures, they can't do anything worse than take that server offline until I return.

I offer the foundry server on standard AMD PC hardware that is running a variation of Unix called Solaris. One of the features of Solaris is to partition the operating system so that there can be additional "copies" of Solaris running in isolation from each other and from the main OS. Each copy looks like an additional machine (much like a Virtual Machine but with much lower overhead) -- it can have its own network interfaces, filesystems, and software. The foundry server itself runs inside one of these Solaris copies.

The first benefit of this configuration is that I could give someone else administrative privileges to control programs running on the foundry, without that person having administrative privileges on the main Solaris copy. With some controls in place, even the most privileged user accessing this copy cannot change the network configuration or access disks or filesystems that were not granted to that instance when it was initially configured. So the worst thing an administrator could do is disable that instance and/or damage the data the instance has access to. (This also provides containment if someone on the network exploited a weakness in one of the services offered on the server and was able to gain privileged access to the underlying server).

A second configuration item that provides robustness to tolerate mismanagement of the server is that all storage assigned to the foundry server is on filesystems that offer the ability to record a "snapshot" of the filesystem at one or more points in the past (the Solaris ZFS filesystem). By taking a snapshot of the server before a major configuration change is made, before I would be absent and hand off administration to another person, and on a periodic basis, the storage can be easily reverted to an earlier state. So at most the loss of data on the server from a catastrophic event would be changes that happened since the most recent snapshot. An intruder hacks the server and damages it? Shut down the server, roll back storage to the earlier snapshot, and restart the server. Same thing for administration mistakes.
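For anyone who wants to script the snapshot-and-rollback routine described above, here is a minimal sketch in Python. It only builds the command lines and prints them by default. The zone name `foundry` and the dataset `tank/foundry` are made-up placeholders, not the real configuration, and the sketch assumes the commands are run from the main Solaris copy by someone with the right privileges.

```python
import subprocess
from datetime import datetime, timezone


def snapshot_name(dataset, label, now=None):
    """Build a timestamped snapshot name, e.g. 'tank/foundry@pre-handoff-20090612T050200Z'."""
    now = now or datetime.now(timezone.utc)
    return f"{dataset}@{label}-{now.strftime('%Y%m%dT%H%M%SZ')}"


def snapshot_cmd(dataset, label, now=None):
    """Command that records the current state of the dataset."""
    return ["zfs", "snapshot", snapshot_name(dataset, label, now)]


def rollback_cmd(snapshot):
    """Command that reverts a dataset to a snapshot.

    -r also destroys any snapshots newer than the target, which
    'zfs rollback' requires when the target is not the latest one.
    """
    return ["zfs", "rollback", "-r", snapshot]


def recovery_plan(zone, snapshot):
    """Shut down the zone, roll back its storage, restart the zone."""
    return [
        ["zoneadm", "-z", zone, "halt"],
        rollback_cmd(snapshot),
        ["zoneadm", "-z", zone, "boot"],
    ]


def run(cmd, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    run(snapshot_cmd("tank/foundry", "pre-handoff"))
    for step in recovery_plan("foundry", "tank/foundry@pre-handoff-20090612T050200Z"):
        run(step)
```

Leaving `dry_run=True` means a backup admin can review exactly what would be executed before committing to a rollback, since everything written after the chosen snapshot is discarded.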

While I took advantage of some Solaris features to do this, there are similar capabilities available under Linux.

I think that by trying to design in some robustness for the server, I'm more comfortable with having others help out in the management of the server. If we start offering additional servers for tools (and Uru itself) we should keep in mind that we'll probably have to share in the administrative tasks, and try to configure things to be as robust as possible.


Re: Managing our shared resources

Posted: Fri Jun 12, 2009 5:34 am
by JWPlatt
The ability to kick-start stubborn systems we depend upon would be nice, both on external servers like yours and local resources right here. That might be a bigger deal when Foundry is doing CI, but not such a big deal for the evaluations. itself is hosted so it's pretty well taken care of, but the individual resources will need admins/managers who can go in and respond to situations. i.e. Access to the resource root and db.

My estimation of trust so far has been how people present themselves, what their skills are, professionalism, and what they've invested of themselves to advance our efforts here over time. Some people have resources of special interest. For all these reasons, Mac_Fife, for example, is the Wiki resource admin and he controls its asset store here through an FTP account. I would eventually like to see all resources administered at their root by trusted members without burdening any single member with more than is enjoyable and rewarding.

As for the skills thread, it would be handy to see the entries edited by their authors, or simply some posts here, to state the specific resources in which they might be particularly interested in developing or administering. That way, we don't have to make our own inferences.

Btw, if Foundry is fairly uncomplicated, I'd be happy to learn at least how to recover it. (I wrote this as you posted again. Very nice. In fact, if this is not part of your CI design post in Building & Testing, please include it).

Re: Managing our shared resources

Posted: Fri Jun 12, 2009 5:53 pm
by Mac_Fife
Finding backup admins is one of the things I've had to address too when I've set up web sites. The hosting service gives you a degree of protection, e.g. by automatically restarting the webserver as necessary, but you still need to have someone who can contact the hosting service when something else goes wrong and you maybe need to request a backup restore.

If is going to be used seriously by development teams then I expect we do need to be prepared to keep it supported pretty much 24/7. That's too much to ask of one or two individuals, and timezone issues can also start to come into play. JW was up real late one night, and I was up early (in the UK) while we sorted out the use of images on the wiki.

If I recall, I think JW always had a view that over time there'd be a delegation of management and administration responsibilities for the tools on (if not for the domain itself) as suitable candidates revealed themselves. I would forgive rarified if he'd taken the view that the Foundry was his "own" asset and retained all management of it. I'm guessing that the virtualisation of the Foundry was originally more about keeping the "public" Foundry firewalled from the rest of his system and the fact that it lends itself to shared administration is just a happy coincidence. Either way it's good to see the "open" attitude abounds :)

I'm pretty familiar with MediaWiki, phpBB, Apache and MySQL under both Linux and Windows and reasonably au fait with PHP, Perl and JavaScript. I like to handcraft my web pages and CSS stylesheets in a text editor rather than using CMS tools or WYSIWYG editors. So I can handle most web server related tasks. I use Solaris at work, but I'm really in the "application user" category there, and would feel out of my comfort zone in taking on any admin tasks for the Foundry.

I'm happy enough to volunteer for any admin duties around the wiki, forums or general web areas with the caveat that I can't guarantee a schedule of availability, although I can usually check on things early in the morning and during the evening (UK time).

Re: Managing our shared resources

Posted: Fri Jun 12, 2009 7:54 pm
by rarified
I wouldn't think there could be any expectation of volunteers other than on an "as available" basis. Hopefully there would be enough candidates to overlap or pick up issues with a shorter delay than one individual could handle.

With respect to the foundry, the admin tasks would be a mix of apache, mysql, java (the Hudson engine is written in Java) at the outermost layer. All of it runs as a non-privileged user and the services are monitored with the Solaris SMF facility (which is just a fancy means of restarting services that abnormally terminate, and making sure services that depend upon other services are maintained honoring those dependencies).
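To make the "honoring those dependencies" part concrete, here is a small Python sketch of the general idea: start services only after the things they depend on, and work out which services are touched when one of them dies. This is not how SMF is implemented, and the dependency map is a guess made up for illustration (the real foundry dependencies may differ). It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "mysql": set(),
    "hudson": {"mysql"},
    "apache": {"hudson"},
}


def start_order(depends_on):
    """Return services ordered so that every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


def affected_by(failed, depends_on):
    """Return the failed service plus everything that depends on it,
    directly or indirectly, in a safe restart order."""
    hit = {failed}
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in hit and hit & set(deps):
                hit.add(service)
                changed = True
    return [s for s in start_order(depends_on) if s in hit]


if __name__ == "__main__":
    print(start_order(DEPENDS_ON))
    print(affected_by("mysql", DEPENDS_ON))
```

A map like this is also a handy thing to write down for backup admins, since it tells them what else to check after restarting a given service.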

But peel the onion a bit and you find two VMs inside the foundry (as described in the architecture thread here viewtopic.php?f=20&t=82). One is Ubuntu 9 Linux, the other is Windows 2003 Server. Both are populated with compilers and other tools and libraries that are needed to build software constructed with the factory. So getting a slave build system back up and running will also need skills in those base systems, as well as using SSH and an RDP client to tunnel access to the VM consoles to your local system. Knowing a little about VirtualBox would help as well (again, the VMs run under the non-privileged user).
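As a rough illustration of the tunnelling step, the Python sketch below builds an `ssh` command that forwards a local port to a VM console port on the server, so an RDP client can be pointed at `localhost`. The user name, host name, and port numbers are placeholders I made up. 3389 is the usual RDP port, but the consoles on the foundry may well listen elsewhere.

```python
def tunnel_cmd(user, host, remote_port=3389, local_port=13389):
    """Build an ssh command that forwards local_port on this machine
    to remote_port on the server's loopback interface.

    -N: do not run a remote command, just hold the tunnel open
    -L: local port forwarding, as local_port:target_host:target_port
    """
    for port in (local_port, remote_port):
        if not 0 < port < 65536:
            raise ValueError(f"invalid port: {port}")
    return [
        "ssh", "-N",
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]


if __name__ == "__main__":
    # Hypothetical account and host, for illustration only.
    print(" ".join(tunnel_cmd("builder", "foundry.example.org")))
```

With the tunnel open, the RDP client would connect to `localhost:13389` on your own machine rather than to the server directly.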

I would imagine that I'd have to write up more detail about the architecture, as well as start a list of boilerplate processes for dealing with common failure modes. That would go out on the Wiki or forum so others that might be interested in getting involved in the management (or just getting some new experiences) would have a place to orient themselves. Perhaps a rudimentary "operations manual" should be constructed for any resource with shared administration.

As for myself, I'm pretty capable on anything that looks like Unix (Solaris, Linux RH/Ubuntu). I'm passable at managing straightforward Windows Server or Desktop. Can configure web servers, mysql/postgres, most anything like that. I'm not an SQL whiz -- if something already exists that should work I can usually resolve that, but I'm not familiar enough with creating schemas or DB optimization (although I can do the underlying OS/hardware performance tuning). Lots of experience with various network stacks so I can handle routing or performance issues there. And fluent in almost any scripting except Ruby (eventually I'll have time to learn!)


Re: Managing our shared resources

Posted: Fri Jun 12, 2009 9:46 pm
by Mac_Fife
rarified wrote:I would imagine that I'd have to write up more detail about the architecture, as well as start a list of boilerplate processes for dealing with common failure modes. That would go out on the Wiki or forum so others that might be interested in getting involved in the management (or just getting some new experiences) would have a place to orient themselves. Perhaps a rudimentary "operations manual" should be constructed for any resource with shared administration.
I meant to make a comment along the lines of needing to create public documentation (or at least "public" to the delegated managers), but my brain must have gone into shutdown before I hit the submit button :? For the main website I run, I maintain a sort of manual (just a text file) that the backup admins can refer to if they need to maintain the site while I'm away on business/holiday/sick: "If you edit this file here then you need to make equivalent changes in that file there", "After adding a new web page, remember to run the search engine re-indexing script as follows...", "After adding a news item, make sure the RSS news feed XML is up to date", etc. A lot of it is actually a reminder for myself for maintenance ops I only do once in a while, like applying updates to my forums.

I'd be wary of having the documentation only on the wiki, lest that happens to be the thing that goes down ;) Maybe Google Docs could be used for a backup, or we "borrow" pages on some other wiki that's independent from the domain. As this would simply be fallback data, it wouldn't breach JW's ideal of keeping the domain in control of its assets. A second wiki would probably be the easiest option, as admins can simply export the pages from the wiki and import them into the reserve wiki.