Minkata/Foundry Outage

Discussions about the OpenUru.org Minkata test shard

Moderator: rarified

rarified
Member
Posts: 1061
Joined: Tue Dec 16, 2008 10:48 pm
Location: Colorado, US

Minkata/Foundry Outage

Post by rarified »

I broke it. It's my fault. :o

The virtual hosts for the Foundry and Minkata are down right now. I need to do some interior network renumbering, and thought I had a configuration that would tolerate coexistence of the old and new networks on the same wire. I was sadly mistaken.

Working on it, but RL work needs some time today, so it may be tonight before I get my act together.

_R
One of the OpenUru toolsmiths... a bookbinder.

Re: Minkata/Foundry Outage

Post by rarified »

Ummm, we're still down. But...

I've got the whole day today to dedicate to upgrading infrastructure to finish the network changes (aside from lunch and the canines' demands).
Hoping it will be enough 8-)

_R
One of the OpenUru toolsmiths... a bookbinder.

Re: Minkata/Foundry Outage

Post by rarified »

A quick update now that I finished some RL tasks and have had a little sleep.

Minkata/Foundry is back up at the moment, and I found the problem to be a nasty hardware issue with two CPU chips running on my server. With only one CPU chip the system is back to its usual stable self.

However, I need to undo the tangled mess I made of the networking while trying to isolate the problem, so there will be times in the near future when Minkata and the Foundry (and, unfortunately, all my other work and personal server stuff) need to come down and back up again.

So if you find Minkata up and want to use it, feel free to do so, but be aware that for the next few days it will probably go down with little or no warning.

People sometimes wonder why progress on user contributions is so slow, particularly getting into and out of the test shard. Events like this are some of the reasons for that. :cry:

_R
One of the OpenUru toolsmiths... a bookbinder.
JWPlatt
Member
Posts: 1137
Joined: Sun Dec 07, 2008 7:32 pm
Location: Everywhere, all at once

Re: Minkata/Foundry Outage

Post by JWPlatt »

I like hardware problems because it means it's not my code's issue, except perhaps to the extent that I should consider handling faults. But with a CPU problem, heh, I wash my hands of all responsibility.

Thanks, rarified!
Perfect speed is being there.
Mac_Fife
Member
Posts: 1239
Joined: Fri Dec 19, 2008 12:38 am
Location: Scotland

Re: Minkata/Foundry Outage

Post by Mac_Fife »

As I recall, there was a problem (a long while back) that appeared when upgrading CPUs and appeared to be a faulty socket. I guess that was never resolved and the Foundry has run on three CPUs ever since. Now it looks like we're down another two chips? Or am I misinterpreting?
Mac_Fife
OpenUru.org wiki wrangler

Re: Minkata/Foundry Outage

Post by rarified »

Mac_Fife wrote: As I recall, there was a problem (a long while back) that appeared when upgrading CPUs and appeared to be a faulty socket. I guess that was never resolved and the Foundry has run on three CPUs ever since. Now it looks like we're down another two chips? Or am I misinterpreting?
Good memory... Yes, earlier there was a problem when I tried to add a second processor (chip) to the machine I had at that time (a dual AMD Opteron system). With the AMD architecture, memory is populated local to each processor socket. The problem turned out to be that the first rank of memory sockets had a bent contact, and because the first rank must be filled to be able to use the processor, I had to leave that processor socket empty. Which wasn't a catastrophe, since the processors I had at the time were 4-core Opterons, so I was running 4 cores from the remaining socket.

Since then (during an update early in the summer) I found an identical new-old-stock board on eBay, and used it to replace the existing processor board, as well as updating to 6-core Opterons (for a total of 12 cores). Which ran solid until a couple of weeks ago.

Dropping to one processor and moving all the memory to the remaining processor ranks seemed to stabilize the system, but it now may turn out to be memory problems rather than CPU problems. The system is reporting sustained ECC errors on one of the moved memory banks. This time it looks like Solaris can at least keep running and disable that bank of memory rather than crashing (probably as the moved memory is no longer containing part(s) of the OS).

I have just received some new DIMMs from Kingston (sadly no longer commodity priced as time has marched on) and have some more used memory coming from another eBay seller Wednesday. So Wednesday evening I'll probably take the box down again, run some longer memory diagnostics to confirm the existing memory is bad, plug in the new memory and run more memory diagnostics to make sure the new stuff works (still on one processor chip), and then run for a few days to burn that in. Assuming that all works successfully, I'll try next weekend to reinstall the second CPU and redistribute the memory between the two processor chips again.

Where is that service contract again? 8-) I love fixing hardware...

_R
One of the OpenUru toolsmiths... a bookbinder.

Re: Minkata/Foundry Outage

Post by rarified »

Just to close out this thread, I believe the server, networking, and services are now back up and running.

I still have one hardware issue to resolve, but the current configuration is at least stable so that issue has left the top of the to-do list. :roll:

I may actually get some Uru stuff done during the holiday break! :shock:

_R
One of the OpenUru toolsmiths... a bookbinder.

Re: Minkata/Foundry Outage

Post by JWPlatt »

Nice work. Thanks!
Perfect speed is being there.