| I see dead 
		systems What prompted this article was a recent trip 
		to a data center. I have client's equipment installed there and 
		discussed an incident that happened back in the summer with people who 
		work there. Last July, I received an alert warning. I got one in my main email box (pictures are thumbnails, click on 
		them to see the full size screen snip): 
		 I got one in my backup email box on gmail: 
		 I got a text message on my phone: 
		 Now this server didn't crash. This server wasn't in any way shape or 
		form impacted by these alerts.... YET... This was like your kid coming into the room in the middle of the 
		night, poking you in the stomach, and saying "Daddy, my tummy hurts." Maybe it is nothing. Maybe he ate too much ice cream at the birthday 
		party he attended earlier. Maybe it is significant. Maybe it is appendicitis. But if your door is locked and he never comes in and never tells you 
		... think of the potential consequences. Things could turn very bad very 
		quickly, even passing the point of no return. 100% easily preventable. I'll circle back to the alert above at the end of this article and 
		why this was so critical. Here are a few more alerts I've received throughout the years from 
		various clients or my own systems. I'll just show you the text message 
		versions instead of every flavor, but I have these set to alert me in 3 
		different ways along with the owner and / or technical contact at their 
		company. 
		 Up top, someone opening up the case. This happened at my direction, 
		so no problem there. Next a disk failure. Every hard drive will fail at some point in its 
		life, taking all the data with it. RAID protects you from a hard drive 
		failure ONLY IF you replace the failed drive and let the redundancy 
		rebuild. 2nd hard drive fails, now you've got data loss. Yes, there are 
		ways of mitigating against N hard drive failures, but at some point if 
		you aren't watching and fixing things the N+1 hard drive will also fail 
		and you'll be in the same data / system loss situation. Drive replaced, system kept going without even needing a reboot! The bottom was a power supply looking over the edge of a cliff, ready 
		to jump off. Power supply replaced, system kept going without even a 
		reboot. This next one is a great story, with a timeline:  
		 Sunday, 11/6/2011: 3:56 PM - power supply throws an error, I received various messages, 
		remotely check in and confirm the failure: 
		  I called the client's cell, he was on the road somewhere in 
		Michigan driving back to their home office in Wisconsin. It is an older 
		server, no spare parts, no service contract. Knowing his location, I 
		know there are a few ways he can travel back to Wisconsin that aren't 
		that much different in distance from each other. 4:09 PM - I've located a used server for sale on eBay that is along 
		one of his paths home. Sent the seller a message asking if it could be 
		picked up in the next few hours. He calls me, I explain the situation, 
		get detailed directions, and have everything arranged. 4:31 PM - Purchase confirmation from Paypal. 4:42 PM - I've finished writing up all the details to email the 
		client so they have all the pertinent details at hand. 4:43 PM - On the phone with the client, verbal instructions for what 
		is already in the email... since you shouldn't be checking email while 
		you are driving! Instead of going home via I94, instead they'll take 
		294, exit, couple of turns, make a phone call, and the server will be 
		brought to the back of his truck. Then on with his journey home. 7:30 PM - Server is brought to his truck, he continues with his 
		journey. Monday, 9:00 AM: I talk a non-technical office staff member through 
		swapping the power supply. Takes 5 minutes tops. Server doesn't even need to be rebooted. No lapse in service. That spare server serves as an organ donor for a few more parts until 
		the server is eventually retired and replaced. That spare parts organ donating server cost only $200! That is the kind of service everyone should get!How do I get 
		that failure monitoring for my systems? Great question! If I've installed 
		your server, you probably already have it. If someone else installed it, 
		who knows? Next time you are at your data center or walk into your computer 
		room, find one of the servers that has dual power supplies and unplug 
		one of them. Go ahead! Now wait. See how long it takes for someone to show up and check if 
		there is a real problem or if it is a false alarm. If nobody ever shows 
		up, nobody calls, nobody cares, then you've got a problem waiting to bite your sensitive parts.Geeky stuff 
		alert - skip if you aren't technical: This isn't a full walk through, more a 50,000 
		foot high view into "How you get this easy and free" On a Dell Poweredge, you must install and configure Dell's 
		OPENMANAGE™ Server Administrator as shown here on one of my 
		servers: 
		 Configure all the alerts to run an application, then write a 
		batch file to email you and anyone else in the event of an emergency. 
		Here I've done it for the "Power Supply Critical" event: 
		 I use BLAT.EXE to email me and my client at various addresses using 
		the text message gateway to also send the alert to my phone via a text 
		message. HP has a similar program, HP Systems Insight Manager with pages and 
		pages of various alert possibilities, here is a snip from one of the 
		last pages: 
 So CONFIGURE THOSE ALERTS! The data (and job) you save may be your 
		own!It is 10:00 
		PM. Do you know that YOUR servers are healthy? I know 
		mine are! I know any I've installed for my clients are! What brought this on - I was at the data center and could hear a server 
		beeping every couple of seconds. As loud as it is there, I could still 
		hear the beeping. From the frequency and cadence of the beeping, I can 
		tell it is a Dell server with a drive failure. Somewhere somebody is going to have a very bad day if 
		they don't fix it soon. The server wants to be heard! It wants to be 
		well! It is crying ... and nobody was listening. I informed the 
		management staff, they contacted the client, and when I went back two weeks later the server 
		wasn't beeping anymore. The alert I opened the article with - the temperature probe alert - 
		turned out to be a problem with a cooling unit at the data center. They 
		were already on it, had setup a temporary backup cooler until the main 
		unit could be replaced on Monday. I asked "How many other people called 
		asking if there is there a problem at the data center or do I have a 
		server that is overheating and I need to investigate?" Of all the clients, only one other called. This tells me lots of servers out there are not configured to call 
		out for help when they have problems. Why??? I can't think of one good reason this wasn't done. If this 
		configuration takes someone more than an hour, they should not be 
		setting up servers. If you would like the same features on your systems, I would love to help out.  Full contact infomration is at the Contact Us link on the top left of 
		this page, and I can be reached at:das (at-sign) dascomputerconsultants (dot) com
 
 Thanks for stopping by!
 
 David Soussan (C) 2015 DAS Computer Consultants, 
LTD.  All Rights Reserved.     |