Thursday 1 December 2011

Importance of monitoring what you have outsourced

Your outsourced system is never as important to your supplier as it is to you.  Most contracts have check intervals counted in minutes, and by the time the set number of alarms has been triggered (to avoid false positives) and an operator has been alerted, 15 minutes can easily have passed, and 30 minutes or more before anybody takes the problem in hand.  Since you squeezed the price you pay for the service down to the absolute minimum, the agreed penalty is seldom in proportion to what the outage costs your organisation financially.
Another reason for doing your own monitoring is that it gives you the unvarnished truth.  Do you trust your supplier to always tell you what’s going on?  Are their answers at times vague or slow in coming?

The easiest way to see what is happening is by network monitoring.  A simple network graph from a tool like Utilwatch will give you second-by-second information, and can run on the cheapest, oldest PC you have.  If it’s running in the background but within your field of vision you will immediately know if something is amiss, and experience lets you interpret the data better.  You can also create simple traffic-light indicators with small scripts, as sketched below.
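As a rough illustration, here is a minimal traffic-light sketch in Python; the host addresses, the threshold and the ping flags (Linux-style) are only examples to adapt to your own environment, not a finished tool:

    #!/usr/bin/env python3
    # Minimal "traffic light" check: green if a host answers ping quickly,
    # amber if it answers slowly, red if it does not answer at all.
    # Hosts and threshold below are examples only - substitute your own.
    import subprocess
    import time

    HOSTS = ["192.0.2.10", "192.0.2.20"]   # example addresses
    SLOW_MS = 100                          # example threshold for "amber"

    def ping_ms(host):
        """Return a rough round-trip time in ms, or None if unreachable."""
        start = time.time()
        result = subprocess.call(["ping", "-c", "1", "-W", "2", host],
                                 stdout=subprocess.DEVNULL,
                                 stderr=subprocess.DEVNULL)
        if result != 0:
            return None
        return (time.time() - start) * 1000.0

    while True:
        for host in HOSTS:
            rtt = ping_ms(host)
            if rtt is None:
                status = "RED"
            elif rtt > SLOW_MS:
                status = "AMBER"
            else:
                status = "GREEN"
            print("%s  %s  %s" % (time.strftime("%H:%M:%S"), host, status))
        time.sleep(1)   # second-by-second, like the tools discussed above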
Cheap second-by-second tools do, however, seldom store the data.  They are WYSIWYG: the current display on screen only.  You seldom need to store that much data anyway, since the interpretation depends on other factors at the time: did you start or stop something, were your web caches reloading, was the blip due to scheduled maintenance?  A simple screenshot will capture the moment for later inclusion in a manual log together with comments.

There are also many tools that let you set up triggers and alarms to your own liking.  I would pick at least one that isn’t from the supplier of whatever you are trying to monitor.  If the supplier knew how to monitor it and trigger the alarm, they would (or should) have fixed the problem in the first place.
Some, like ipmonitor, are also cross-platform and store the history of previous alarms if configured correctly.  If your urgency is lower, or your problem is outside normal hours, tools like Cacti will give you a view of last night’s, week’s or month’s proceedings.
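Independent of any particular product, the basic idea of a trigger plus an alarm history is simple enough to sketch yourself.  The host, port, history file and check interval below are placeholders, and the print at the end is where you would hook in mail or SMS:

    #!/usr/bin/env python3
    # Sketch of a self-made trigger: run a check, require N consecutive
    # failures before alarming (to avoid false positives), and append every
    # alarm to a history file so past incidents are not lost.
    import socket
    import time

    HOST, PORT = "192.0.2.10", 80          # placeholder service to watch
    FAILS_NEEDED = 3                       # consecutive failures before alarming
    HISTORY_FILE = "alarm_history.log"     # placeholder path

    def check_ok():
        """True if we can open a TCP connection to the service."""
        try:
            with socket.create_connection((HOST, PORT), timeout=5):
                return True
        except OSError:
            return False

    failures = 0
    while True:
        if check_ok():
            failures = 0
        else:
            failures += 1
            if failures == FAILS_NEEDED:
                line = "%s ALARM %s:%d unreachable after %d checks\n" % (
                    time.strftime("%Y-%m-%d %H:%M:%S"), HOST, PORT, failures)
                with open(HISTORY_FILE, "a") as f:
                    f.write(line)          # keep the history of previous alarms
                print(line.strip())        # replace with a mail/SMS call as needed
        time.sleep(60)                     # check interval in seconds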

If you don’t feel like spending time or effort on monitoring yourself but still see the value of an outside eye on your hosting/network/resource provider, there are many third-party suppliers that will happily let you try before you buy their monitoring services.  But then you are back to a 15-30 minute response instead of seconds.

Wednesday 30 November 2011

Encourage cross department suggestions/initiatives

In companies, due to naturally occurring internal competition, there are many barriers to cross-department suggestions and initiatives.  Sometimes an outside view can be advantageous: the view from somebody with some insight who doesn’t work with the area normally.  Why go to an outside consultant when there are probably many people within your own organisation who have ideas but no way of exploring them?
 
Set up an internal discussion forum where ideas can flow across natural divisions.  In this day of the web there is no need to hold a meeting about it, where the ones who like to hear their own voice rather than have something useful to say are most likely to rule the roost.  Sometimes it’s the quiet thinker who has the deepest thoughts.
Filter the ideas and let somebody from the department whose responsibility the field is spend a little time now and then mulling over the suggestions, arguing against them or stating why they are impractical if needed.  It will help answer the question of why we don’t do things this way or that, and drive the whole organisation towards the stated goal with a greater sense of understanding and inclusiveness.
After all, what is the cost of lending an ear to new ideas, except a little time, and the gains could be significant.

This could be especially valuable in a customer-facing organisation, or one that wants to be customer friendly.  In this era of online social networking, the flow of information from the customer to your business does not always come through the planned channels.  Much can be gathered via sites like Twitter, Facebook and Google+.  Many companies “can’t afford” to monitor these sites, or those who do monitor are not in the right circles.  You have a whole workforce that uses these media in their private time, though.  Utilize this resource.

This approach does require that the management of the company sees its employees as a resource and not just a cost to be minimised.  There are many talented people out there who, given the right opportunity, could shine, even unexpectedly.  The first thought on finding an employee not thriving in their current position should be to see whether they could be a better fit somewhere else, now that the organisation has learned their strengths (and weaknesses).

Switching: an art in change


When will Cisco move on from the dark ages of the command line and create a graphical interface that can handle all flavours of its hardware?  Or is the key to its “popularity” that it requires a specialist to handle it, in such a way that every company of any size has someone who de facto becomes the networking specialist and therefore has a say in what is purchased, upholding the status quo?
Other platforms like 3Com could be handled by any admin, thanks to its realisation that we live in a Windows world.  The admin didn’t need to become a “networking specialist”, meaning they didn’t need to practise the dark art of configuring from the command line, and therefore was not seen as the networking guru.  It was a sad day, and the beginning of the end, when 3Com tried to make their interface more like Cisco’s.  That is one thing HP should not follow up on after buying the company.

Switching is going through a revolution with the advance of blade servers.  Large companies would previously merge all their standalone switches into large chassis, creating a single unified switching unit.  With blades, more of the individual server connections are handled internally, and only the central part is done by a separate switch.  Here there is a task for the server vendors: to have a separate but integrated choice of 10Gb switches available.  And I am talking copper here.  Fiber is vulnerable to kinks over short distances and to dirt on the connections, so it is best suited for longer-distance communication, like building to building or campus to campus or longer.  Within the room or within the floor there is nothing that beats the simplicity and standardisation of the Cat cables, though 10Gb is not quite there yet when it comes to standards.  Special cables for each manufacturer’s equipment is not the way to go if you want your solution to spread widely.

When you do get 10Gb in, you have the challenge of utilizing it, and that includes monitoring that you actually reach the possible speed.  Now we are talking server to storage, and racks of other media that would previously have depended on fiber for above-1Gb connectivity.  Second-by-second monitoring is required and I can recommend Utilwatch.  You’ll be lucky to see even 2Gb/sec utilization, so there is a lot to be gained for hardware manufacturers in ramping up the performance of their equipment.  You can help by getting SSD disks, discussed in the article “SSD a step towards instant computing”.
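If you want to verify the throughput yourself, the kernel’s own byte counters are enough for a second-by-second view.  The sketch below assumes a Linux host and an interface name of your own (eth0 is just an example):

    #!/usr/bin/env python3
    # Second-by-second throughput for one interface, read from /proc/net/dev
    # (Linux only).  Useful for checking how close you actually get to 10Gb.
    import time

    IFACE = "eth0"   # example interface name - substitute your 10Gb NIC

    def read_bytes(iface):
        """Return (rx_bytes, tx_bytes) for the interface from /proc/net/dev."""
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":")[1].split()
                    return int(fields[0]), int(fields[8])
        raise ValueError("interface %s not found" % iface)

    prev_rx, prev_tx = read_bytes(IFACE)
    while True:
        time.sleep(1)
        rx, tx = read_bytes(IFACE)
        # convert byte deltas to gigabits per second
        rx_gbps = (rx - prev_rx) * 8 / 1e9
        tx_gbps = (tx - prev_tx) * 8 / 1e9
        print("%s  rx %.2f Gb/s  tx %.2f Gb/s" % (time.strftime("%H:%M:%S"),
                                                  rx_gbps, tx_gbps))
        prev_rx, prev_tx = rx, tx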

Since we mentioned copper versus fiber and iSCSI: who let the fiber boys hijack the convention for iSCSI node naming?  It would have been much more convenient if this had followed the IP standard rather than the complicated naming concoction of the fiber world.  If copy and paste is not your friend, because the two systems are separated by security, you are out with pen and paper to transfer connection data from server to storage and vice versa.
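For reference, the iSCSI Qualified Name (IQN) format is iqn.yyyy-mm.reversed-domain:identifier.  A minimal sketch, with made-up domain, host and target names, showing how much longer that is than the address-style identifier I would have preferred:

    #!/usr/bin/env python3
    # Side-by-side look at an iSCSI Qualified Name (IQN) versus a plain
    # address:port style identifier.  Domain, target and address are examples.
    def make_iqn(year_month, reversed_domain, identifier):
        """Build an IQN: iqn.<yyyy-mm>.<reversed domain>:<identifier>."""
        return "iqn.%s.%s:%s" % (year_month, reversed_domain, identifier)

    iqn = make_iqn("2011-11", "com.example", "storage-array-01.target0")
    ip_style = "192.0.2.50:3260"   # what an IP-based scheme could have looked like

    print("IQN to copy by hand :", iqn)
    print("IP-style equivalent :", ip_style)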

Tuesday 29 November 2011

HW support in a time critical environment

Not long ago, hardware support on your critical servers meant that when you called the engineer out, he arrived with a boot full of parts.  Whatever part you thought faulty was changed, and if that didn’t fix the problem he would try a number of other possibilities.  The engineer was fast on site and then able to do the diagnostics and rectification in a single fast swoop.  How things have changed, and not for the better.

The callout takes a lot longer to accomplish now.  First you might have to talk to an outsourced third-world call centre.  If your company is English-speaking and the support centre’s native language is not, you’ll get by as long as at least one of the parties has the common language as their primary language.  Problems start building when neither party does.
Next you will have to do a lot of diagnostics to pinpoint exactly which part is faulty, because that is the only thing that will be sent to you.  And yes, I did say sent, because these days the part comes directly from an outsourced supplier and not with the engineer.  That means the engineer will want to ensure the part is on site before he or she is, just so he or she won’t waste any time, as if that were better than wasting yours.  Expect to waste at least 1-2 hours between the part arriving and the engineer arriving.
If that part was not the only failed item, the process starts over, this time hopefully helped by the engineer now on site, unless he or she decides that the next part is unlikely to arrive within his or her duty hours and sneaks out the back door.
And remember, in all this the contracted maximum onsite response time often only starts ticking once the problem has been diagnosed by phone and the part/engineer has been dispatched.  This often means that a 4-hour onsite promise becomes multiple hours of telephone diagnostics, with instructions and files flying back and forth, and then up to 4 hours for the part/engineer to get to site.

There is also a tendency for hardware suppliers to assume that all means of transportation will be functioning for their distribution.  For rare or very new types of systems this might mean the missing part has to be flown to the destination.  Don’t expect that to happen if another cloud of ash darkens the sky, or if your hardware is broken by the same event that has stopped air traffic, like 9/11.

Is it not incredible that many hardware suppliers have a problem identifying your specific setup every time you call them?  Even if that server is the only one you have from that specific manufacturer, you can be sure that every time you call you have to give them serial numbers and part numbers, instead of them just looking up your company name and saying “yes, we can see it here on our system”.
Vendors need to come up with a way of recording, on their internal systems, the customer’s own name for each system.  This needs to be part of after-sales, a much neglected area.  For many hardware vendors there is no such thing as “after-sales”: it is handled entirely by support, and support is reactive, meaning they only kick in when a problem occurs and the customer contacts them.  Somewhere in between there needs to be something extra.  And outsourcing it to an agent does not work; agents only get paid for sales, and won’t be directly affected if support has issues.

Monday 28 November 2011

Cooling in a damp climate

On the other hand, you have the problem of cooling such a concentrated hotspot, and air conditioning is not the most stable of devices.  Your indoor environment is sensitive to the smallest bit of sunlight, and the outdoor units are very vulnerable altogether.  They end up spending a lot of time de-icing, so see to it that your runoff is adequate.  That can be a problem when your fire-extinguishing system needs a completely sealed room, and your insurance needs it to be pressure tested.
A cool but humid climate is not always the best for a datacenter.  Yes, you need to run your air conditioning slightly less, but you get a lot more de-icing issues.  That is one of the reasons reverse-cycle air conditioning for home heating never gained popularity in Ireland, compared to colder but much drier climates like Scandinavia.  If you have a weather station with a separate outdoor unit you will know what I mean: they spend a lot of the time showing a humidity error because of very high values.

Underfloor cooling was meant for network racks, where the passage up through the rack is unobstructed thanks to the shallowness of the equipment.  Full-length servers block the flow of air through the rack, so it’s better to supply cold air at the front and remove the hot air from the back of the rack.  This way you create cold aisles in front of the racks and hot aisles behind them.  If you have several rows of racks, this does require that every second row is turned the opposite way, so that one server’s hot exhaust doesn’t become another’s cooling intake.
A downside of hot and cold aisles is that where you are most likely to work, at the front where the console is, is also where there is a constant cold draft.  You could alternatively place the consoles at the back, which also eases the cabling.  These days it’s more normal to remote-control the whole room, so there is little need for direct human access.  You could also increase the general temperature of the room slightly: rather than setting it at 19°C you could experiment with 22°C.

Few will run their cooling via the UPS, due to its large power demands and the resulting shortening of the UPS run time during a grid failure.  If your computer room has generator backup, you will need to restart your cooling once the generator takes over.  Lack of cooling will make your equipment’s internal fans speed up as the room temperature rises, eventually overloading fuses and cables.
You can temporarily rectify the situation by pumping cold air in from outside, or by redistributing the air already in the room with a dedicated fan and an extendable tunnel, easily and cheaply bought from a hardware store like MachineMart.
  
Due to the vulnerable nature of air conditioning you will need to over-dimension.  You should have at least enough capacity that a third of it can be offline for maintenance while you are still able to keep the temperature within range.
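As a rough worked example of that sizing rule (the heat load and unit capacity figures below are made up):

    #!/usr/bin/env python3
    # Rough sizing check: how many cooling units do you need so that a third
    # of the installed capacity can be offline and the rest still covers the
    # room's heat load?  Figures below are examples only.
    import math

    heat_load_kw = 40.0      # example: total heat load of the room
    unit_capacity_kw = 10.0  # example: cooling capacity of one unit

    # With 1/3 of capacity allowed offline, the remaining 2/3 must cover the load.
    required_total_kw = heat_load_kw / (2.0 / 3.0)
    units_needed = math.ceil(required_total_kw / unit_capacity_kw)

    print("Required installed capacity: %.1f kW" % required_total_kw)
    print("Units of %.0f kW needed: %d" % (unit_capacity_kw, units_needed))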
It can sometimes be difficult to spot a failing air-conditioning unit.  Simple filter or other error messages on the control panel are mostly self-explanatory, but sometimes you have a rise in temperature without any message.  Check that the air coming out is actually cold; sometimes the unit keeps running but just blows out the same air at the same temperature it went in at, especially if the outdoor part of the unit has failed.

I will again point out the importance of an environment monitor.  They are relatively cheap for what they protect, and the same unit that monitors your power can also monitor the room temperature.  Place sensors in several different positions, since the temperature is highly unlikely to be uniform across the whole room, and single failures can result in hotspots.
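If your environment monitor can be queried over the network, even a small script can watch several sensor positions and flag a hotspot that an average would hide.  The read_temperature function below is only a stand-in for however your particular unit exposes its readings (SNMP, HTTP, serial), and the sensor names and threshold are examples:

    #!/usr/bin/env python3
    # Sketch: poll several temperature sensors placed around the room and warn
    # when any single position runs hot, even if the average looks fine.
    # read_temperature() is a stand-in for your monitor's real interface.
    import random
    import time

    SENSORS = ["front-row-A", "back-row-A", "front-row-B", "ceiling"]
    WARN_C = 27.0   # example per-sensor warning threshold

    def read_temperature(sensor_name):
        """Placeholder: replace with an SNMP/HTTP/serial read from your unit."""
        return 21.0 + random.uniform(-1.0, 8.0)   # simulated reading

    while True:
        readings = {name: read_temperature(name) for name in SENSORS}
        hot = {name: t for name, t in readings.items() if t > WARN_C}
        avg = sum(readings.values()) / len(readings)
        print("%s  average %.1fC  %s" % (
            time.strftime("%H:%M:%S"), avg,
            ("HOTSPOT: " + ", ".join(sorted(hot))) if hot else "all sensors ok"))
        time.sleep(30)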