Wednesday 30 November 2011

Encourage cross department suggestions/initiatives

In companies, due to naturally occurring internal competition, there are many barriers to cross-department/field suggestions and initiatives.  Sometimes an outside view can be advantageous: the view from somebody who has some insight but doesn’t work with the subject normally.  Why go to an outside consultant when there are probably many within your own organisation who have ideas on the subject but no way of exploring them?
 
Set up an internal discussion forum where ideas can flow across natural divisions.  In this day of the web there is no need to have a meeting about it, where those who like to hear their own voice, rather than those with something useful to say, are most likely to rule the roost.  Sometimes it’s the quiet thinker who has the deepest thoughts.
Filter, and let somebody from the department whose responsibility the field is spend a little bit of time now and then mulling over the suggestions, and argue against them or state why they are impractical if needed.  It will help answer the question of why we don’t do things this way or that way, and drive the whole organisation towards the stated goal with a greater sense of understanding and inclusiveness.
After all, what is the cost of lending an ear to new ideas, except for a little time?  The gains could be significant.

This could be especially valuable in a customer-facing organisation, or one that wants to be customer friendly.  The flow of information from the customer to your business does not always come through the planned channels in this era of online social networking.  Much can be gathered via sites like Twitter, Facebook and Google+.  Many companies “can’t afford” to monitor these sites, or those who do monitor are not in the right circles.  You have a whole workforce that uses these media in their private time, though.  Utilize this resource.

This approach does require that the management of the company sees the employees as a resource and not just as a cost to be minimised.  There are many talented people out there who, given the right opportunity, could shine, even if unexpectedly.  The first thought when finding an employee not thriving in their current position should be to see if they could be a better fit somewhere else, now that the organisation has learned their strengths (and weaknesses).

Switching an art in change


When will Cisco move on from the dark ages of the command line and create a graphical interface that can handle all flavours of its hardware?  Or is the key to its “popularity” that it requires a specialist to handle it, so that every company of any size has one who de facto becomes the networking specialist and therefore has a say in what is purchased, upholding the status quo?
Other platforms like 3Com could be handled by any admin, thanks to its realisation that we live in a Windows world.  But the admin didn’t need to become a “networking specialist”, meaning there was no need to practise the dark art of programming from the command line, and so the admin was not seen as the networking guru.  It was a sad day, and the beginning of the end, when 3Com tried to make their interface more like Cisco’s.  That is one thing HP should not follow up on after buying the company.

Switching is going through a revolution with the advance of blade servers.  Large companies would previously merge all their standalone switches into large chassis, creating a single unified unit for switching.  With blades, more of the single server connections are handled internally, and only the central part is done by a separate switch.  Here there is a task for the server vendors: to have a separate but integrated choice of 10Gb switches available.  And I am talking copper here.  Fiber is vulnerable to kinks over short distances and to dirt on the connections, so it is best suited for longer-distance communication, like building to building, campus to campus, or longer.  Within the room or within the floor there is nothing that beats the simplicity and the standardisation of Cat cabling.  Though 10Gb is not quite there yet when it comes to standards.  Special cables for each manufacturer’s equipment are not the way to go if you want your solution to spread wide.

When you do get 10Gb in, you have the challenge of utilizing it.  And that includes monitoring that you do reach the possible speed.  Now we are talking server to storage and racks of other media that would previously have depended on fiber for connectivity above 1Gb.  Second-by-second monitoring is required and I can recommend Utilwatch.  You’ll be lucky if you see even 2Gb/sec utilization, so there is a lot to be gained for hardware manufacturers in ramping up the performance of their equipment.  You can help by getting SSD disks, discussed in the article “SSD a step towards instant computing”.
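
As a rough illustration of what second-by-second monitoring involves, the sketch below turns two readings of an interface byte counter into a utilization figure.  The function name, the sample counter values and the 10Gb link speed are assumptions for illustration only; this is not a description of how Utilwatch works internally.

```python
def link_utilization(bytes_before, bytes_after, interval_s, link_gbps=10.0):
    """Return (gigabits per second, fraction of link capacity) between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_bytes = bytes_after - bytes_before
    if delta_bytes < 0:
        # A real monitor would have to handle counter wrap or reset here.
        raise ValueError("counter went backwards")
    gbps = delta_bytes * 8 / interval_s / 1e9
    return gbps, gbps / link_gbps


# Hypothetical samples taken one second apart: 250 MB moved in that second.
gbps, fraction = link_utilization(1_000_000_000, 1_250_000_000, 1.0)
print(f"{gbps:.1f} Gb/s, {fraction:.0%} of the link")
```

With these made-up numbers the link is moving 2Gb/sec, a fifth of what a 10Gb connection could carry.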

Since we mentioned copper versus fiber and iSCSI: who let the fiber boys hijack the convention for iSCSI node naming?  It would have been much more convenient if this had been done to the IP standard rather than the complicated naming concoction of fiber.  If copy and paste is not your friend, due to two separate systems with security between them, you are out with pen and paper to transfer connection data from server to storage and vice versa.
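
For anyone who has not had to write one out by hand, an iSCSI qualified name (IQN) looks like iqn.2011-11.com.example:storage.array1: the literal "iqn", a year and month, a reversed domain name, and an optional identifier after a colon.  The sketch below splits such a name into its parts.  The example name is invented, and the pattern is a simplified check under my own assumptions, not a full validator for the iSCSI standard.

```python
import re

# Simplified IQN shape: iqn.YYYY-MM.reversed.domain[:identifier]
IQN_PATTERN = re.compile(r"^iqn\.(\d{4})-(\d{2})\.([a-z0-9.-]+)(?::(.+))?$")


def parse_iqn(name):
    """Split an IQN into its parts, or return None if it does not fit the simplified shape."""
    match = IQN_PATTERN.match(name)
    if match is None:
        return None
    year, month, authority, identifier = match.groups()
    return {
        "year": int(year),
        "month": int(month),
        "authority": authority,      # reversed domain name of the naming authority
        "identifier": identifier,    # free-form part after the colon, may be None
    }


# Hypothetical node name, compared with a plain IP address.
print(parse_iqn("iqn.2011-11.com.example:storage.array1"))
print(parse_iqn("10.0.0.5"))
```

Compare the length of that name with the plain IP address on the last line and you see what ends up on the pen and paper.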

Tuesday 29 November 2011

HW support in a time critical environment

Not long ago hardware support on your critical servers meant that when you called the engineer out he arrived with a boot full of parts.  This meant that whatever part you thought faulty was changed, and if that didn’t fix the problem he would try a number of other possibilities.  The result was that the engineer was on site fast and then able to do the diagnostics and rectification in one fell swoop.  How things have changed, and not for the better.

The callout takes a lot longer to accomplish now.  First you might have to talk to support, outsourced to a third world call centre.  If your company is English speaking and the support centre’s native language is not, you’ll get by as long as at least one of the parties has the common language as a primary language.  Problems start building when neither of the two parties has the common language as their primary.
Next you will have to do a lot of diagnostics to pinpoint exactly the faulty part, because that is the only thing that will be sent to you.  And yes, I did say sent, because these days the part comes directly from an outsourced supplier and not with the engineer.  This means the engineer will want to ensure that the part is on site before he/she arrives, just so he/she won’t waste any time, as if that was better than wasting yours.  Expect to waste at least 1-2 hours from when the part arrives until the engineer arrives.
If that part was not the only failed item, the process starts over, but this time hopefully helped by the engineer now on site.  Unless he/she decides that the next part is unlikely to arrive within his/her duty hours and sneaks out the back door.
And remember in all this, the contracted maximum onsite response time often only starts ticking from when the problem has been diagnosed by phone and the part/engineer is being dispatched.  The result is often that a 4 hour onsite promise becomes several hours of telephone diagnostics, with instructions and files flying back and forth, and then up to 4 hours for the part/engineer to come to site.

There is also a tendency for hardware suppliers to assume that all means of transportation will be functioning for their distribution.  So for rare or just very new types of systems this might mean the missing part has to be flown to the destination.  Don’t expect that to happen if another cloud of ash darkens the sky.  Or what if your hardware is broken due to an event that has stopped air traffic, like 9/11?

Is it not incredible that many hardware suppliers have a problem identifying your specific setup every time you call them?  Even if that server is the only one you have from that specific manufacturer, you can be sure that every time you call them you have to give them serial numbers and part numbers, instead of them just looking up your company name and saying “yes, we can see it here on our system”.
Vendors need to come up with a way, on their internal systems, of recording the customer’s own name for each system.  This needs to be part of after sales, a much neglected area.  For many hardware vendors there is no such thing as “after sales”.  It is completely handled by support, and they are reactive, meaning they only kick in when a problem occurs and the customer contacts them.  Somewhere in between there needs to be something extra.  And outsourcing it to an agent does not work.  They only get paid for sales, and won’t be directly affected if support has issues.

Monday 28 November 2011

Cooling in a damp climate

On the other hand you have the problem of cooling such a concentrated hotspot.  And air conditioning is not among the most stable of devices.  Your indoor environment is sensitive to the smallest bit of sunlight and the outdoor units are very vulnerable altogether.  They end up spending a lot of time de-icing, so see to it that your runoff is adequate.  That can be a problem when your fire extinguishing system needs a completely sealed room, and your insurance requires it to be pressure tested.
A cool but humid climate is not always the best for a datacenter.  Yes, you need to run your air conditioning slightly less, but you get a lot more de-icing issues.  This is one of the reasons reverse cycle air conditioning for home heating never gained popularity in Ireland, compared to colder but much drier climates like Scandinavia.  If you have a weather station with a separate outdoor unit you will know what I mean.  They spend a lot of the time showing a humidity error because of very high values.

Underfloor cooling was meant for network racks, where the passage through the rack is unobstructed due to the shortness of the equipment.  Full-length servers block the flow of air through the rack, so it’s better to supply cold air at the front and remove the hot air from the back of the rack.  This way you create cold aisles in front of racks and hot aisles behind racks.  If you have several rows of racks this does require that every second row is turned the opposite way, so that one server’s hot exhaust does not become another’s cooling air intake.
A downside of hot and cold aisles is that where you are most likely to work, at the front where the console is, is also the place where there is a constant cold draft.  You could alternatively place the consoles at the back, which also eases the cabling.  These days it’s more normal to remote control the whole room, so there is little need for direct human access.  And you could also increase the general temperature of the room slightly.  Rather than setting it at 19°C you could experiment with 22°C.

Few will run their cooling via the UPS, due to the large power demands and the resulting shortening of the UPS running period at the time of a grid failure.  If your computer room has generator backup, you will need to restart your cooling with the generator.  Lack of cooling will make your equipment’s internal fans increase in speed as the room temperature goes up, eventually overloading fuses and cables.
You can temporarily rectify the situation by pumping cold air in from the outside, or by redistributing the air already in the room better with a dedicated fan and an extendable tunnel, easily and cheaply bought from a hardware store like MachineMart.
  
Due to the vulnerable nature of air conditioning you will need to overdimension.  You should have at least enough that 1/3 of the cooling capacity can be offline for maintenance while you are still able to keep the temperature within range.
It can sometimes be difficult to spot a failing air conditioner.  Simple filter or other error messages on the control panel are mostly self-explanatory, but sometimes you have a rise in temperature without any message.  Check that the air coming out is actually cold.  Sometimes the units keep on running but just blow out the same air at the same temperature as it went in, especially if the outdoor part of the unit has failed.
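
The one-third rule of thumb for overdimensioning can be put into numbers with a small sketch.  The function name and the 20 kW heat load are assumptions for illustration; the heat load of your own room has to come from your own measurements.

```python
def required_cooling_kw(heat_load_kw, offline_fraction=1 / 3):
    """Installed cooling capacity needed so the remaining units still cover the heat load
    while `offline_fraction` of the capacity is out of service."""
    if not 0 <= offline_fraction < 1:
        raise ValueError("offline_fraction must be in [0, 1)")
    return heat_load_kw / (1 - offline_fraction)


# Hypothetical room producing 20 kW of heat.
print(f"{required_cooling_kw(20):.1f} kW of installed cooling")
```

In other words you install about one and a half times what the room needs on a good day.  With three equally sized units that means any one of them can be down for maintenance.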

I will again point out the importance of an environment monitor.  They are relatively cheap for what they protect, and the same one that monitors your power can also monitor the room temperature.  Place sensors in several different positions, since the temperature is highly unlikely to be uniform across the whole room, and single failures can result in hotspots.
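
A minimal sketch of why several sensors matter: it flags any sensor that is either above an absolute limit or clearly warmer than the rest of the room.  The sensor names, the readings and both thresholds are made up for illustration, and this is not the logic of any particular monitoring product.

```python
from statistics import median


def find_hotspots(readings_c, max_c=27.0, max_above_median_c=4.0):
    """Return the names of sensors that exceed an absolute limit
    or sit well above the median temperature of all sensors."""
    room_median = median(readings_c.values())
    return sorted(
        name
        for name, temp in readings_c.items()
        if temp > max_c or temp - room_median > max_above_median_c
    )


# Hypothetical readings in degrees Celsius from four sensor positions.
readings = {"front-low": 21.0, "front-high": 22.5, "rear-low": 24.0, "rear-high": 31.5}
print(find_hotspots(readings))
```

A single sensor at the front of the room would have reported a comfortable 21 or 22 degrees here, while the top of the rear is already in trouble.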

Sunday 27 November 2011

Explosion in power needs

In recent years there has been an explosion in the power requirement per rack.  Not long ago you got 2*16amp sockets, for the A and B side, and that was it.  And it was like that for 10 years.  Then came the higher density of blades, where 16 servers could now fit in a space previously populated by 10 or sometimes just 5.  On top of that each server would have more cores, and each chassis would have to have PSUs to cater for its top spec.  Pretty fast you are requiring more like 4*32amp per 10U, and fuses were tripping all over the place.
Yes, you can power manage by limiting the power each server and chassis can use, but then you can never run at your top capacity, so why did you buy it?  You will also have startup issues if you have total power failures.
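
To put rough numbers on that jump, the sketch below converts the socket counts into kilowatts.  It assumes 220V, a power factor of 1, and that the feeds are split evenly into an A and a B side so that only half is usable if either side must be able to carry the full load.  Those are my simplifying assumptions, not figures from any vendor.

```python
def feed_capacity_kw(feeds, amps_per_feed, volts=220.0, redundant_ab=True):
    """Approximate usable power in kW from a number of equal feeds.
    With A/B redundancy only half the feeds may be counted."""
    total_kw = feeds * amps_per_feed * volts / 1000.0
    return total_kw / 2 if redundant_ab else total_kw


old_rack = feed_capacity_kw(2, 16)    # the old 2*16amp per rack
blade_10u = feed_capacity_kw(4, 32)   # 4*32amp for a single 10U chassis
print(f"old rack: {old_rack:.2f} kW, one blade chassis: {blade_10u:.2f} kW")
print(f"ratio: {blade_10u / old_rack:.1f}x")
```

So a single 10U chassis can ask for around four times what a whole rack used to get, before you have filled the rest of the rack.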

For security against the frequent failures, or just scheduled maintenance, of the normal power grid, most companies with in-house servers have some form of UPS system.  Here the problem is that they seldom last for more than 10 minutes, or 20 if you are lucky.  They will be based on batteries, and batteries are not a good way of storing any significant amount of power when it comes to appliances that use large amounts at 220V.
And what can you do in, let’s say, 15 minutes?  It’s hardly enough time for an admin to shut down the most essential databases.  (Oracle does not enjoy a sudden and complete loss of power.)  Most will use the best part of that time to trigger the alert.  Here an environment monitor like Avtech is worth its weight in gold for fast SMS notification.
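
The time budget can be sketched like this.  All the durations below are invented for illustration, and the function is a simple sum under my own assumptions, not a model of any real UPS or monitoring product.

```python
def shutdown_fits(ups_runtime_min, alert_delay_min, shutdown_steps_min, safety_margin_min=2.0):
    """Return (fits, spare_minutes) for a controlled shutdown on UPS power.
    alert_delay_min is the time from power loss until an admin starts acting."""
    needed = alert_delay_min + sum(shutdown_steps_min) + safety_margin_min
    return needed <= ups_runtime_min, ups_runtime_min - needed


# Hypothetical shutdown steps: 5 minutes for the databases, 3 for the remaining servers.
steps = [5, 3]
print(shutdown_fits(15, alert_delay_min=8, shutdown_steps_min=steps))  # slow, manual alerting
print(shutdown_fits(15, alert_delay_min=1, shutdown_steps_min=steps))  # fast SMS alerting
```

With these made-up figures the slow alert leaves you 3 minutes short, while the fast one leaves 4 minutes to spare on the same UPS.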
  
Most companies above a certain size will back up their UPS with a generator.  And I do say “a”, because very few besides dedicated data centres that offer services to third parties have more than one.  What they forget is that a generator is much like a car.  How sure are you that your car will start first time after standing idle for a few weeks?  Regular testing is required, but most generators stand around for many years, so now we are talking about a 20 year old car.  Yes, it doesn’t have much mileage, but that is not always a good thing.  Diesels like to be run.
If you try to solve this with a second generator you are in for a very complicated and vulnerable failover system to ensure that every part is redundant.  And somewhere in the middle there will be some sort of vulnerable failover switch.  Remember also that you don’t want to make it so complicated that it introduces more risk than what you were guarding against.

You could try to get a second grid supply, but in most places you will find that an actual physical separation on the supply side is nearly impossible.  Competition just hasn’t got that far.  You will also run into the same problem as with a second generator: how to feed power from two sources.