Wednesday, 30 November 2011

Encourage cross department suggestions/initiatives

In companies, due to naturally occurring internal competition, there are many barriers to cross department/field suggestions and initiatives. Sometimes an outside view can be advantageous: the view of somebody who has some insight but doesn’t work with the subject normally. Why go to an outside consultant when there are probably many within your own organisation who have ideas for the team but no way of exploring them?
 
Set up an internal discussion forum where ideas can flow across natural divisions. In this day of the web there is no need to have a meeting about it, where those who like to hear their own voice rather than have something useful to say are most likely to rule the roost. Sometimes it’s the quiet thinker who has the deepest thoughts.
Filter the suggestions and let somebody from the department whose responsibility the field is spend a little bit of time now and then mulling them over, arguing against them or stating why they are impractical if needed. It will help answer the question of why we don’t do things this way or that way, and drive the whole organisation towards the stated goal with a greater sense of understanding and inclusiveness.
After all, what is the cost of lending an ear to new ideas, except for a little time? And the gains could be significant.

This could be especially valuable in a customer facing organisation, or one that wants to be customer friendly. The flow of info from the customer to your business does not always come through the planned channels in this era of online social networking. Much can be gathered via sites like Twitter, Facebook and Google+. Many companies “can’t afford” to monitor these sites, or those who do monitor are not in the right circles. You have a whole workforce that uses these media in their private time though. Utilize this resource.

This approach does require that the management of the company sees the employees as a resource and not just as a cost to be minimised. There are many talented people out there who, given the right opportunity, could shine, even if unexpectedly. The first thought when finding an employee not thriving in their current position should be to see if they could be a better fit somewhere else, now that the organisation has learned their strengths (and weaknesses).

Switching an art in change


When will Cisco move on from the dark ages of the command line and create a graphical interface that can handle all flavours of its hardware? Or is the key to its “popularity” that it requires a specialist to handle it, in such a way that every company of any size has one who de facto becomes the networking specialist and therefore has a say in what is purchased, upholding the status quo?
Other platforms like 3Com could be handled by any admin, thanks to its realisation that we live in a Windows world. The admin didn’t need to become a “networking specialist”, meaning they didn’t need to practise the dark art of programming from the command line, and therefore was not seen as the networking guru. It was a sad day, and the beginning of the end, when 3Com tried to make their interface more like Cisco’s. That is one thing HP should not follow up on after they bought the company.

Switching is in the middle of a revolution with the advance of blade servers. Large companies would previously merge all their standalone switches into large chassis, creating a single unified unit for switching. With the blade, more of the single server connections are handled internally, and only the central part is done by a separate switch. Here there is a task for the server vendors: to have a separate but integrated choice of 10gb switches available. And I am talking copper here. Fiber is vulnerable to kinks over short distances and dirt on the connections, so it is best suited for longer distance communication, like building to building, campus to campus or longer. Within the room or within the floor there is nothing that beats the simplicity and the standardisation of Cat cabling. Though 10gb is not quite there yet when it comes to standards. Special cables for each manufacturer’s equipment are not the way to go if you want your solution to spread wide.

When you do get 10gb in, you have the challenge of utilizing it. And that includes monitoring that you do reach the possible speed. Now we are talking server to storage and racks of other media that would previously have depended on fiber for connectivity above 1gb. Second by second monitoring is required and I can recommend Utilwatch. You’ll be lucky if you see even 2gb/sec utilization, so there is a lot to be gained for hw manufacturers in ramping up the performance of their equipment. You can help by getting ssd disks, discussed in the article “SSD a step towards instant computing”.
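To make the second by second figure concrete, here is a minimal Python sketch of how a utilization number can be derived from two readings of an interface byte counter. It is not how Utilwatch works; the function names are made up for illustration, and it assumes that “gb” above means gigabits and that you already have some way of reading the counter.

```python
def utilisation_gbps(bytes_before: int, bytes_after: int, interval_s: float) -> float:
    """Average throughput in gigabits per second between two byte counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    bits_moved = (bytes_after - bytes_before) * 8
    return bits_moved / interval_s / 1e9


def percent_of_link(gbps: float, link_gbps: float = 10.0) -> float:
    """How much of the nominal link speed the measured throughput represents."""
    return 100.0 * gbps / link_gbps


# Hypothetical sample: 250 MB moved during a one second interval.
sample = utilisation_gbps(0, 250_000_000, 1.0)
print(f"{sample:.1f} gb/s, {percent_of_link(sample):.0f}% of a 10gb link")
```

Sampling every second rather than averaging over minutes is what lets you see whether the link ever gets near its nominal speed.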

Since we mentioned copper versus fiber and iscsi: who let the fiber boys hijack the convention for iscsi node naming? It would have been much more convenient if this had been done to the ip standard rather than the complicated naming concoction of the fiber world. If copy and paste is not your friend, due to 2 separate systems with security between them, you are out with pen and paper to transfer connection data from server to storage and vice versa.
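For readers who have not had to write one down by hand, the common iqn style of node name has the shape iqn.yyyy-mm.reversed-domain:unique-part. The Python sketch below only splits such a name into its pieces to show how much there is to transcribe. The function name and the sample name are invented for illustration, and the pattern is a simplification rather than a full validator.

```python
import re

# Simplified pattern for iqn style names: iqn.<year>-<month>.<reversed domain>[:<unique part>]
IQN_PATTERN = re.compile(r"^iqn\.(\d{4})-(\d{2})\.([^:]+)(?::(.+))?$")


def split_iqn(name: str) -> dict:
    """Split an iqn style node name into its parts (illustrative helper)."""
    match = IQN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an iqn style name: {name!r}")
    year, month, authority, unique = match.groups()
    return {"year": int(year), "month": int(month), "authority": authority, "unique": unique}


# Hypothetical node name using the example.com documentation domain.
print(split_iqn("iqn.2001-04.com.example:storage.disk1"))
```

Compare that with an ip address, which is four short numbers, and the pen and paper complaint is easy to understand.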

Tuesday, 29 November 2011

HW support in a time critical environment

Not long ago hw support on your critical servers meant that when you called the engineer out he arrived with a boot full of parts. This meant that whatever part you thought faulty was changed, and if that didn’t fix the problem he would try a number of other possibilities. The result was that the engineer was fast on site and then able to do the diagnostics and rectification in a single fell swoop. How things have changed, and not for the better.

The callout takes a lot longer to accomplish now. First you might have to talk to the support desk, outsourced to a third world call centre. If your company is English speaking and the support centre’s native language is not, you’ll get by as long as at least one of the parties has it as a primary language. Problems start building when neither of the 2 parties has the common language as their primary.
Next you will have to do a lot of diagnostics to pinpoint exactly the faulty part, because that is the only thing that will be sent to you. And yes, I did say sent, because these days the part comes directly from an outsourced supplier and not with the engineer. Meaning the engineer will want to ensure that the part is onsite before he/she sets out, just so he/she won’t waste any time, as if that was better than wasting yours. Expect to waste 1-2 hours from when the part arrives until the engineer arrives.
If that part was not the only failed item, the process starts over, but this time hopefully helped by the engineer now onsite. Unless he/she decides that the next part is unlikely to arrive within his/her duty hours and sneaks out the back door.
And remember in all this, the contracted max onsite response time often only starts ticking from when the problem has been diagnosed by phone and the part/engineer is being dispatched. This often means that a 4 hour onsite promise becomes several hours of telephone diagnostics, with instructions and files flying back and forth, followed by up to 4 hours for the part/engineer to come to site.
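The arithmetic is trivial, but it is worth writing down when you compare support contracts. A small Python sketch, where the 3 hours of phone diagnostics is an assumed figure for illustration and not a measured one:

```python
def hours_until_onsite(diagnostics_hours: float, contracted_onsite_hours: float) -> float:
    """Worst case time from first call to engineer/part on site, when the
    contract clock only starts once the diagnosis is finished."""
    return diagnostics_hours + contracted_onsite_hours


# Assumed example: 3 hours on the phone before the 4 hour clock starts.
print(hours_until_onsite(3.0, 4.0))  # 7.0
```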

There is also a tendency for hw suppliers to assume that all means of transportation will always be functioning for their distribution. So for rare or just very new types of systems this might mean the missing part has to be flown to the destination. Don’t expect that to happen if another cloud of ash darkens the sky. Or what if your hw is broken due to an event that has stopped air traffic, like 9/11?

Is it not incredible that many hw suppliers have a problem identifying your specific setup every time you call them? Even if that server is the only one you have from that specific manufacturer, you can be sure that every time you call them you have to give them serial numbers and part numbers, instead of them just looking up your company name and saying “yes we can see it here on our system”.
Vendors need to come up with a way for their internal systems to hold the customer’s own name for each system. This needs to be part of after sales, a much neglected area. For many hw vendors there is no such thing as “after sales”. It is completely handled by support, and they are reactive, meaning they only kick in when a problem occurs and the customer contacts them. Somewhere in between there needs to be something extra. And outsourcing it to an agent does not work. They only get paid for sales, and won’t be directly affected if support has issues.

Monday, 28 November 2011

Cooling in a damp climate

On the other hand you have the problem of cooling such a concentrated hotspot. And air conditioning is not among the most stable of devices. Your indoor environment is sensitive to the smallest bit of sunlight and the outdoor units are very vulnerable altogether. They end up spending a lot of time de-icing, so see to it that your runoff is adequate. That can be a problem when your fire extinguishing system needs a completely sealed room, and your insurance needs it to be pressure tested.
A cool but humid climate is not always the best for a datacenter. Yes, you need to run your airconditioning slightly less, but you get a lot more de-icing issues. This is one of the reasons reverse cycle airconditioning for home heating never gained popularity in Ireland, compared to colder but much drier climates like Scandinavia. If you have a weather station with a separate outdoor unit you will know what I mean. It spends a lot of the time showing a humidity error because of very high values.

Underfloor cooling was meant for network racks where the passage through the rack is unobstructed due to the shortness of the equipment. Full length servers block the flow of air through the rack, so it’s better to give them cold air at the front and remove the hot air from the back of the rack. This way you create cold aisles in front of racks and hot aisles behind racks. If you have several rows of racks this does require that every second row is turned the opposite way, so that one server’s hot exhaust does not become another’s cooling air intake.
A downside of hot and cold aisles is that where you are most likely to work, at the front where the console is, is also the place where there is a constant cold draft. You could alternatively place the consoles at the back, which also eases the cabling. These days it’s more normal to remote control the whole room, so there is little need for direct human access. And you could also increase the general temperature of the room slightly: rather than set it at 19c you could experiment with 22c.

Few will run their cooling via the ups, due to the large power demands and the resulting shortening of the ups running period at the time of a grid failure. If your computer room has generator backup, you will need to restart your cooling with the generator. Lack of cooling will make your equipment’s internal fans increase in speed as the room temperature goes up, eventually overloading fuses and cables.
You can temporarily rectify the situation by pumping cold air in from the outside, or by redistributing the air already in the room better with a dedicated fan and an extendable duct, easily and cheaply bought from a hardware store like MachineMart.
  
Due to the vulnerable nature of airconditioning you will need to overdimension. You should have at least enough that 1/3 of the cooling capacity can be offline for maintenance while you are still able to keep the temperature within range.
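As a rough illustration of what that one third rule means for sizing, here is a small Python sketch. The heat load and unit sizes are invented numbers, the function name is hypothetical, and it looks at capacity only: in practice you lose whole units, not fractions of one, so check the result against a real maintenance scenario.

```python
import math
from fractions import Fraction


def units_needed(heat_load_kw: int, unit_capacity_kw: int,
                 offline_fraction: Fraction = Fraction(1, 3)) -> int:
    """Number of cooling units to install so that the heat load is still
    covered while the given fraction of total capacity is offline."""
    required_installed_kw = Fraction(heat_load_kw) / (1 - offline_fraction)
    return math.ceil(required_installed_kw / unit_capacity_kw)


# Hypothetical room: 20 kW of heat to remove, 10 kW per cooling unit.
# Installed capacity must be 20 / (2/3) = 30 kW, so 3 units.
print(units_needed(20, 10))  # 3
```

Put another way, holding a third in reserve means installing one and a half times the cooling your heat load actually needs.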
It can sometimes be difficult to spot a failing airconditioning unit. Simple filter or other error messages on the control panel are mostly self explanatory, but sometimes you have a rise in temperature without any message. Check that the air it blows out is actually cold. Sometimes a unit keeps on running but just blows out the same air at the same temperature as it went in, especially if the outdoor part of the unit has failed.

I will again point out the importance of an environment monitor. They are relatively cheap for what they protect, and the same one that monitors your power can also monitor the room temperature. Place sensors in several different positions, since it’s highly unlikely that the temperature is uniform across the whole room. And single failures can result in hotspots.
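The point about several sensors can be sketched in a few lines of Python. This is not the interface of any particular monitor; the sensor names, the readings and the 27c alarm limit are all assumptions made for the example.

```python
def hotspots(readings_c: dict[str, float], limit_c: float = 27.0) -> list[str]:
    """Return the names of the sensors reading above the alarm limit."""
    return sorted(name for name, temp in readings_c.items() if temp > limit_c)


# Hypothetical readings from four sensor positions in the same room.
readings = {
    "front of rack 1": 21.5,
    "rear of rack 1": 33.0,
    "ceiling": 28.5,
    "door": 19.0,
}
print(hotspots(readings))  # ['ceiling', 'rear of rack 1']
```

A single sensor by the door would have reported 19c and told you nothing about the hotspot behind the rack.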

Sunday, 27 November 2011

Explosion in power needs

In recent years there has been an explosion in the power requirement per rack. Not long ago you got 2*16amp sockets, for a and b side, and that was it. And it was like that for 10 years. Then came the higher density of blades, where 16 servers could now fit in a space previously populated by 10 or sometimes just 5. On top of that each server would have more cores, and each chassis would have to have psu’s to cater for its top spec. Pretty fast you are requiring more like 4*32amp per 10u, and fuses were tripping all over the place.
Yes, you can power manage by limiting the power each server and chassis can use, but then you can never run at your top capacity, so why did you buy it? You will also have startup issues if you have total power failures.
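To put rough numbers on that jump, a quick Python sketch. It simply multiplies sockets, amps and volts at a nominal 220v, and ignores power factor, derating and the fact that a and b sides are normally there for redundancy rather than extra capacity.

```python
def feed_capacity_watts(sockets: int, amps_per_socket: int, volts: int = 220) -> int:
    """Nominal power available from a set of sockets (no derating applied)."""
    return sockets * amps_per_socket * volts


old_rack = feed_capacity_watts(2, 16)   # the traditional 2*16amp rack feed
blade_10u = feed_capacity_watts(4, 32)  # 4*32amp for 10u of blades
print(old_rack, blade_10u, blade_10u / old_rack)  # 7040 28160 4.0
```

So 10u of blades can ask for four times what used to feed a whole rack.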

For security against the frequent failures or just scheduled maintenance of the normal power grid, most companies with in-house servers have some form of ups system. Here the problem is that they seldom last for more than 10 minutes, or 20 if you are lucky. They will be based on batteries, and batteries are not a good way of storing any significant amount of power when it comes to appliances that use large amounts at 220v.
And what can you do in, let’s say, 15 minutes? It’s hardly enough time for an admin to shut down the most essential databases. (Oracle does not enjoy a sudden and complete loss of power.) Most will use the best part of that time to trigger the alert. Here an environment monitor like Avtech is worth its weight in gold for fast sms notification.
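A simple way to see how tight the time is: subtract the time it takes before anybody reacts, then see how far down your shutdown list you get. The Python sketch below does just that. The step names and all the durations are invented for illustration.

```python
def steps_that_fit(ups_minutes: float, alert_delay_minutes: float,
                   steps: list[tuple[str, float]]) -> list[str]:
    """Return the shutdown steps, taken in priority order, that can be
    completed before the ups runs out."""
    remaining = ups_minutes - alert_delay_minutes
    completed = []
    for name, minutes in steps:
        if minutes > remaining:
            break  # this step and everything after it will not finish in time
        completed.append(name)
        remaining -= minutes
    return completed


# Hypothetical plan: 15 minutes of ups, 8 of them gone before an admin is at the keyboard.
plan = [("stop application servers", 2), ("shut down databases", 4), ("power off hosts", 3)]
print(steps_that_fit(15, 8, plan))  # ['stop application servers', 'shut down databases']
```

Every minute shaved off the alert delay is a minute more for the shutdown itself, which is the case for fast sms notification.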
  
Most companies above a certain size will back up their ups with a generator. And I do say “a” because very few, besides dedicated data centres that offer services to third parties, have more than 1. What they forget is that a generator is more like a car. How sure are you that your car will start first time after standing idle for a few weeks? Regular testing is required, but most generators stand around for many years, so now we are talking about a 20 year old car. Yes, it doesn’t have much mileage, but that is not always a good thing. Diesels like to be run.
If you try to solve this with a second generator you are in for a very complicated and vulnerable failover system, to ensure that every part is redundant. And somewhere in the middle there will be some sort of vulnerable failover switch. Remember also that you don’t want to make it so complicated that it induces more risk than what you were guarding against.

You could try to get a second grid supply but in most places you will find that an actual physical separation on the supply side is nearly impossible. Competition just hasn’t got that far.  You will also run into the same problem as for a second generator, how to feed power from 2 sources.

Saturday, 26 November 2011

SSD a step towards instant computing

Ever since I first started working on optimizing server performance I have felt that the ultimate goal is instant computing, where I define instant as no delay perceptible to the user from request to result. Unfortunately few suppliers have set such, for the outside observer, quite natural goals. They are usually just happy with a bit faster than last year or a bit faster than the competitor. So you will run into a load of configuration limits for system parameters that haven’t kept up with the explosion in hw possibilities combined with the lowering of price/performance.

As soon as you overcome one bottleneck it’s on to the next one. Part of this quest has been to get as much of the data into memory as possible to overcome the slowness of traditional spinning platter disks. With the arrival of ssd’s I thought we could be close to this goal. And for something like email searches in Outlook it’s close. If you have a few thousand emails, searches take an age because they run against the cache your Windows pc stores locally. If you use an ssd disk in your laptop/pc it’s down from minutes, and sometimes hours, to seconds. The greatest leap ever, but so little appreciated that even Dell stopped (for a while) offering ssd’s as an option even on their high end pc’s.

The greatest gain for servers is obviously where there is a high frequency of ever changing data, like database logs. Unfortunately that is also the one area where the recommendation is not to use them, due to the ssd’s limit on total rewrites. There is work going on to automatically exclude areas that near this limit, though not fast enough for some who reached it, with total failure of whole disk shelves as a result. This write limit should also be a consideration for san manufacturers that automate which type of disk the different types of data are stored on depending on their frequency of access. Maybe one should just take the penalty and routinely change out the disks about every 18 months. An easy task with proper raiding. And if you went with the cheaper server type or medium sized storage ssd’s instead of the super san = super expensive ones, it is still a cost effective way.
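Whether a routine swap interval like that is safe for a given workload can be estimated with back of the envelope arithmetic. The Python sketch below uses the usual rough model (capacity times rated rewrite cycles, divided by what is actually written per day including write amplification). Every number in it is an assumption for illustration, not a figure for any real disk.

```python
def ssd_lifetime_years(capacity_gb: float, rewrite_cycles: int,
                       daily_writes_gb: float, write_amplification: float = 2.0) -> float:
    """Rough estimate of years until the rated rewrite limit is reached."""
    total_writable_gb = capacity_gb * rewrite_cycles
    written_per_year_gb = daily_writes_gb * write_amplification * 365
    return total_writable_gb / written_per_year_gb


# Hypothetical log disk: 200 GB, rated for 3000 rewrites, 500 GB of log writes per day.
years = ssd_lifetime_years(200, 3000, 500)
print(f"about {years:.1f} years")  # about 1.6 years
```

With these made up figures the disk wears out in well under two years, so for a busy log volume a fixed replacement schedule is not as wasteful as it first sounds.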

Aside from that log versus max total writes anomaly, databases have much to gain from ssd’s. Especially those so large that they can’t all be sucked into ram, or where there is a high frequency of updates and where one, for security reasons, prefers synchronous writes to asynchronous.

Server internal ssd’s are actually an alternative for servers that were previously optimised by utilizing the caching ram of an external storage unit. This way you save considerably on your next system hw upgrade.

Friday, 25 November 2011

Backups, art that needs reinvention

Some of the articles you see about data lost in the cloud are beyond belief. There is no excuse for losing data that was stored more than 24 hours before the problem happened. Most storage users will have a few snapshots and a dr tested way of restoring them. The problem comes when you go beyond the snapshot that is still on disk. Backup of snapshots to other media is still in its infancy. The most prominent of backup solutions just don’t have it in them. And I have seen virtual server systems presented as complete solutions without a thought for how to get the data back if the thing burned down or, currently more likely, was drowned in a flood. There is a job here for a specialist in deduping, with the added flavour of a couple of extra copies.

There is a tendency not to treat virtual servers as real servers. Of course you can restore all the physical servers. But what about the virtual ones? With dormant or little used virtual servers, a lot of them can fit on a few physical hosts. But the total data can still be the same as if each server was a separate physical one. If you haven’t backed it all up, you need at a minimum to have a definite restorable master and a record of all the steps taken to create each one.
Nor should we forget the data people carry around with them. Laptops get ever more capable, most now more powerful than servers were 4 years ago, and developers like to have it all at hand. A very important part of that time critical project might have its only copy on a thing thrown hither and thither every morning and evening. This is greatly encouraged by the cheap developer tools licensing we see emerging as a teaser to get more people onboard. And developers never were the first to think about what happens when things go wrong, or whether that online storage deal included a quantifiable and guaranteed backup/restore.

Often the issue is that it takes a long time for a user to discover that their data is actually no longer there. Today even the smallest of users can have thousands of files. And since nobody learns about file systems and folders any longer, they never see them except when they need them. It can take months or years if the files are only used at the annual budget time or multi yearly planning stage. For that amount of data/iterations it is/was often uneconomical to store it all on disks. Besides, your auditor probably still loves the tape.

We also have the fast pace of the technology. A much used refresh cycle is 3 to 4 years, due to the rapid rise in hardware support costs after the initial contracted support period. But the requirement is that financial data is to be stored for 7 years. Ask your IT department if they can restore a 7 year old backup for you. Even if they have the tapes, do they have the drives to restore them with, or the system to restore them on to? Not such a large problem if the software system is still in use and the data is stored in a database. Databases are easy to migrate with the hardware refresh, as long as you haven’t segregated out too much of the old to fit the new. Still, you can always add some more modern storage to get that data back in, if you planned for that eventuality in the first place.
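The mismatch is easy to quantify. A small Python sketch, counting how many hardware refreshes a backup has to survive before its retention period runs out (whole years only, and the function name is made up for the example):

```python
def refreshes_to_survive(retention_years: int, refresh_cycle_years: int) -> int:
    """Number of hardware refreshes that happen before the retention period ends."""
    return (retention_years - 1) // refresh_cycle_years


print(refreshes_to_survive(7, 3))  # 2: refreshes at year 3 and year 6
print(refreshes_to_survive(7, 4))  # 1: refresh at year 4
```

So a backup taken today has to stay restorable across one or two complete changes of hardware, which is why the drives and the target system matter as much as the tapes.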

Relational databases – a quick look at flavours strengths and weaknesses

Let’s start with the master of them all, Oracle. It’s the db with all the tools and tweaks, and it scales well. If the price was right this is the one most would or should pick, however it seldom is. Oracle never followed the development in the processor, where each core gets weaker but you get a lot more of them. Hence their penchant for charging per core and their customers’ liking for the HP Itanium processor.
Oracle is so advanced that it’s more like an operating system in itself, and you need to take your patching seriously. Also be into your file system details. Play with the config files, there is a lot to be gained. It’s a pity the 3 defaults of small, medium and large are not more up to modern standards. Proper backups are essential.
Oracle does not like the loss of any of its data. And since a lot of performance can be gained from running it raw, a simple file system backup won’t do the job. You need to learn about dumps like dd, and have them done in the correct order. Exports are also very important. In addition to being a secondary way of doing backups, they can also give you a lot of hints on fragmentation and proper sizing. Don’t forget either to have multiple control files in several separate locations.
It’s the one db where you really can’t live without a support contract from the mothership. And if you have a set of the printed manuals, they will be from a previous version but they are worth their weight in gold, and 95% of them are still applicable. Read all about its system tables. There is a lot to be gained here. For standardisation and easy admin to admin transfer, have a look at the old OFA manual.

Its nearest competitor as a multi os db is Sybase, now owned by SAP. A brilliantly designed but more simplistic model. However you’ll have problems getting more than 1 installation (version) onto a single server. Instead it uses what they call user databases. This requires strict discipline as an admin so you know which one you are in. But organizing the file storage and backups is a lot simpler.
Its penchant for “go” is not as good as Oracle’s execute command, and its method of dumping output to file is archaic. Like Oracle it’s very sensitive to playing with the kernel settings on unix/linux. Most of its performance is to be gained here, in addition to, as with most relationals, a good scheduled reindexing. A good set of Sybase’s own manuals will go a long way for your support needs.

Mssql could have delivered the knockout to the other db’s if it hadn’t had such a scaling problem. It depends on a single server, and Windows on top of that, and can only scale upwards at the speed of the hardware development. Windows Datacenter is an option, but due to its obscurity and odd Microsoft rules on deployment, Windows Enterprise is really your option. And then we are back to this processor thing again. It is a database that most admins can manage though, even without the scarce manuals. They might not utilize all its potential, but any Windows admin can make it run. Just give them a few hints or a small course on simple housekeeping like dumps and scheduled reindexing/reorg.
Mssql’s testing/analyzing tool is very good, but it’s not as handy as Oracle’s command line “desc” for analyzing single sql queries. However it does give you a nice way of presenting your findings.

Adabas, owned by the German Software AG, is a story of what could have been. Popular among some German companies/developers, it never reached the popularity of Sybase. If you have seen it, it’s probably because you had a system from a German company that was based on it. Very simple to manage, you don’t even need a manual to start, stop and back up this one. Low cost and flexible. It’s ripe for a large multinational to take it over. Somebody with a long reach, belief and the financial muscle to push it into the limelight.

Mysql is the developers’ favourite due to their perception that it’s “free”. Now owned by Oracle. There is no such thing as a “free lunch” however. What you don’t pay for the software itself you will, due to its popularity among specialists, pay for in admins. I recommend testing your restores frequently, especially when it comes to getting back the last data entered. It is ripe for an organized set of admin tools. Oracle has a long way to go, and lots of opportunity for add-on profit making.

A problem with all relationals is that they are good for adding and picking/filtering small amounts of data and creating automation for repeated actions, but when you reach a certain level of reads needed it’s better to forget about the indexing. When that happens the old ways are better. The db’s security against data loss also makes them vulnerable to slowdown from locks and errors from deadlocks. This is why, if you have to read all the inputs/data, it’s faster to use the file system directly for your (interim) storage, without the overhead. Many large players do.

Prioritising when everybody is screaming

We have all been there, just one of those days when things go wrong. And Sod’s law says they all come together. Now you need to prioritise. If you have done your preparations this should be easy in your head. You have your list of systems in prioritised order, as a result of their importance to the company and the immediacy of the effect of downtime, adjusted for top management priority and current interest.
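If the list lives in your head it is worth getting it onto paper, or into a few lines of code. A minimal Python sketch of such a list follows. The field names, the scoring and the example systems are all invented for illustration. The point is only that the order follows from importance, how fast downtime hurts, and a management adjustment, and not from who shouts loudest.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    importance: int            # 1 = most important to the company
    hours_until_impact: float  # how soon downtime starts to hurt
    management_boost: int = 0  # adjustment for top management priority / current interest


def restore_order(systems: list[System]) -> list[str]:
    """System names in the order they should be worked on."""
    ranked = sorted(systems, key=lambda s: (s.importance - s.management_boost, s.hours_until_impact))
    return [s.name for s in ranked]


# Hypothetical systems.
systems = [
    System("payroll", importance=2, hours_until_impact=72.0),
    System("order entry", importance=1, hours_until_impact=0.5),
    System("email", importance=1, hours_until_impact=4.0),
    System("intranet", importance=3, hours_until_impact=24.0, management_boost=2),
]
print(restore_order(systems))  # ['order entry', 'email', 'intranet', 'payroll']
```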

Let the admins get on with the job, see if they need additional outside help, and keep the top brass away. Most fixes are based on the rolling half hour: it will take half an hour to know if this will work before we discover the next issue, or try the next possibility. Be ready to run parallel avenues or make sharp decisions. If your systems are properly admin’d / backed up there is usually a fast way and a slow way to restore. The problem is, when do you abandon the fast way and go with the more secure but slow one?

Avoid falling into the trap of helping the biggest nuisance user first, or the director’s darling. The dangerous complainer is the one who says nothing to your face but complains without you knowing, and without you having a chance to respond. And their argument will stand if their system should have had priority. Many systems are important but can take a certain amount of downtime. How many accountants do you see outside hours or at the weekend, outside of the budget and reporting cycles? Use your urgency/dependency listing from the DR plan to help you. Think of when they normally schedule upgrades. You will also thank yourself for not storing all the eggs in one basket, or all the data on the same san.

Recruiting, hints for managers out hiring

First find out what you are looking for. If you are hiring a manager, are you sure you want the same as the last one? He/she might have worked out well, but their field speciality is probably well covered for now. Maybe the next one should be slightly different.

Time from announcement/advertisement to first interview should be as short as possible, especially if you recruit for tech jobs where there are many companies looking for the same candidates. How many times has your company lost out on a candidate who is no longer available? How many have turned down an offer of interview? If the number is high you need to look at your process again.

If your plan is for more than 2 rounds of interviews, your recruiting is not as efficient as it could be and you are likely to lose out on the best candidates. Did you weed out enough of the chaff by reading the cv’s thoroughly, or are you wasting everyone’s time by just skimming them for the first time at the interview? Large multinationals are big sinners in having many rounds of interviews. Is that a sign of too many corporate layers = bureaucrazy? Are there too many people involved in your decision process? However, if you the manager aren’t technical, bring an expert from your team. It gives them the chance to meet their potential future colleague.

When doing the interview, do you do the “take me through your cv” thing? That means you have to fit it to the job. An alternative approach could be “take me through samples of your experience for each of the requirements in our job description”. This will give the candidate the opportunity to bring in more relevant stuff, and will let you see if they can translate their earlier experience to their new tasks.

Do you use a technical test already at the first interview? It will help you confirm your initial opinion, and you can have more experts evaluating it, covering a larger technical field. There is nothing wrong with programming on paper. Many universities still use it for their exams. Personally I don’t see any problem with manuals, helping aids or mobile phones either. If they can get assistance at the test, they can get it in their work, and what you really want is somebody who can complete the job.
There is nothing wrong with testing for all the nice to haves as well. Remember one thing: if the candidate knew the answer to all the test questions, the test wasn’t hard enough.

When you have done a few interviews you know the dos and don’ts. Bring personnel in on the second round, reducing the times you have to wait for their availability. They don’t really need to meet anyone who isn’t to be hired. One of the biggest delays can be organizing a time that suits everybody. Therefore bring as few interviewers as possible, but always a minimum of 1 other for legal reasons. It also gives you thinking time. If you are not the final decision maker, think about whether you yourself need to see the candidate more than once, and leave the final interview to the decision maker and personnel. I would suggest just 2 candidates for second rounds, just so there is a choice. And you don’t have to send forward anyone you can’t live with yourself.

If you are unsure of a candidate, or rather you think there could be a better candidate out there whose cv you haven’t seen yet, and the potential candidate is free on the market at the moment, take a chance. That’s what probation is for. It takes less ruthlessness to trial somebody currently out of work than somebody who has to quit their current job. And your deal with the recruitment agency should always include a step down ladder in fee if a candidate is later found not suitable.

Lastly, give a thought to all who were unsuccessful. It won’t cost you much to tell them, but it will mean a lot to them to know.