Sunday 4 December 2011

When did the full dump ever help?

Most advanced systems will automatically dump their memory, or prompt you to take a dump, if they recognise that a failure has happened.  Some OS vendors also love the dump.  The problem is that if the system had been able to recognise the cause of the crash, it wouldn’t have crashed in the first place.  A dump is usually just a snapshot of what is in memory at that exact moment and tells you little about the activity leading up to the problem.
If a system is able to recognise a crash situation, it means the vendor knew when building it that this could happen and included a way of logging it.  But if they had known it could happen, they would have put in measures to prevent it in the first place.  Most crashes are due to unforeseen circumstances and can therefore not be logged.

A downside of dumps is that they are usually very large and take a long time to extract.  Even if you extract them successfully, the tools to analyse them are either long-winded or produce results that are difficult to interpret.  This means you usually have to upload them to an OS or hardware supplier’s site.  Internet connections have grown a lot in capacity, but the amount of data in these dumps means you will still be clogging a link for a long time.
 
After all that, 99% of the time you will get back: nothing found.  I would argue that it is a lot better to do targeted log extracts.  As an admin you will usually have an idea of where the problem lies.  Work with the developers or suppliers of the application you run to find the best tool for logging what is going on.  Then play with the parameters of the logging tool while you put load on your system.  This may take some provocation, like artificially increasing the load or reducing the capacity of the system.  That is easy if you have a multi-computer system: turn off some of the resources.  But even on a single system you can, for example, limit the number of processors used or run up an additional load (it can come from an additional dummy program).
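
As a rough sketch of what a targeted log extract under provoked load could look like, the Python below samples one probe at a fixed interval and includes a dummy load that spins a single processor.  The names sample_probe, burn_cpu and format_log are made up for illustration, and the probe stands in for whatever your own logging tool or application can report; nothing here comes from a particular vendor’s tool.

```python
import time
from typing import Callable, List, Tuple


def sample_probe(probe: Callable[[], float], samples: int,
                 interval_s: float) -> List[Tuple[float, float]]:
    """Call probe() `samples` times, `interval_s` apart, keeping timestamped readings."""
    readings = []
    for i in range(samples):
        readings.append((time.time(), probe()))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


def burn_cpu(seconds: float) -> int:
    """Dummy load: keep one processor busy for roughly `seconds`; returns the loop count."""
    deadline = time.monotonic() + seconds
    loops = 0
    while time.monotonic() < deadline:
        loops += 1
    return loops


def format_log(readings: List[Tuple[float, float]]) -> List[str]:
    """Turn readings into short log lines that can be pasted into an incident log."""
    return [
        f"{time.strftime('%H:%M:%S', time.localtime(ts))} value={value:.2f}"
        for ts, value in readings
    ]
```

On a Unix-like system the probe could be as simple as `lambda: os.getloadavg()[0]`, with burn_cpu started from a second terminal to provoke the problem.  To load more than one processor you would start several copies of the dummy program.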

There can be many different causes of system crashes and malfunctions.  I have experienced, amongst others, missing non-public patches, a bent processor pin, bad programming and the reaching of system limits.  The last of these can occur, and have occurred, in the OS, database, application and hardware.  What I haven’t experienced is any of them being diagnosed correctly, and the solution found, from a full system/memory dump.

Friday 2 December 2011

Danger of overcomplicating

Today there are many add-ons to operating systems and databases that will keep systems running or automatically fail them over if a problem is detected.  Well and good as long as everything runs according to plan, which of course it never does.
What of the undetectable problem?  There is no reason a well-supplied and well-administered system should fail for a known issue (unless that issue has been kept secret on the supplier’s side).  If the issue is known it should have been patched.  But there are many possibilities for issues.  Very few systems are exactly the same, due to the manual ways of installing a system and the many permutations possible when it comes to server, storage, networking, OS, database, application, add-ons and the patching of them all.  This usually means that a change can at any time lead to an unexpected event.  The only way of fully allowing for this is to have another, unchanged system, just in case.

Do not fall into the trap of creating more problems than you guard against, resulting in more downtime rather than less.  What these automatic add-ons do is add complication: more layers of things that can go wrong.  There is a lot to be said for the old manual failovers or restarts, as long as there is a 24x7 human interface in place.  Yes, they had a time delay in data replication, but this could be controlled by you, the admin, down to a level acceptable to the company.  Most can live with that if it means higher security of the system, that is, less dependency on the “no system available” manual routine, and greater certainty about the maximum downtime.

Often the fastest resolution is a quick reboot, or a failover to a completely independent system running a bit behind the main system; this can be caught in a non-failed state.  If you build systems that can fail over automatically, you often have, for example, the databases running exactly in sync.  This can lead to both databases having the same error.  You can also have problems with the failover process itself, and a worst-case scenario is that both systems end up in a hung state.
Not that a manual secondary system is any guarantee.  It requires strict discipline from the admin to see to it that it is fully updated to a runnable state.  Regular testing will be required, and I would recommend regularly doing planned switches between the live and the standby.  This is to ensure that both are in a production-capable state when you need them.
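
That discipline can be backed by a small routine check.  The sketch below is one possible way of doing it, assuming you can find out when the standby last applied data from the live system and when you last did a planned switch; the function name and both limits are hypothetical and would be set to whatever the company finds acceptable.

```python
from datetime import datetime, timedelta
from typing import List


def standby_problems(now: datetime,
                     last_applied: datetime,
                     last_switch_test: datetime,
                     max_lag: timedelta,
                     max_test_age: timedelta) -> List[str]:
    """Return a list of reasons the standby should not be trusted (empty list = fine)."""
    problems = []
    lag = now - last_applied
    if lag > max_lag:
        problems.append(f"standby is {lag} behind live, limit is {max_lag}")
    test_age = now - last_switch_test
    if test_age > max_test_age:
        problems.append(f"last planned switch was {test_age} ago, limit is {max_test_age}")
    return problems
```

An empty list means the standby is within the agreed delay and has been switched to recently enough.  It says nothing about whether the application on it actually works, which is why the planned switches still have to be done for real.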

Automation and full synchronisation can give problems at the time of upgrade or patching.  How do you patch in such a way that at least one full solution remains available in a pre-patched state, and stays that way until you know that the patch isn’t going to cause any issues?  What do you fall back to if you upgrade your live and your standby as one?

Thursday 1 December 2011

Decision making and the art of management

You are a manager: take that decision and live with it.  The advantage is that if you select door A over door B, nobody will know what would have happened if door B had been selected, as long as you see your decision through to fruition.

In all decision making there is a bit of “no risk no rice”.  You make your choices and take your chances.  But see to it that you get a result.  Nobody can know for sure that the other choice would have been better, as long as you make your choice do the job.  Abandoned projects in the technology sector are a sign of a lack of work at the pre-project stage, unless you are engaged in R&D.  There is a reason it’s called the bleeding edge.  There is no reason why a company where IT supports the primary business, rather than being the primary business, should need to be on it.
That said, a successful new way of using IT in a business can give you a competitive edge, but the risk needs to be managed and the potential cost known upfront.

If you want good admins, enable them to make some decisions.  But always remember that you can delegate the decision but not the responsibility for it.  Train them in your way of thinking so they know the direction you would have taken, and are therefore most likely to approve of.

There is nothing wrong with a healthy discussion of alternatives, if there is time.  Hearing somebody else’s view could increase your own knowledge and let you see possibilities you might not yet have thought of.  If you are sceptical, play devil’s advocate and imagine the worst possible scenarios to see if they have a solution for those too.

The worst thing I see is management bringing in consultants to make the decision for them.  This is a popular approach in government and the bureaucrazy.  If management can’t make decisions, maybe that is just where the problem lies.  It leads to a lack of accountability, and accountability is a very important part of management.

Customer support: an asset or a burden

Is there a point to ignoring customer support if it can be done for negligible additional cost?  Many companies see customer support as a way of retaining previous customers so they come back for more.  Others see it as a legal requirement that must be borne.  Some see it as a way of making their business stand out from the crowd.  Others see it as nothing but a cost.

Support can be done for little or no money.  A FAQ on your website is the easiest example.  It takes little time to assemble a list of likely questions and answers about your product.  Other forms are more resource-dependent, like having somebody actually answer the questions that customers send in, but this can be made profitable by making the customer pay extra for the privilege, for example through support contracts or a call centre with a premium phone line.  Selling insurance can also be a way of taking payment for your support, or you can outsource it to a third party.  A modern way in the internet age is to facilitate a live question-and-answer page where the answers are provided by third-party agents who finance their time by leading you to additional paid-for ads or services.

Some forms of support can be seen as profit-reducing.  On the shadier side of business, an example is telling potential customers how to avoid the built-in pitfalls in the purchasing process so they can reach the best bargains.  Here there is a fine line between naturally occurring issues with the purchasing process and deliberately engineered profit-making problems.  Alternatively, issues are simply not prioritised for fixing when they are discovered.  Luckily for your customers, if your business is very large there are third parties at hand, let’s call them agents or facilitators, that will help them overcome or bypass the issues, for a fee of course.  Thanks to the search engine, many websites can also be found that will help a frustrated customer.  If you are a business owner you will have your work cut out finding them and, let’s say, ensuring that they are corrected.

Is no customer support a bad thing?  Not necessarily, if you have a unique selling point that brings customers back to you regardless.  For example, if you are a monopolist, those who want your product have no choice but to buy it from you.  This is one of the reasons monopolies are frowned upon legally.  They take a lot of time and effort to police to avoid questionable profiteering.

Importance of monitoring what you have outsourced

Your outsourced system is never as important to your supplier as it is to you.  Most contracts have check intervals counted in minutes, and by the time the set number of alarms has been triggered (to avoid false positives) and an operator has been alerted, 15 minutes can easily have gone, and 30 minutes or more before anybody takes it in hand.  Since you squeezed the price you pay for the service down to the absolute minimum, the agreed penalty is seldom in proportion to what the outage means financially to your organisation.
Another reason for doing your own monitoring is that it will give you the unmasked truth.  Do you trust your supplier to always tell you what’s going on?  Are their answers at times vague or slow in coming?
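
The arithmetic behind those 15 and 30 minutes is simple enough to sketch.  The check interval, the number of failed checks needed and the pickup time below are assumed figures for illustration, not taken from any real contract, and the function names are made up.

```python
def minutes_to_alert(check_interval_min: float, failed_checks_needed: int) -> float:
    """Rough time from failure until an operator is alerted.

    The supplier only raises the alarm after several failed checks in a row,
    so the delay is about the check interval times that number.
    """
    return check_interval_min * failed_checks_needed


def minutes_to_action(check_interval_min: float, failed_checks_needed: int,
                      pickup_min: float) -> float:
    """Rough time from failure until somebody actually takes it in hand."""
    return minutes_to_alert(check_interval_min, failed_checks_needed) + pickup_min


# Assumed example: a check every 5 minutes, 3 failed checks before the alarm,
# and another 15 minutes before an operator picks it up.
alert_delay = minutes_to_alert(5, 3)        # 15 minutes
action_delay = minutes_to_action(5, 3, 15)  # 30 minutes
```

Put your own contract’s figures in and compare the result with what a minute of outage costs your organisation.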

The easiest way to see traffic is by network monitoring.  A simple network graph from a tool like Utilwatch will give you second-by-second information, and can run on the cheapest, oldest PC you have.  If it’s running in the background but within your field of vision, you will immediately know if something is amiss.  Experience lets you interpret the data better.  You can also create easy traffic lights via simple scripts.
Cheap second-by-second tools do, however, seldom store the data.  They are WYSIWYG: an on-screen display of the current state only.  You seldom need to store this much data though.  The interpretation depends on other factors at the time.  Did you start or stop something?  Were your web caches reloading?  Was the blip due to scheduled maintenance?  A simple screenshot will capture the moment for later inclusion in a manual log together with comments.
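
A traffic light of the kind mentioned above can be a very small script.  This is only a sketch: it assumes you already have a current traffic figure and a figure for what is normal at that time of day, and the thresholds of 50% and 10% of normal are arbitrary values picked for illustration.

```python
def traffic_light(current_kbps: float, baseline_kbps: float,
                  amber_ratio: float = 0.5, red_ratio: float = 0.1) -> str:
    """Compare current traffic with the normal level and return a colour.

    RED   : traffic is at or below red_ratio of normal (probably an outage)
    AMBER : traffic is at or below amber_ratio of normal (worth a look)
    GREEN : anything above that
    """
    if baseline_kbps <= 0:
        raise ValueError("baseline_kbps must be positive")
    ratio = current_kbps / baseline_kbps
    if ratio <= red_ratio:
        return "RED"
    if ratio <= amber_ratio:
        return "AMBER"
    return "GREEN"
```

This version only reacts to traffic dropping.  An unexpected spike would still show green, so a real script would probably want an upper limit as well.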

There are also many tools that let you set up triggers and alarms to your own liking.  I would pick at least one that isn’t from the supplier of what you are trying to monitor.  If the supplier knew how to monitor it and trigger the alarm, they would, or should, have fixed the problem in the first place.
Some, like ipmonitor, are also cross-platform and store the history of previous alarms if configured correctly.  If your urgency is lower, and/or your problem occurs outside normal hours, tools like Cacti will give you a view of last night’s, week’s or month’s proceedings.
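
To show what a trigger with a history boils down to, here is a minimal sketch.  It is not how ipmonitor, Cacti or any other product works internally; the class name and the default of three failed checks are assumptions made for the example.

```python
from typing import List


class AlarmTrigger:
    """Raise an alarm after a set number of failed checks in a row, and remember when."""

    def __init__(self, failures_needed: int = 3) -> None:
        self.failures_needed = failures_needed
        self.consecutive_failures = 0
        self.history: List[float] = []  # timestamps of earlier alarms

    def record(self, timestamp: float, ok: bool) -> bool:
        """Record one check result; return True only on the check that raises the alarm."""
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures == self.failures_needed:
            self.history.append(timestamp)
            return True
        return False
```

Because you own the trigger, you decide the trade-off yourself: a lower failures_needed gives a faster alarm but more false positives, and your own checks can run every few seconds rather than every few minutes.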

If you don’t feel like spending time or effort on monitoring yourself, but still see the value of an outside eye on your hosting, network or resource provider, there are many third-party suppliers that will happily let you try their monitoring services before you buy.  But then you are back to a 15-30 minute response instead of seconds again.