I’ve often heard of organisations saying they’re going to “try Hadoop”. It’s great to explore what it can do for your business, but Hadoop is one of those technologies where, if you don’t commit time and resources to it, it will fail. I can virtually guarantee that success won’t come easily. Don’t let that deter you from jumping in, but do read on for some helpful hints.
One experience worth recounting was a recent customer project where we identified and explored multiple data types, using Hadoop, Aster and Teradata to produce insights. The first step in this project was to build a Hadoop ecosystem. Easy, one might say, but I’ll list below some home truths you’ll need to factor in.
1. OS build
Of course, in the Hadoop world you have a wide choice of Linux distros (or even Windows) to deploy on. But you really should deploy on the platform where your staff have the strongest skills. There are subtle differences in command-line instructions between distros, so choosing the right one is more about helping your staff deploy more smoothly and manage more easily than about any technical consideration.
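To illustrate those subtle differences, here is a hypothetical sketch of the same tasks on two distro families (package and service names are illustrative and vary by release):

```shell
# Installing a JDK on a RHEL/CentOS-family system (yum):
sudo yum install -y java-1.8.0-openjdk-devel

# The equivalent on a Debian/Ubuntu-family system (apt):
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

# Service management can differ too, depending on the init system:
sudo systemctl start hadoop-hdfs-namenode   # systemd-based distros
sudo service hadoop-hdfs-namenode start     # SysV-style distros
```

Small differences like these add up when your admins are working under pressure, which is why familiarity usually trumps any technical preference.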
2. Internet access
It sounds simple, but if you are security conscious then you’ll probably keep your cluster off the internet to prevent breaches. Why is this important? Read the next point.
3. Package dependencies
During deployment, Hadoop loves being connected to the internet, because the installation pulls in a host of packages ranging from JDKs to Python libraries and JDBC drivers. If you don’t have internet access, be prepared to double the deployment effort.
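One way to soften this on an air-gapped cluster is to build a local package mirror on a staging host that does have internet access, then point the cluster nodes at it. A rough sketch for a yum-based distro (repo IDs, paths and the mirror hostname are illustrative):

```shell
# On an internet-connected staging host: mirror the repos the installer needs
reposync --repoid=base --repoid=updates --download-path=/var/mirror
createrepo /var/mirror/base
createrepo /var/mirror/updates

# Copy /var/mirror onto a web server the cluster can reach, then on each
# cluster node define a repo that points at the internal mirror:
cat <<'EOF' | sudo tee /etc/yum.repos.d/local-mirror.repo
[local-base]
name=Local mirror - base
baseurl=http://mirror.internal.example/base
enabled=1
gpgcheck=0
EOF
```

It is still extra work, but it turns repeated ad-hoc package hunts into a one-off mirroring exercise.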
4. Hardware selection
We’ve found that you should carefully plan your hardware infrastructure to support your cluster. Incorrectly sizing your hardware, or configuring Hadoop in a way that doesn’t suit that hardware, can result in serious performance issues.
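As a back-of-the-envelope illustration of why sizing matters, the raw-to-usable storage ratio under HDFS’s default 3x replication is easy to underestimate. The figures below are entirely hypothetical:

```shell
# Hypothetical sizing: how many data nodes to hold 100 TB of data?
data_tb=100          # data you plan to store
replication=3        # HDFS default replication factor
overhead=25          # % of raw disk reserved for temp space, OS, non-HDFS use
node_raw_tb=48       # raw disk per data node (e.g. 12 x 4 TB drives)

# Raw capacity needed = data * replication, grossed up for overhead
raw_needed=$(( data_tb * replication * 100 / (100 - overhead) ))
# Round up to whole nodes
nodes=$(( (raw_needed + node_raw_tb - 1) / node_raw_tb ))

echo "Raw capacity needed: ${raw_needed} TB"
echo "Data nodes required: ${nodes}"
```

Under these assumptions, 100 TB of data needs 400 TB of raw disk and nine data nodes before you’ve stored a single extra byte, which is the kind of multiplier that surprises first-time planners.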
5. Security hardening
I find it easier to deploy Hadoop in a fairly low-security configuration, because Hadoop communicates over a wide range of ports and an incorrectly configured firewall can cause problems. So after deployment, set aside time to work out how to customise your firewalls, user and group settings, Kerberos and SSL settings.
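As an example of the kind of post-deployment hardening involved, here is a sketch of opening a few well-known Hadoop 2.x default ports with firewalld. The exact ports vary by Hadoop version, distribution and configuration, so treat these as illustrative and check your own config files:

```shell
# Hypothetical hardening step: allow common Hadoop 2.x service ports
sudo firewall-cmd --permanent --add-port=8020/tcp    # NameNode RPC
sudo firewall-cmd --permanent --add-port=50070/tcp   # NameNode web UI
sudo firewall-cmd --permanent --add-port=50010/tcp   # DataNode data transfer
sudo firewall-cmd --permanent --add-port=8088/tcp    # YARN ResourceManager web UI
sudo firewall-cmd --permanent --add-port=19888/tcp   # MapReduce JobHistory UI
sudo firewall-cmd --reload
```

Multiply this across every daemon and node in the cluster and it becomes clear why the firewall work deserves its own block of time rather than being squeezed into the install.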
If you’re going to DIY, keep in mind you’ll have to figure out a support model: in house if you’re going with vanilla Apache, or enterprise support if you select a Hadoop distribution such as Hortonworks or Cloudera. Don’t forget to also factor in support of the operating system.
So, if you are going to do it yourself then pay attention to the above. You’ll have a lot of decisions to make and obstacles to overcome. However, upon completion you’ll intimately know your cluster and how it handles.
Alternatively, if you don’t want to go through the pain, you can always opt for an easier route in the form of a ready-built Hadoop appliance or Hadoop in the cloud. These options avoid the heavy lifting of installing and configuring a cluster from scratch.
If you look at deploying Hadoop from a Total Cost of Ownership (TCO) perspective, then you should be looking at the costs over a 3-5 year time frame.
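To make that concrete, here is a toy comparison with entirely made-up numbers, showing how a lower up-front cost can erode over a five-year horizon:

```shell
# Hypothetical 5-year TCO comparison (all figures invented, in $k)
years=5
diy_upfront=700         # hardware plus initial build effort
diy_annual=250          # per year: admin staff, OS and Hadoop support
appliance_upfront=1000  # appliance purchase price
appliance_annual=150    # per year: vendor support contract

diy_tco=$(( diy_upfront + diy_annual * years ))
appliance_tco=$(( appliance_upfront + appliance_annual * years ))

echo "5-year DIY TCO:       \$${diy_tco}k"
echo "5-year appliance TCO: \$${appliance_tco}k"
```

With these invented figures, DIY starts out about 30% cheaper but ends up costing more over five years. Your real numbers will differ, which is exactly why the multi-year view matters.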
Deploying Hadoop on your own physical or virtual servers is usually around 30% cheaper initially than a Hadoop appliance. However, once you factor in the additional setup and support costs over the longer term, the cost model changes significantly. See the diagram below for what I mean:
Ultimately, Hadoop offers you a variety of deployment options. That’s the advantage of the framework – it’s entirely up to you how you want to approach your first Hadoop deployment.
One thing is certain: going down the DIY road will be a learning experience for your organisation. But that’s the beauty of DIY, because you get to know the ins and outs of the platform and whether it suits your business needs. Research and read widely, and plan your deployment carefully to ensure future success.
The post The pitfalls of DIY Hadoop appeared first on International Blog.