Big Data Consulting, Hadoop Consulting & Support in San Jose & Santa Clara
What is Hadoop?
Apache™ Hadoop® is open-source software that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer while using the resources of multiple mid-range servers.
Apache Hadoop has two main subprojects:
• MapReduce – The framework that understands and assigns work to the nodes in a cluster.
• HDFS – A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
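Production MapReduce jobs are typically written in Java against the Hadoop API, but the core map → shuffle → reduce flow described above can be illustrated with a small standalone sketch. This is plain Python with no Hadoop dependency; the function names and the word-count example are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for each word in one input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values (here, sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

# In a real cluster each "split" would live on a different HDFS node
# and the map tasks would run in parallel next to the data.
splits = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
print(counts["the"])  # 3
```

The point of the model is that the map and reduce steps are independent per key, which is what lets Hadoop spread them across thousands of machines and simply re-run a task if a node fails.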
V&C Solutions can help you create a reliable Hadoop infrastructure at your datacenter, office, or cloud. We pride ourselves on providing a full range of Hadoop services, including consulting, management, and support for Hadoop clusters of any size.
The first decision you will come across is whether to put Hadoop directly on your servers, or to put a virtual layer such as OpenStack or VMware between Hadoop and the hardware. It comes down to a trade-off between maximum flexibility and maximum use of resources.
Physical Hadoop gives the Hadoop cluster full control over every resource and is the most mature option in terms of installation and implementation. This means just about every feature you would expect from the cluster will work and has been well vetted by the industry. With Hadoop in full control of the hardware, it can make maximum use of those resources for the jobs submitted. On the downside, your cluster is the only thing that will make use of that hardware, and while Hadoop itself is multitenant, if you need data separation for non-technical reasons, a new cluster would need to be built with additional hardware, or the current one split up.
Virtual Hadoop uses a virtual layer like OpenStack to present a set of resources as a pool. This means you could deploy multiple Hadoop clusters onto it and choose how much of those resources should be given to each cluster. It also makes deploying a Hadoop instance faster, and wiping it away just as easy, to adjust resource utilization. The downside is that there is now an extra layer to deal with during initial setup, and those resources are no longer directly under the control of the cluster. Lastly, Hadoop on a virtual layer is not as mature as the physical model, so at times there are bugs to be squashed during setup.
Still not sure if you like Virtual Hadoop? Check out this demo:
If you want access to play with real OpenStack Hadoop, contact us and we will send you an email with a link, username, and password to real infrastructure.
Contact Igor at igor @ vncsolutions .com or call 408-217-6002
In general we recommend the virtual layer, because the benefits typically outweigh the cons; however, every environment is different, so we work with you to find the best solution to suit your needs. There is a nicely written article by VMware that is not completely unbiased, but the information is still valid.
There are additional decisions as well, such as where to put your cluster, which distribution to use, and how to segment the resources. There are too many options to list here, but we can help you get started with the right one for you.
V&C Solutions has unique talent in the Big Data space and is proud to have an expert team specializing in Hadoop consulting, infrastructure build-up, management, and support. Contact us if you have any projects that require MapReduce functionality.
We help Bay Area businesses, from San Jose to San Francisco, construct new Hadoop solutions, and we maintain and support your organization both during the build and after the cluster is finished.