Big Data Consulting, Hadoop Consulting & Support in San Jose & Santa Clara

What is Hadoop?

Apache™ Hadoop® is open-source software that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, these clusters get their resiliency from the software’s ability to detect and handle failures at the application layer while drawing on the resources of many mid-range servers.

Apache Hadoop has two main subprojects:
• MapReduce – The framework that breaks a job into tasks and assigns that work to the nodes in a cluster (a short word-count sketch follows this list).
• HDFS – A distributed file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
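
To make the MapReduce side of this concrete, here is a minimal word-count job written against Hadoop’s Java MapReduce API, along the lines of the classic tutorial example: the mapper emits (word, 1) pairs and the reducer sums them per word. Treat it as a sketch; the class name and input/output paths are illustrative only.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the nodes holding the input blocks and emits (word, 1).
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is packaged as a jar and submitted to the cluster with the hadoop jar command (for example: hadoop jar wordcount.jar WordCount /input /output); the map tasks run where the input blocks live in HDFS, and the reduce tasks aggregate the per-word counts.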

 

V&C Solutions can help you create a reliable Hadoop infrastructure in your datacenter, in your office, or in the cloud. We pride ourselves on providing a full range of Hadoop services, including consulting, management, and support for Hadoop clusters of any size.

The first decision you will face is whether to run Hadoop directly on your servers or to put a virtualization layer, such as OpenStack or VMware, between Hadoop and the hardware. It comes down to maximum flexibility versus maximum use of resources.

Physical Hadoop gives the cluster full control over every hardware resource and is the most mature option in terms of installation and implementation. This means just about every feature you would expect from the cluster will work and has been well vetted by the industry. With Hadoop in full control of the hardware, it can make maximum use of those resources for the jobs submitted. On the downside, your cluster is the only thing that can use that hardware, and while Hadoop itself is multi-tenant, if you need data separation for non-technical reasons you would have to build a new cluster on additional hardware or split up the current one.

Virtual Hadoop uses a virtualization layer such as OpenStack to present a set of hardware resources as a pool. This means you can deploy multiple Hadoop clusters onto it and choose how much of the pool each cluster receives. It also makes it faster to deploy a Hadoop cluster, and just as easy to wipe it away, when you want to adjust resource utilization. The downside is that there is now an extra layer to deal with during initial setup, and those resources are no longer directly under the cluster’s control. Finally, Hadoop on a virtualization layer is not as mature as the physical model, so there are occasionally bugs to squash during setup.

Still not sure whether virtual Hadoop is right for you? Check out this demo:

If you want access to a real OpenStack Hadoop environment to play with, contact us and we will send you an email with a link, username, and password for live infrastructure.

Contact Igor at igor@vncsolutions.com or call 408-217-6002.

In general we recommend the virtualization layer because the benefits typically outweigh the cons; however, every environment is different, so we work with you to find the solution that best suits your needs. VMware has a nicely written article on the subject; it is not completely unbiased, but the information is still valid:

http://www.vmware.com/files/pdf/Benefits-of-Virtualizing-Hadoop.pdf

There are additional decisions as well, such as where to host your cluster, which distribution to use, and how to segment the resources. There are too many options to list here, but we can help you get started with the right ones for you.

V&C Solutions has unique talent in the Big Data space and is proud to have an expert team specializing in Hadoop consulting, infrastructure build-out, management, and support. Contact us if you have any projects that require MapReduce functionality.

We help Bay Area businesses, from San Jose to San Francisco, build new Hadoop solutions, and we maintain and support your organization both while the cluster is being built and after it is finished.

 

Ask the Experts

V&C’s BigData Q&A


What is Hadoop?

Apache Hadoop is an open-source, Java-based software framework for the distributed processing of large data sets across clusters of commodity servers. It originated from Google’s MapReduce and Google File System papers and, much like NoSQL data stores, is used for dealing with thousands of terabytes of data, i.e. big data. It is designed to facilitate rapid data transfer and to scale up from a single server to thousands of machines.

Why Hadoop?

Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics.

Hadoop enables a computing solution that is:

  • Scalable – New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
  • Cost-effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all of your data.
  • Flexible – Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
  • Fault tolerant – When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat (see the HDFS sketch after this list).
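
To illustrate the fault-tolerance point above, here is a small sketch using the HDFS Java API (org.apache.hadoop.fs.FileSystem) that writes a file with a replication factor of three, so its blocks survive the loss of any single node. The file path and contents are hypothetical, and the sketch assumes the client picks up your cluster’s configuration from its classpath and is allowed to override dfs.replication.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    // Ask HDFS to keep three copies of every block this client writes.
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/events/sample.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("replicated across the cluster\n");
    }

    // Report how many replicas HDFS assigned to the file.
    short replication = fs.getFileStatus(path).getReplication();
    System.out.println("Replication factor for " + path + ": " + replication);
  }
}

If a DataNode holding one of those replicas goes down, the NameNode re-replicates the missing blocks to healthy nodes and MapReduce reschedules any affected tasks against the surviving copies, which is what lets the cluster keep processing without missing a beat.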

Think Hadoop is right for you?

Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. What if you could afford to keep all of the data your business generates? What if you had a way to analyze that data?


Why BigData at V&C?

V&C Solutions has unique talent in the Big Data space and is proud to have a team specializing in Hadoop infrastructure build-out, management, and support. Contact us if you have any projects that require MapReduce functionality.