Azure HDInsight versus Azure Databricks


Remon van Harmelen

Azure offers several analytics services these days. In this blog, I want to talk about Azure HDInsight and Azure Databricks and give a bit of background on both. One of the main questions is: when would you choose one over the other?

Azure HDInsight

First, let’s call it what it is: it’s Apache Hadoop running on Microsoft Azure. This means that we get a managed Hadoop cluster in the cloud. Let’s start with some background on Hadoop:

Hadoop: An open-source framework for storing data and running applications on clusters of machines. It offers massive storage for any kind of data and a lot of processing power, and it can handle a virtually “limitless” number of concurrent tasks. Hadoop is maintained as an open-source project by the Apache Software Foundation, hence the name Apache Hadoop.

In HDInsight, we can pick from the following cluster types, depending on what we need:

  • Hadoop: Petabyte-scale processing.
  • Spark: Fast data analytics and cluster computing using in-memory processing.
  • Kafka: A high-throughput, low-latency, real-time streaming platform using a publish-subscribe messaging system.
  • HBase: Fast and scalable NoSQL database.
  • Interactive Query: Uses Hive (SQL on Hadoop) and LLAP (Low Latency Analytical Processing).
  • Storm: Reliable real-time processing of data streams.
  • ML Services: A server for hosting and managing parallel distributed R processes.

We can select only one cluster type when configuring an HDInsight cluster. An HDInsight cluster cannot be paused or turned off. It is billed for as long as it exists, so it can get expensive during periods of low use unless you delete it and recreate it later. For Active Directory integration, HDInsight needs a few extra components: you will need the Enterprise Security Package (ESP), and for that you also have to deploy Azure Active Directory Domain Services. Microsoft provides a high-availability guarantee for the service.
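To make the cost point concrete, here is a rough sketch of the arithmetic. The hourly rate is a made-up placeholder, not an Azure price, and the function name is only for illustration. It compares a cluster that exists all month with one that is only billed for the hours it is actually needed.

```python
def monthly_cost(hourly_rate, hours_per_day, days):
    """Cost of a cluster billed for hours_per_day on each of `days` days."""
    return hourly_rate * hours_per_day * days

# Hypothetical figures: replace the rate with the real price of your cluster size.
rate = 2.0                                   # currency units per cluster hour (placeholder)
always_on = monthly_cost(rate, 24, 30)       # cluster exists around the clock for 30 days
office_hours = monthly_cost(rate, 8, 22)     # needed only 8 hours on 22 working days

idle_share = 1 - office_hours / always_on    # fraction of the bill spent on idle time
print(f"always on: {always_on}, office hours only: {office_hours}")
print(f"share of the bill spent on idle time: {idle_share:.0%}")
```

The exact numbers do not matter. The point is that the bill follows the hours the cluster exists, not the hours it is used.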

In short, Azure HDInsight provides the most popular open-source frameworks, easily accessible from the portal. If you need a combination of multiple clusters, for example HDInsight Kafka for your streaming together with Interactive Query, this would be a great choice.

Azure Databricks

Azure Databricks is a newer service provided by Microsoft. Let’s start with some background information about Spark and Databricks:

Spark: A general-purpose distributed data processing engine that can be used in a wide range of circumstances. It comes with libraries for, among other things, SQL, machine learning, graph computing, and stream processing. Spark does not provide storage, only a computation engine. It builds on the ideas of the Hadoop MapReduce framework and executes them in a more optimized way.
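The MapReduce model mentioned above can be sketched in a few lines of plain Python. This is a single-machine illustration of the idea only, not the Hadoop or Spark API, and all function names are made up for this example. On a real cluster, the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))

counts = word_count(["spark runs on clusters", "hadoop runs on clusters too"])
print(counts)
```

A word count is the classic example for this pattern: the map step looks at each record on its own, and the reduce step only needs the grouped values for one key at a time.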

Databricks: Databricks was founded by the creators of Spark. The team behind Databricks keeps optimizing the Apache Spark engine to run faster and faster. The Databricks platform is said to deliver around five times the performance of open-source Apache Spark. With Databricks, you get collaborative notebooks, integrated workflows, and enterprise security, all in a fully managed cloud platform.

Azure Databricks runs on this optimized Spark engine, which is faster than open-source Spark. Azure Databricks is a PaaS solution, so it doesn’t require a lot of admin work after the initial setup. It provides security through Azure Active Directory integration without any need for custom configuration. In short, it brings you all the benefits of Databricks, but within Azure.

Conclusion

The choice between Azure HDInsight and Azure Databricks depends on the use case that you want to solve. The biggest question is how the data scientists are going to work. If they are going to work without collaborating, it could be wiser to choose Azure HDInsight. If there will be a lot of collaboration, Azure Databricks can go the extra mile thanks to its shared notebooks and readily available workflows.

If you only need a Spark cluster, then Azure Databricks is the better fit, as it has better performance than an open-source Spark cluster.

If you would like a Kafka-based streaming service that is connected to a transformation tool, then the combination of HDInsight Kafka and Azure Databricks is the right solution.

If you have a lot of long-running jobs that need a lot of compute power, then Azure HDInsight could be a better fit than Azure Databricks.
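The rules of thumb from this conclusion can be written down as a small helper. This is only a sketch of the reasoning in this blog, not an official decision tool. The names `Workload` and `recommend` are made up, and the order in which the rules are checked is my own assumption.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    collaborative: bool = False            # data scientists share notebooks and workflows
    spark_only: bool = False               # all you need is a Spark cluster
    kafka_streaming: bool = False          # Kafka streaming feeding a transformation tool
    long_running_heavy_jobs: bool = False  # many long-running, compute-heavy jobs

def recommend(w: Workload) -> str:
    """Return the service this blog's rules of thumb point to (assumed priority order)."""
    if w.kafka_streaming:
        return "HDInsight Kafka + Azure Databricks"
    if w.long_running_heavy_jobs:
        return "Azure HDInsight"
    if w.collaborative or w.spark_only:
        return "Azure Databricks"
    # No collaboration and no special requirement: HDInsight could be the wiser choice.
    return "Azure HDInsight"

print(recommend(Workload(collaborative=True)))
print(recommend(Workload(kafka_streaming=True)))
```

In practice several of these flags will be true at the same time, so treat the outcome as a starting point for the discussion rather than a verdict.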

Remon van Harmelen

Cloud consultant