At its Ignite conference today, Microsoft announced the launch of Azure Managed Instance for Apache Cassandra, its latest NoSQL database offering and a competitor to Cassandra-centric companies like DataStax. Microsoft describes the new service as a ‘semi-managed’ offering that will help companies bring more of their Cassandra-based workloads into its cloud.
“Customers can easily take on-prem Cassandra workloads and add limitless cloud scale while maintaining full compatibility with the latest version of Apache Cassandra,” Microsoft explains in its press materials. “Their deployments gain improved performance and availability, while benefiting from Azure’s security and compliance capabilities.”
As with its counterpart, Azure SQL Managed Instance, the idea here is to give users access to a scalable, cloud-based database service. To use Cassandra in Azure before, businesses had to either move to Cosmos DB, Microsoft’s highly scalable database service, which supports the Cassandra, MongoDB, SQL and Gremlin APIs, or manage their own fleet of virtual machines or on-premises infrastructure.
Cassandra was originally developed at Facebook and then open-sourced in 2008. A year later, it joined the Apache Foundation and today it’s used widely across the industry, with companies like Apple and Netflix betting on it for some of their core services. While AWS launched a managed Cassandra-compatible service at its re:Invent conference in 2019 (it’s called Amazon Keyspaces today), Microsoft only launched the Cassandra API for Cosmos DB last November. With today’s announcement, though, the company can now offer a full range of Cassandra-based services for enterprises that want to move these workloads to its cloud.
That was Google Cloud CEO Thomas Kurian’s simple answer when I asked if he thought he’d achieved what he set out to do in his first year.
A year ago, he took the helm of Google’s cloud operations — which includes G Suite — and set about giving the organization a sharpened focus by expanding on a strategy his predecessor Diane Greene first set during her tenure.
It’s no secret that Kurian, with his background at Oracle, immediately put the entire Google Cloud operation on a course to focus on enterprise customers, with an emphasis on a number of key verticals.
So it’s no surprise, then, that the first highlight Kurian cited is that Google Cloud expanded its feature lineup with important capabilities that were previously missing. “When we look at what we’ve done this last year, first is maturing our products,” he said. “We’ve opened up many markets for our products because we’ve matured the core capabilities in the product. We’ve added things like compliance requirements. We’ve added support for many enterprise things like SAP and VMware and Oracle and a number of enterprise solutions.” Thanks to this, he stressed, analyst firms like Gartner and Forrester now rank Google Cloud “neck-and-neck with the other two players that everybody compares us to.”
If Google Cloud’s previous record made anything clear, though, it’s that technical know-how and great features aren’t enough. One of the first actions Kurian took was to expand the company’s sales team to resemble an organization that looked a bit more like that of a traditional enterprise company. “We were able to specialize our sales teams by industry — added talent into the sales organization and scaled up the sales force very, very significantly — and I think you’re starting to see those results. Not only did we increase the number of people, but our productivity improved as well as the sales organization, so all of that was good.”
He also cited Google’s partner business as a reason for its overall growth. Partner influence revenue increased by about 200% in 2019, and its partners brought in 13 times more new customers in 2019 when compared to the previous year.
MongoDB is hosting its developer conference today and unsurprisingly, the company has quite a few announcements to make. Some are straightforward, like the launch of MongoDB 4.2 with some important new security features, while others, like the launch of the company’s Atlas Data Lake, point the company beyond its core database product.
“Our new offerings radically expand the ways developers can use MongoDB to better work with data,” said Dev Ittycheria, the CEO and President of MongoDB. “We strive to help developers be more productive and remove infrastructure headaches — with additional features along with adjunct capabilities like full-text search and data lake. IDC predicts that by 2025 global data will reach 175 Zettabytes and 49% of it will reside in the public cloud. It’s our mission to give developers better ways to work with data wherever it resides, including in public and private clouds.”
The highlight of today’s set of announcements is probably the launch of MongoDB Atlas Data Lake. Atlas Data Lake allows users to query data on AWS S3 using the MongoDB Query Language, no matter its format, including JSON, BSON, CSV, TSV, Parquet and Avro. To get started, users only need to point the service at their existing S3 buckets. They don’t have to manage servers or other infrastructure. Support for Data Lake on Google Cloud Storage and Azure Storage is in the works and will launch in the future.
Also new is Full-Text Search, which gives users access to advanced text search features based on the open-source Apache Lucene 8.
MongoDB is also starting to bring together Realm, the mobile database product it acquired earlier this year, and the rest of its product lineup. Using the Realm brand, Mongo is merging its serverless platform, MongoDB Stitch, and Realm’s mobile database and synchronization platform. Realm’s synchronization protocol will now connect to MongoDB Atlas’ cloud database, while Realm Sync will allow developers to bring this data to their applications.
“By combining Realm’s wildly popular mobile database and synchronization platform with the strengths of Stitch, we will eliminate a lot of work for developers by making it natural and easy to work with data at every layer of the stack, and to seamlessly move data between devices at the edge to the core backend,” explained Eliot Horowitz, the CTO and co-founder of MongoDB.
The highlight of the latest MongoDB release is a set of new security features. With this release, Mongo is implementing client-side Field Level Encryption. Traditionally, database security has always relied on server-side trust. This typically leaves the data accessible to administrators, even if they don’t have client access. If an attacker breaches the server, that’s almost automatically a catastrophic event.
With this new security model, Mongo is shifting access to the client and to the local drivers. It provides multiple encryption options, and to make use of them, developers will use a new ‘encrypt’ JSON schema attribute.
This ensures that all application code can generally run unmodified and even the admins won’t get access to the database or its logs and backups unless they get client access rights themselves. Since the logic resides in the drivers, the encryption is also handled entirely separately from the actual database.
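To make the ‘encrypt’ attribute a little more concrete, here is a minimal sketch of what a schema marking individual fields for client-side encryption might look like, written as a plain Python dictionary. The field names and the key identifier are hypothetical placeholders, and the exact schema shape and algorithm names a given driver accepts are assumptions that should be checked against MongoDB’s own documentation.

```python
import json

# Hypothetical placeholder: a real deployment would reference the UUID of a
# data encryption key held in a key vault, not a plain string like this.
PLACEHOLDER_KEY_ID = "<data-key-uuid>"


def encrypted_field(bson_type, deterministic=False):
    """Build an `encrypt` attribute for one field (illustrative sketch only)."""
    # Algorithm names are assumed from MongoDB's field level encryption docs.
    algorithm = (
        "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic"
        if deterministic
        else "AEAD_AES_256_CBC_HMAC_SHA_512-Random"
    )
    return {
        "encrypt": {
            "bsonType": bson_type,
            "algorithm": algorithm,
            "keyId": [PLACEHOLDER_KEY_ID],
        }
    }


# Hypothetical patient record: `name` stays in plaintext, while `ssn` and
# `notes` would be encrypted by the driver before they leave the client.
patient_schema = {
    "bsonType": "object",
    "properties": {
        "name": {"bsonType": "string"},
        "ssn": encrypted_field("string", deterministic=True),
        "notes": encrypted_field("string"),
    },
}

print(json.dumps(patient_schema, indent=2))
```

Because the encryption rules live in a schema that the driver consults, the application’s queries and inserts can in principle stay the same, which is the point the announcement makes about code running unmodified.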
Other new features in MongoDB 4.2 include support for distributed transactions and the ability to manage MongoDB deployments from a single Kubernetes control plane.
Google today announced that it has partnered with a number of top open-source data management and analytics companies to integrate their products into its Google Cloud Platform and offer them as managed services operated by its partners. The partners here are Confluent, DataStax, Elastic, InfluxData, MongoDB, Neo4j and Redis Labs.
The idea here, Google says, is to provide users with a seamless user experience and the ability to easily leverage these open-source technologies in Google’s cloud. But there is a lot more at play here, even though Google never quite says so. That’s because Google’s move here is clearly meant to contrast its approach to open-source ecosystems with Amazon’s. It’s no secret that Amazon’s AWS cloud computing platform has a reputation for taking some of the best open-source projects and then forking those and packaging them up under its own brand, often without giving back to the original project. There are some signs that this is changing, but a number of companies have recently taken action and changed their open-source licenses to explicitly prevent this from happening.
That’s where things get interesting, because those companies include Confluent, Elastic, MongoDB, Neo4j and Redis Labs — and those are all partnering with Google on this new project, though it’s worth noting that InfluxData is not taking this new licensing approach and that while DataStax uses lots of open-source technologies, its focus is very much on its enterprise edition.
“As you are aware, there has been a lot of debate in the industry about the best way of delivering these open-source technologies as services in the cloud,” Manvinder Singh, the head of infrastructure partnerships at Google Cloud, said in a press briefing. “Given Google’s DNA and the belief that we have in the open-source model, which is demonstrated by projects like Kubernetes, TensorFlow, Go and so forth, we believe the right way to solve this is to work closely together with companies that have invested their resources in developing these open-source technologies.”
So while AWS takes these projects and then makes them its own, Google has decided to partner with these companies. While Google and its partners declined to comment on the financial arrangements behind these deals, chances are we’re talking about some degree of profit-sharing here.
“Each of the major cloud players is trying to differentiate what it brings to the table for customers, and while we have a strong partnership with Microsoft and Amazon, it’s nice to see that Google has chosen to deepen its partnership with Atlas instead of launching an imitation service,” Sahir Azam, the senior VP of Cloud Products at MongoDB told me. “MongoDB and GCP have been working closely together for years, dating back to the development of Atlas on GCP in early 2017. Over the past two years running Atlas on GCP, our joint teams have developed a strong working relationship and support model for supporting our customers’ mission critical applications.”
As for the actual functionality, the core principle here is that Google will deeply integrate these services into its Cloud Console, similar to what Microsoft did with Databricks on Azure, for example. These will be managed services, Google Cloud will handle the invoicing, and the billings will count toward a user’s Google Cloud spending commitments. Support will also run through Google, so users can use a single service to manage and log tickets across all of these services.
Redis Labs CEO and co-founder Ofer Bengal echoed this. “Through this partnership, Redis Labs and Google Cloud are bringing these innovations to enterprise customers, while giving them the choice of where to run their workloads in the cloud,” he said. “Customers now have the flexibility to develop applications with Redis Enterprise using the fully integrated managed services on GCP. This will include the ability to manage Redis Enterprise from the GCP console, provisioning, billing, support, and other deep integrations with GCP.”
AWS launched DocumentDB today, a new database offering that is compatible with the MongoDB API. The company describes DocumentDB as a “fast, scalable, and highly available document database that is designed to be compatible with your existing MongoDB applications and tools.” In effect, it’s a hosted drop-in replacement for MongoDB that doesn’t use any MongoDB code.
AWS argues that while MongoDB is great at what it does, its customers have found it hard to build fast and highly available applications on the open-source platform that can scale to multiple terabytes and hundreds of thousands of reads and writes per second. So what the company did was build its own document database, but made it compatible with the Apache 2.0 open source MongoDB 3.6 API.
If you’ve been following the politics of open source over the last few months, you’ll understand that the optics of this aren’t great. It’s also no secret that AWS has long been accused of taking the best open-source projects and re-using and re-branding them without always giving back to those communities.
The wrinkle here is that MongoDB was one of the first companies that aimed to put a stop to this by re-licensing its open-source tools under a new license that explicitly stated that companies that wanted to do this had to buy a commercial license. Since then, others have followed.
“Imitation is the sincerest form of flattery, so it’s not surprising that Amazon would try to capitalize on the popularity and momentum of MongoDB’s document model,” MongoDB CEO and president Dev Ittycheria told us. “However, developers are technically savvy enough to distinguish between the real thing and a poor imitation. MongoDB will continue to outperform any impersonations in the market.”
That’s a pretty feisty comment. Last November, Ittycheria told my colleague Ron Miller that he believed that AWS loved MongoDB because it drives a lot of consumption. In that interview, he also noted that “customers have spent the last five years trying to extricate themselves from another large vendor. The last thing they want to do is replay the same movie.”
MongoDB co-founder and CTO Eliot Horowitz echoed this. “In order to give developers what they want, AWS has been pushed to offer an imitation MongoDB service that is based on the MongoDB code from two years ago,” he said. “Our entire company is focused on one thing — giving developers the best way to work with data with the freedom to run anywhere. Our commitment to that single mission will continue to differentiate the real MongoDB from any imitation products that come along.”
A company spokesperson for MongoDB also highlighted that the 3.6 API that DocumentDB is compatible with is now two years old and misses most of the newest features, including ACID transactions, global clusters and mobile sync.
To be fair, AWS has become more active in open source lately and, in a way, it’s giving developers what they want (and not all developers are happy with MongoDB’s own hosted service). Bypassing MongoDB’s licensing by going for API compatibility, given that AWS knows exactly why MongoDB did that, was always going to be a controversial move and won’t endear the company to the open-source community.
Cosmos DB is undoubtedly one of the most interesting products in Microsoft’s Azure portfolio. It’s a fully managed, globally distributed multi-model database that offers throughput guarantees, a number of different consistency models and high read and write availability guarantees. Now that’s a mouthful, but basically, it means that developers can build a truly global product, write database updates to Cosmos DB and rest assured that every other user across the world will see those updates within 20 milliseconds or so. And to write their applications, they can pretend that Cosmos DB is a SQL- or MongoDB-compatible database, for example.
Cosmos DB officially launched in May 2017, though in many ways it’s an evolution of Microsoft’s existing DocumentDB product, which was far less flexible. Today, a lot of Microsoft’s own products run on Cosmos DB, including the Azure Portal itself, as well as Skype, Office 365 and Xbox.
Today, Microsoft is extending Cosmos DB with the launch of its multi-master replication feature into general availability, as well as support for the Cassandra API, giving developers yet another option for bringing existing products, in this case those written for Cassandra, to Cosmos DB.
Microsoft now also promises 99.999 percent read and write availability. Previously, its read availability promise was 99.99 percent. And while that may not seem like a big difference, it does show that after more than a year of operating Cosmos DB with customers, Microsoft now feels more confident that it’s a highly stable system. In addition, Microsoft is also updating its write latency SLA and now promises less than 10 milliseconds at the 99th percentile.
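A quick back-of-the-envelope calculation shows what the extra nine means in practice: roughly 53 minutes of permitted downtime per year at 99.99 percent versus roughly 5 minutes at 99.999 percent. The sketch below simply converts an availability percentage into minutes per year. It assumes a 365-day year and is not a statement of how Microsoft measures its SLA, which may be defined over a different window.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes_per_year(availability_percent):
    """Downtime budget implied by an availability target over one year."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.99, 99.999):
    minutes = max_downtime_minutes_per_year(pct)
    print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime per year")
```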
“If you have write-heavy workloads, spanning multiple geos, and you need this near real-time ingest of your data, this becomes extremely attractive for IoT, web, mobile gaming scenarios,” Microsoft Cosmos DB architect and product manager Rimma Nehme told me. She also stressed that she believes Microsoft’s SLA definitions are far more stringent than those of its competitors.
The highlight of the update, though, is multi-master replication. “We believe that we’re really the first operational database out there in the marketplace that runs on such a scale and will enable globally scalable multi-master available to the customers,” Nehme said. “The underlying protocols were designed to be multi-master from the very beginning.”
Why is this such a big deal? With this, developers can designate every region they run Cosmos DB in as a master in its own right, making for a far more scalable system in terms of being able to write updates to the database. There’s no need to first write to a single master node, which may be far away, and then have that node push the update to every other region. Instead, applications can write to the nearest region, and Cosmos DB handles everything from there. If there are conflicts, the user can decide how those should be resolved based on their own needs.
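As a rough illustration of that flow, the sketch below models two regions that each accept writes locally, exchange them afterwards and settle conflicts with a pluggable resolver. It uses last-writer-wins, one common policy, as the default. All names here are hypothetical and the code is a toy model of the idea, not Cosmos DB’s actual protocol or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: int  # hypothetical logical timestamp attached by the writer
    region: str


def last_writer_wins(a, b):
    """Keep the write with the later timestamp; break ties by region name."""
    return max(a, b, key=lambda w: (w.timestamp, w.region))


class Region:
    def __init__(self, name, resolve=last_writer_wins):
        self.name = name
        self.resolve = resolve  # the user-chosen conflict resolution policy
        self.data = {}

    def local_write(self, key, value, timestamp):
        """Accept a write in this region without contacting any other region."""
        write = Write(key, value, timestamp, self.name)
        self.apply(write)
        return write

    def apply(self, write):
        """Merge a local or replicated write, resolving conflicts if needed."""
        current = self.data.get(write.key)
        self.data[write.key] = write if current is None else self.resolve(current, write)


europe, asia = Region("europe"), Region("asia")

# Two users update the same item at nearly the same time in different regions.
w1 = europe.local_write("profile:42", "nickname=Ada", timestamp=100)
w2 = asia.local_write("profile:42", "nickname=Grace", timestamp=105)

# Later, each region receives the other's write and resolves the conflict.
asia.apply(w1)
europe.apply(w2)

print(europe.data["profile:42"].value, asia.data["profile:42"].value)
```

Because both regions apply the same deterministic rule, they end up with the same value no matter in which order the replicated writes arrive. Swapping in a different `resolve` function is the toy equivalent of letting the user decide how conflicts are handled.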
Nehme noted that all of this still plays well with Cosmos DB’s existing set of consistency models. If you don’t spend your days thinking about database consistency models, then this may sound arcane, but there’s a whole area of computer science that focuses on little else but how to best handle a scenario where two users virtually simultaneously try to change the same cell in a distributed database.
Unlike other databases, Cosmos DB allows for a variety of consistency models, ranging from strong to eventual, with three intermediary models. And it actually turns out that most Cosmos DB users opt for one of those intermediary models.
Interestingly, when I talked to Leslie Lamport, the Turing award winner who developed some of the fundamental concepts behind these consistency models (and the popular LaTeX document preparation system), he wasn’t all that sure that the developers are making the right choice. “I don’t know whether they really understand the consequences or whether their customers are going to be in for some surprises,” he told me. “If they’re smart, they are getting just the amount of consistency that they need. If they’re not smart, it means they’re trying to gain some efficiency and their users might not be happy about that.” He noted that when you give up strong consistency, it’s often hard to understand what exactly is happening.
But strong consistency comes with its own drawbacks, too, including higher latency. “For strong consistency there are a certain number of roundtrip message delays that you can’t avoid,” Lamport noted.
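One simplified way to see where that latency comes from is to assume that a strongly consistent write must be acknowledged by a majority of replicas before it is confirmed, while a weaker model can confirm as soon as the nearest replica has it. The sketch below uses hypothetical round-trip times and a generic majority-quorum model. It is not a description of how Cosmos DB implements any of its consistency levels.

```python
def quorum_commit_latency_ms(round_trips_ms, quorum):
    """Time until `quorum` replicas have acknowledged, if all are contacted in parallel."""
    if not 1 <= quorum <= len(round_trips_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    # The write is confirmed once the quorum-th fastest acknowledgment arrives.
    return sorted(round_trips_ms)[quorum - 1]


# Hypothetical round-trip times from the client to five replicas, in milliseconds.
round_trips_ms = [1, 35, 80, 140, 210]

majority = len(round_trips_ms) // 2 + 1
majority_latency = quorum_commit_latency_ms(round_trips_ms, majority)
nearest_latency = quorum_commit_latency_ms(round_trips_ms, 1)

print(f"majority of {majority}: {majority_latency} ms, nearest replica only: {nearest_latency} ms")
```

In this toy model the majority write can never be faster than the round trip to the third-closest replica, which is the kind of unavoidable delay Lamport describes.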
The Cosmos DB team isn’t just building on some of the fundamental work Lamport did around databases, but it’s also making extensive use of TLA+, the formal specification language Lamport developed in the late 90s. Microsoft, as well as Amazon and others, is now training its engineers to use TLA+ to describe their algorithms mathematically before they implement them in whatever language they prefer.
“Because [Cosmos DB is] a massively complicated system, there is no way to ensure the correctness of it because we are humans, and trying to hold all of these failure conditions and the complexity in any one person’s — one engineer’s — head, is impossible,” Microsoft Technical Fellow Dharma Shukla noted. “TLA+ is huge in terms of getting the design done correctly, specified and validated using the TLA+ tools even before a single line of code is written. You cover all of those hundreds of thousands of edge cases that can potentially lead to data loss or availability loss, or race conditions that you had never thought about, but that two or three years ago after you have deployed the code can lead to some data corruption for customers. That would be disastrous.”
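To give a flavor of what exhaustively checking a design means, here is a small Python sketch that explores every possible interleaving of two processes that each read a shared counter and then write back the value plus one. This is not TLA+ and not the TLA+ tooling, only a toy state-space search in the same spirit, with a hypothetical two-step algorithm chosen because it contains a classic lost-update race.

```python
from collections import deque

N_PROCESSES = 2


def initial_state():
    # (shared counter, program counter per process, local copy per process)
    return (0, (0,) * N_PROCESSES, (0,) * N_PROCESSES)


def next_states(state):
    """Yield every state reachable by letting one process take one step."""
    counter, pcs, local_copies = state
    for p in range(N_PROCESSES):
        if pcs[p] == 0:  # step 1: read the shared counter into a local copy
            new_copies = local_copies[:p] + (counter,) + local_copies[p + 1:]
            yield (counter, pcs[:p] + (1,) + pcs[p + 1:], new_copies)
        elif pcs[p] == 1:  # step 2: write local copy + 1 back to the counter
            yield (local_copies[p] + 1, pcs[:p] + (2,) + pcs[p + 1:], local_copies)
        # pcs[p] == 2 means the process has finished and takes no more steps


def invariant(state):
    """Once every process has finished, the counter must equal N_PROCESSES."""
    counter, pcs, _ = state
    return any(pc != 2 for pc in pcs) or counter == N_PROCESSES


def check():
    """Breadth-first search of all reachable states; return a violating trace or None."""
    start = initial_state()
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


violation = check()
for step in violation or []:
    print(step)
```

The search finds a trace in which both processes read the counter before either writes, so one increment is lost. That is the sort of edge case that is easy to miss when reasoning about code by hand and that a systematic exploration surfaces before anything is deployed.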
“Programming languages have a very precise goal, which is to be able to write code. And the thing that I’ve been saying over and over again is that programming is more than just coding,” Lamport added. “It’s not just coding, that’s the easy part of programming. The hard part of programming is getting the algorithms right.”
Lamport also noted that he deliberately chose to make TLA+ look like mathematics, not like another programming language. “It really forces people to think above the code level,” Lamport noted and added that engineers often tell him that it changes the way they think.
As for those companies that don’t use TLA+ or a similar methodology, Lamport says he’s worried. “I’m really comforted that [Microsoft] is using TLA+ because I don’t see how anyone could do it without using that kind of mathematical thinking — and I worry about what the other systems that we wind up using built by other organizations — I worry about how reliable they are.”
Data is the lifeblood of the modern corporation, yet acquiring, storing, processing, and analyzing it remains a remarkably challenging and expensive project. Every time data infrastructure finally catches up with the streams of information pouring in, another source and more demanding decision-making make the existing technology obsolete.
Few cities rely on data the same way as New York City, nor has any other city so shaped the technology that underpins our data infrastructure. Back in the 1960s, banks and accounting firms helped to drive much of the original computation industry with their massive finance applications. Today, that industry has been supplanted by finance and advertising, both of which need to make microsecond decisions based on petabyte datasets and complex statistical models.
Unsurprisingly, the city’s hunger for data has led to waves of database companies finding their home in the city.
As web applications became increasingly popular in the mid-aughts, SQL databases came under increasing strain to scale, while also proving to be inflexible in terms of their data schemas for the fast-moving startups they served. That problem spawned Manhattan-based MongoDB, whose flexible “NoSQL” schemas and horizontal scaling capabilities made it the default choice for a generation of startups. The company would go on to raise $311 million according to Crunchbase, and debuted late last year on NASDAQ, trading today with a market cap of $2 billion.
At the same time that the NoSQL movement was hitting its stride, academic researchers and entrepreneurs were exploring how to evolve SQL to scale like its NoSQL competitors, while retaining the kinds of features (joining tables, transactions) that make SQL so convenient for developers.
One leading company in this next generation of database tech is New York-based Cockroach Labs, which was founded in 2015 by a trio of former Square, Viewfinder, and Google engineers. The company has gone on to raise more than $50 million according to Crunchbase from a luminary list of investors including Peter Fenton at Benchmark, Mike Volpi at Index, and Satish Dharmaraj at Redpoint, along with GV and Sequoia.
While web applications have their own peculiar data needs, the rise of the internet of things (IoT) created a whole new set of data challenges. How can streams of data from potentially millions of devices be stored in an easily analyzable manner? How could companies build real-time systems to respond to that data?
Mike Freedman and Ajay Kulkarni saw that problem increasingly manifesting itself in 2015. The two had been roommates at MIT in the late 90s, and then went on separate paths into academia and industry respectively. Freedman went to Stanford for a PhD in computer science, and nearly joined the spinout of Nicira, which sold to VMware in 2012 for $1.26 billion. Kulkarni joked that “Mike made the financially wise decision of not joining them,” and Freedman eventually went to Princeton as an assistant professor, and was awarded tenure in 2013. Kulkarni founded and worked at a variety of startups including GroupMe, as well as receiving an MBA from MIT.
The two had startup dreams, and tried building an IoT platform. As they started building it though, they realized they would need a real-time database to process the data streams coming in from devices. “There are a lot of time series databases, [so] let’s grab one off the shelf, and then we evaluated a few,” Kulkarni explained. They realized what they needed was a hybrid of SQL and NoSQL, and nothing they could find offered the feature set they required to power their platform. That challenge became the problem to be solved, and Timescale was born.
In many ways, Timescale is how you build a database in 2018. Rather than starting de novo, the team decided to build on top of Postgres, a popular open-source SQL database. “By building on top of Postgres, we became the more reliable option,” Kulkarni said of their thinking. In addition, the company opted to make the database fully open source. “In this day and age, in order to get wide adoption, you have to be an open source database company,” he said.
Far more important though are their customers, who are definitely not the typical tech startup roster and include companies from oil and gas, mining, and telecommunications. “You don’t think of them as early adopters, but they have a need, and because we built it on top of Postgres, it integrates into an ecosystem that they know,” Freedman explained. Kulkarni continued, “And the problem they have is that they have all of this time series data, and it isn’t sitting in the corner, it is integrated with their core service.”
New York has been a strong home for the two founders. Freedman continues to be a professor at Princeton, where he has built a pipeline of potential grads for the company. More widely, Kulkarni said, “Some of the most experienced people in databases are in the financial industry, and that’s here.” That’s evident in one of their investors, hedge fund Two Sigma. “Two Sigma had been the only venture firm that we talked to that already had built out their own time series database,” Kulkarni noted.
The two also benefit from paying customers. “I think the Bay Area is great for open source adoption, but a lot of Bay Area companies, they develop their own database tech, or they use an open source project and never pay for it,” Kulkarni said. Being in New York has meant closer collaboration with customers, and ultimately more revenues.
Open source plus revenues. It’s the database way, and the next wave of innovation in the NYC enterprise infrastructure ecosystem.