I recently caught up with Jamie Engesser, VP Product Management at Hortonworks, during the company’s DataWorks Summit 2018 conference in San Jose, California, to get an update on his company’s direction and his sense for the pulse of the big data industry. Jamie brings over 20 years of software industry leadership to Hortonworks. Most recently, Jamie had global responsibility for Hortonworks’ Solutions Engineering organization, which is focused on guiding organizations to identify their Hadoop opportunity from Business Case, to Proof of Concept, to successful Project Delivery. Prior to Hortonworks, Jamie led Global Solutions Engineering teams at SpringSource and VMware. Jamie has extensive experience spanning Open Source, Java, Platform as a Service (PaaS), Application Infrastructure and Big Data. He holds a Bachelor of Science in Industrial Engineering from Montana State University.
insideBIGDATA: Hortonworks had a number of important announcements for DataWorks Summit. Can you drill down on some of the big movements this past year?
Jamie Engesser: Yes! Like the first one about agile application deployment via containerization; I think that’s such a hot topic right now. The big thing we’re doing in HDP 3 correlates to two things. The first is people want to push more interesting data workloads onto the cluster, whether it’s deep learning, AI, those types of workloads. The notion of containerizing on top of Hadoop 3 is a really big deal. Now it lets people take those interesting workloads, push them onto the cluster, and have them data local. The second is, as I push those workloads down, leveraging GPUs to optimize the query is very important as well. So those are the two big things I would say.
insideBIGDATA: The part of the message pertaining to handling deep learning workloads interests me because in addition to being a journalist, I’m also a practicing data scientist. I like to see solutions like this coming from companies like Hortonworks. So how is that going to affect data scientists and people who actually use this technology, coupled with your product?
Jamie Engesser: It’s a matter of a little looking back to look forward. If you look at data scientists today, they tend to take a bunch of GPUs in boxes, put them under their desktops, and that’s where they work. And they tend to use an offline set of data. They take a small set of data, bring it local, and then use the GPUs to go compute algorithms, etc. That model works, but it works at small scale.
If you look at the situation as I scale up to an enterprise, and as I scale up to 20, 30, 40 data scientists working as a team, that model breaks down very rapidly for two reasons. One is, just grabbing the data and taking it is really hard to control for things like General Data Protection Regulation (GDPR). A number of the regulations out there cause problems, so they don’t want to take the data out of the cluster and go do work. The second thing is, it doesn’t allow sharing. If I have 40 data scientists, I’d rather prioritize that workload and push the right workloads onto the cluster. If you look at our 3.0 release, it gives you those two things. It lets you run that workload on the cluster, and it means you don’t have to take all the data out of the cluster and put it somewhere else. So I get everything running with the data in the cluster. It’s a really big win for the data scientists: faster access to more compute resources and faster data.
insideBIGDATA: Let’s talk a little about real-time databases, improved query optimization, and so forth, enabled via Apache Hive. Can you talk more about that and how it’s going to solve problems for enterprises?
Jamie Engesser: You bet. So we’ve done a lot of work on this. Again, let’s look back to look forward. If you look at when I started at Hortonworks six years ago, there were TPC benchmarks (the query benchmarks out there for data sets) which would run in 3,600 seconds. That’s an hour. That same query today runs in about 1.8 seconds. So when you look at Hive 3, we’ve done so much innovation with it, and Hadoop 3, that we’re now running what was an hour in 1.8 seconds today. So that’s a huge improvement in what we’ve been able to do. So number one, Hive being able to do this really complex querying has been a big win for us. And being able to support all the query standards that are out there, that’s a big win. And the number two win is, historically, Hive and Hadoop sat on an append-only file system. The only reason you care about that is, historically, we were select-only. We were not full CRUD, or create, read, update, delete. There was no notion of an update or a delete. What you did was, you had to rebuild the table with fewer rows. What we’ve done in this release is we’ve allowed you to have full access semantics so now you can not only create data, read data, but also update and delete it. Now when I go back to GDPR, if I have 20 million subscribers or 20 billion rows in a table and Jamie Engesser says, “I want the right to be forgotten as part of the GDPR requirements,” I can delete one row on a table and continue processing. That could never happen before. No one else in the Hadoop ecosystem allows you to do that. We have just given you that capability with Hadoop 3. This is another big win.
insideBIGDATA: You mentioned GDPR, which has been huge in the last month since it took effect in the European Union. What kind of enhanced security and governance features have you added to the products since last year?
Jamie Engesser: Great question. So there are a few things. Number one is Apache Ranger, which is our core security layer. The interesting thing we’ve done in Apache Ranger is we’ve allowed tag-based policies. What are “tag-based policies”? Why do you care about that? Because if you think about data through a life cycle in a Hadoop landscape, that means from the time it was created, potentially on the edge of an IoT device, we use NiFi to stream the data into the cluster. As it’s coming in, maybe Kafka does some streaming analytics on it, it lands in HDFS, it becomes a table in Hive, and then somebody does a query or somebody runs a Spark job on it. You need to think about that life cycle. Ranger doesn’t require me to put a specific policy on that row or column, but instead, I just tag the data and it will automatically interpret that for the entire life cycle. For Ranger to be able to use tag-based policies has been a really big win. In addition, that lineage that I just talked about is part of Atlas. It’s the governance engine that maintains that lineage for you. So those two things have been big wins that we’ve evolved to support tag-based policies in the lineage.
The other area we added is Data Steward Studio, about two months ago, on top of Ranger and Atlas. To see what Data Steward Studio does, consider that Hortonworks customers run about 1.5 exabytes of data on-prem. With that 1.5 exabytes of data, if you think about it, as part of GDPR requirements and security governance requirements, I have to be able to define all the sensitive data in that. Think about how long that’s going to take anybody to accomplish. I could have a hundred people working on it for the rest of their lives and they’ll probably never catch up because we have 1.5 exabytes of data. But we’re also generating new data at such a rate that they just won’t keep up. So one of the things we did with Data Steward Studio is we give you dynamic profiling. We’ll automatically look at your data, automatically profile your data for sensitive concerns, and take those sensitive concerns and automatically tag the data for you. You instantly get security because of dynamic profiling. This is really cool stuff. Now, for the first time, it will automatically index and define sensitive data, and help you manage your GDPR requirements. So that’s a really big deal.
insideBIGDATA: I would think a lot of your customers are going to be very interested in this feature set.
Jamie Engesser: We were over in Berlin two months ago and we did a main stage demo of it and literally people were clapping as I was going through the demo. So, yes, there’s an enthusiasm level. Europeans are living with it daily. There are a number of international companies in the US that have to deal with it but are, I would argue, moving slower. But I think it opens up to the original discussion on the data science side because what it does is it opens up a situation where a data scientist can only leverage the data that they have access to. In the average enterprise, for data that they’re not confident in, they just take it away from everybody and say, “We don’t know what that data’s going to do so I’m not going to give it to you.” Now it forces them to prioritize that data, identify that data, and put sets of data controls on it. In addition, it opens the door to the 80 or 90 percent of the enterprise data that’s not sensitive. So now the actual data scientists get access to more data because we’ve identified that it’s not sensitive so therefore they can consume it. Frankly, GDPR is a big opportunity for data scientists in Europe to get access to more data longer term.
insideBIGDATA: When is general availability?
Jamie Engesser: We just launched early access and GA is in Q3.
insideBIGDATA: Are there any other features in this release that we didn’t cover yet?
Jamie Engesser: I think the other thing that I would add is the cloud side. We’ve done a lot of work in HDP 3 to make it really good for data scientists, improve the security and governance model. And as part of that, we’ve also made it so it runs really well in any one of the cloud providers, whether that’s Google, whether that’s Amazon, whether that’s Azure, we want it to run great. We’ve extended to support all Google cloud storage connectors, as well as the Amazon, Azure, and Wasabi connectors. So now we can run your data platform in any one of those plus on-prem and it should all be transparent to you.
insideBIGDATA: IBM is another one, yes?
Jamie Engesser: Thank you. Yes, and IBM, and IBM’s hosted analytics cloud.
insideBIGDATA: It appears that the company’s been hard at work in the last year. What can you say about the future? What directions do you see? What’s in the works? What do you see happening for the next DataWorks Summit?
Jamie Engesser: I think we’re pushing really hard in the DataPlane space. If you think about Hortonworks, we’ve historically been a really strong technology-first company. One of the things we’re trying to do is make that technology more applicable to the business user, whether that’s business analyst, data scientist, you name it. So we’re pushing hard on the DataPlane layer to take it to the next level, which means, as a business analyst, I come in, I can discover new data, find the data I care about, go launch workloads, whether in the cloud or on-prem, make sure that anything that I deal with is secured, and governed. You’re going to see us continue to push hard on that.
If you look at what we’ve done so far, we’ve introduced Data Life Cycle Manager and Data Steward Studio into that DataPlane service layer. You’re going to see us pushing Data Analytics Studio into it, and Streaming Messaging Manager into it, to give you more capability to control your data, no matter where it lands. So you’ll see that the DataPlane layer will be fundamental for us as we move forward.
insideBIGDATA: And lastly, I know it’s only the first day, but what’s your impression about DataWorks Summit this year in comparison with previous years?
Jamie Engesser: Maybe I’ll give you a little bit different perspective than you probably see. I run our customer advisory board. We had our customer advisory board meeting on Monday. We had the biggest customer advisory board we’ve ever had, from the biggest Fortune 1000 out there. When I look at that, we had 58 of our key customers out there. When I look at the questions they answered and the depth with which they understand the platform, evolve the platform, and solve their use cases, I get super excited. So when I look at DataWorks Summit, it’s less the volume of people. It’s more the level of the people that are here solving complex sorts of problems. I’d say it moved from the “Bay Area cool kids” mentality to the enterprise customers and enterprise people. When you stop and talk to somebody in the hall, I’ll bet you, 9 out of 10 times, it’s going to be somebody like a Boeing, and the list is long. I think that’s a good thing for us. It really shows that we’re moving from this early stage, where it’s hip and trendy, to the real world of solving real-world business use cases. That’s what I get most excited about.
insideBIGDATA: Just from walking around the Community Expo area myself, it appears that you have quality partners. The companies down there all speak very highly of their relationship with Hortonworks. The other thing I saw was that they also have very important big customers that use whatever technology they have coupled with Hortonworks.
Jamie Engesser: I think it’s like anything. There are a few reasons why customers are successful with any technology and one of the biggest reasons is it works well in their ecosystem. I think our partnerships are really what drive our enterprise adoption. It again goes back to, “I’ve got a problem to solve. I’m not here to just buy technology or play with technology. I’ve got a true problem to solve.” And I think having strong partners helps the customer solve those problems faster. So our Partner Works program that we run has been really, really good and that’s how we bring our partners in, help our partners to ramp up, understand the technology, go through the certifications, and ultimately, work with our customers to be successful.
Contributed by Daniel D. Gutierrez, Managing Editor and Resident Data Scientist of insideBIGDATA. In addition to being a tech journalist, Daniel also is a practicing data scientist, author, educator and sits on a number of advisory boards for various start-up companies.