
Techzine Talks on Tour
Techzine Talks on Tour is a podcast series recorded on location at the events Coen and Sander attend all over the world. A spin-off of the successful Dutch series Techzine Talks, this new English series aims to reach new audiences.
Each episode is an approximately 30-minute discussion that Coen or Sander has with a high-level executive of a technology company. The episodes are single-take affairs, and we hardly edit them afterwards, apart from polishing up the audio a bit, of course. This way, you get an honest, open discussion where everyone speaks their mind on the topic at hand.
These topics vary greatly, as Coen and Sander attend a total of 50 to 60 events each year, ranging from open-source events like KubeCon to events hosted by Cisco, IBM, Salesforce and ServiceNow, to name only a few. With a lot of experience in many walks of IT life, Coen and Sander always manage to produce an engaging, in-depth discussion on general trends, but also on technology itself.
So follow Techzine Talks on Tour and stay in the know. We might just tell you a thing or two you didn't know yet, but which might be very important for your next project or for your organization in general. Stay tuned and follow Techzine Talks on Tour.
Amazon S3: almost 20 years old, but still very modern
Amazon S3 is the oldest service in the catalogue of Amazon Web Services (AWS). We sat down with Andy Warfield, Distinguished Engineer at AWS, to talk about the 19-year journey of Amazon S3 from simple backup solution to sophisticated data foundation.
Warfield talks about how S3 began as essentially "a storage locker across town" for archival purposes, before customers discovered its REST architecture made it capable of handling massive parallel workloads. This unexpected advantage fueled S3's growth beyond unstructured data into analytics and AI domains. The introduction of columnar formats like Parquet and open table formats like Iceberg transformed what was possible, culminating in AWS's recent launch of S3 Tables.
Besides the general direction Amazon S3 has taken and continues to take, we also discuss other challenges related to current developments. One of them is how AWS and Amazon S3 handle the bloat associated with vectorizing data for AI use cases. This challenge is driving significant innovation at AWS to optimize storage for the AI era.
All in all, Amazon S3 has had to evolve continuously to mirror the broader developments in how businesses approach data: from static archives to dynamic, queryable assets powering real-time decision-making.
The final topic we talk about is the future of Amazon S3. Warfield wants it to be "pulled closer to application code" with improved performance and more flexible access methods to ensure customers don't have to choose upfront how they'll use their data.
We are slightly biased of course, but we think this is a very good episode of Techzine Talks on Tour, and we highly recommend it to anyone interested in the topic. Tune in now!
Welcome to this new episode of Techzine Talks on Tour. I'm at AWS Summit in Amsterdam and I'm here with Andy Warfield. Hi, Andy, welcome to the show.
Speaker 2:Thank you.
Speaker 1:Thanks for joining us.
Speaker 1:You're a distinguished engineer at AWS. What does that mean?
Speaker 2:First question. It means I have elbow patches on my jacket. I didn't put the jacket with the elbow patches on today.
Speaker 1:Okay, so it means you've reached a certain level, or you've been working at AWS for a while.
Speaker 2:At AWS. Amazon actually does a really cool thing with the engineering job family, which is that in a lot of more traditional tech companies you can only go so far as an engineer, and then you inevitably move to more of a manager or director track. At Amazon, the IC job family runs all the way up to the VP, the exec level. So I'm an engineer, I even get to write code some of the time, but I work across all of the storage and data teams at Amazon, and I'm an exec. So that clears that up.
Speaker 1:That's good, because I've always wanted to know. So you mostly work on S3, right?
Speaker 2:Primarily S3. I work across the storage teams.
Speaker 1:Yeah, so let's talk about that then, because S3, together with EC2, I think, were the first two kind of services that AWS launched in 2006.
Speaker 2:That's right, S3 was first. S3 was first. Oh yeah, this is an internal point.
Speaker 1:That's right, S3 was first. S3 was first.
Speaker 2:Oh, you want to make a point. This is an internal point. You want to make a point of that?
Speaker 1:So it's an old service, it's 18 years old, but it's still modern, right? It's still being heavily used. So let's start off with a brief history of S3, some landmark moments. How did it progress over the past 18 years?
Speaker 2:Well, I mean, it's kind of an amazing story. S3 was, I think, born from necessity in AWS. A lot of the early invention that led to AWS itself launching was building services for the retail side of Amazon. They noticed that storage was kind of a distracting source of engineering work, that they had to keep building these independent storage systems, and they saw at the time how much software was being built on top of REST APIs, and so decided to build a REST-based storage service. And progressively, the way that people have used S3 has changed.
Speaker 2:In the early days, the model that I have, at least in my imagination, for the original S3 was that it was kind of like one of those storage facilities on the outskirts of town, that you put things in your trunk and drive to. It was largely archival. People would push backups to S3 and things like that. What is now Glacier, basically. I mean, Glacier was really born out of that use case, needing an even lower-cost offering. But a lot of the data was relatively cold, and even the data that was in S3 originally was relatively cool.
Speaker 2:And then, as we moved into the, I guess, mid-2010s, we really started to see the nature of the way that people used S3 change. I think some of that started with client integrations. There were a lot of workloads running against HDFS with Hadoop at the time, and then the Apache Hadoop folks wrote a connector called S3A that allowed you to use S3 directly instead of the HDFS file system. That kind of led to Hadoop workloads being able to shift to run in the cloud, and ultimately Spark adopted S3A as well.
Speaker 1:That kind of also led to the disappearance of Hadoop, more or less.
Speaker 2:And so that kind of led to a bunch of things happening all at once.
Speaker 2:I think that on the analytics side, you saw Parquet really start to become a dominant, or at least a popular, storage format in analytics workloads. But then in a bunch of adjacent verticals, things like genomics and media and entertainment, which used to be big Hadoop use cases as well, you started to see the same pattern.
Speaker 2:Uh-huh, you touched on that a bit. Genomics stands out as having a lot of Linux-based tools, there's this GATK toolkit and stuff, but massively parallel workloads. A lot of genomics is just search across loads and loads of large data files.
Speaker 1:But S3 has always been seen, at least to me, when people talked about S3, at least until a couple of years ago, as the de facto standard for unstructured data. That was the connector everybody wanted. But that's not true anymore, right? It hasn't been unstructured data alone for quite a long time.
Speaker 2:It's been a lot of things. I mean, it has always been unstructured, and there's still a huge population of unstructured data, but you're totally right. The interesting thing with that parallel track of S3A and Parquet, which started probably around, I don't know, 2012 or something like that, is that it really saw a lot of structured data stored in S3. Not so much low-latency transactional operational databases, but data warehousing type workloads and analytics workloads, and really that's just kind of grown.
Speaker 1:And then, I guess: was S3 suitable, or was it a good landing place, for structured data? How much did you have to do to actually accommodate that?
Speaker 2:I think the thing that, in some sense, customers realized before we did, was that the fact that S3 is structured as a load-balanced, REST-presenting web service makes it a much more parallel storage system than any storage system we had leading up to that. Even HPC-based storage systems in a lot of cases were tricky to scale up to really, really big apertures, whereas the REST nature of S3 made it very easy for applications to just scale load across endpoints. And so applications that had relatively variable latency requirements were able to tolerate the longer latencies.
Speaker 1:Because that was the thing, right? Unstructured data wasn't necessarily the fastest type of store, an unstructured object store, for example. So if you didn't need the low latency, that was fine. But I can imagine that when you're looking at structured data and applications that do need low latency, it does create sort of a stress on the initial S3 proposition, right?
Speaker 2:Totally. Well, there's two things to what you're saying. The first one is, I wouldn't say that those applications didn't need fast storage; they didn't need the quickest storage, maybe, to draw a distinction. What they often were doing was driving terabits, tens or even in some cases hundreds of terabits a second, of traffic to S3. So from a throughput perspective, they're pulling enormous amounts of data through, but they're tolerant of that first byte arriving slower. But, like you say, there are other workloads for which they need the latency. They need quick storage, and that's what led us to launch Express One Zone a few years ago.
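Warfield's throughput-versus-latency point is easy to make concrete: clients get huge aggregate throughput out of S3 by splitting a large object into byte ranges and fetching them in parallel across the load-balanced endpoints. A minimal sketch, with the ranged GET simulated by a local function (a real client would issue HTTP `Range` requests against S3):

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated object store: in real S3, each fetch would be an HTTP GET
# with a Range header, load-balanced across many front-end endpoints.
OBJECT = bytes(range(256)) * 4096  # a ~1 MiB "object"

def fetch_range(start: int, end: int) -> bytes:
    """Stand-in for a ranged GET; returns bytes [start, end)."""
    return OBJECT[start:end]

def parallel_read(size: int, part_size: int) -> bytes:
    """Read an object as many concurrent range requests, then reassemble."""
    ranges = [(off, min(off + part_size, size)) for off in range(0, size, part_size)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)

data = parallel_read(len(OBJECT), part_size=64 * 1024)
assert data == OBJECT  # the parts reassemble into the original object
```

Each range request is independent, which is why throughput scales with parallelism even when the latency of any single first byte is high.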
Speaker 1:Okay, yeah. So now it doesn't really matter, unstructured data, structured data, you can handle everything at the same level of competency?
Speaker 2:I think that the structured data side is evolving really quickly. As you point out, we had these analytics workloads running on top of Parquet for years and years. The thing that's changed in the past five years, and especially in the past year, is this move toward open table formats, and Iceberg in particular. What those formats do is take this thing that's existed in S3 for ages, which is Parquet, as a really static representation of tabular data, and change it into a table where you can do inserts and deletes.
Speaker 1:And that's a columnar kind of format?
Speaker 2:Totally.
Speaker 1:So, maybe for our listeners: how is that different from row-based? How is columnar different from row-based?
Speaker 2:Okay, well, so in both cases you're talking about a table. The distinction is that in a columnar data structure, which Parquet is, and we've had this for a while, rather than writing data out in rows, like a CSV, a comma-separated list of things, you store each column contiguously. Frequently a query only operates on a subset of columns. So you'll say select first name and last name from this huge table, and when you go to process that against rows, to get first name and last name you ultimately have to read all of the rows. Probably the biggest insight of Parquet was to take those columns, carve them out into blocks, and allow you to package a group of rows with their columns clustered together. So now, when I do that select query that only needs a subset of the columns, I can very efficiently read out only that set of column data.
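A toy illustration of that row-versus-columnar distinction (plain Python, not Parquet itself): the same table stored both ways, counting how many cells a two-column projection has to touch.

```python
# The same small table in two layouts, and the cost of
# SELECT first_name, last_name in each.

rows = [
    ("Ada", "Lovelace", 1815, "London"),
    ("Alan", "Turing", 1912, "London"),
    ("Grace", "Hopper", 1906, "New York"),
]

# Row layout: every row is read in full, even for a two-column query.
cells_row_layout = sum(len(r) for r in rows)

# Columnar layout: each column is contiguous, so the query reads
# only the two columns it names.
columns = {
    "first_name": [r[0] for r in rows],
    "last_name": [r[1] for r in rows],
    "born": [r[2] for r in rows],
    "city": [r[3] for r in rows],
}
cells_columnar = len(columns["first_name"]) + len(columns["last_name"])

print(cells_row_layout, cells_columnar)  # 12 6
```

With four columns, the projection touches half the data; real analytics tables often have hundreds of columns, which is where the savings become dramatic.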
Speaker 1:Okay, yeah. And then last re:Invent, I think, you announced something quite interesting.
Speaker 2:The tables, S3 Tables.
Speaker 1:Yep, because that's Iceberg on top of Parquet right.
Speaker 2:Well, the Iceberg spec can use multiple file formats, but it predominantly uses Parquet as the underlying data store. What Iceberg does is take Parquet and, like I said, make it more of a first-class primitive as a table. So now you can, you know, update a row with a new value, which in Parquet you can't do; you have to overwrite the entire file.
Speaker 1:But it's an overlay right. So you're still using the same foundation, but there's an overlay now that actually does more with the same data.
Speaker 2:If you're familiar with Git, Iceberg is actually a lot like Git as a structure, right? You have an existing set of data and you do a commit, which is a snapshot that contains a patch, effectively. And that property is actually both what makes Iceberg amazing and kind of painful.
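That Git-like structure can be sketched in a few lines. This is a hypothetical miniature, not Iceberg's actual metadata spec: each commit is a snapshot recording only the data files it adds or removes, and the current table state is derived by walking the chain.

```python
# A deliberately tiny model of the Git-like structure described above:
# the table is a chain of immutable snapshots, and each commit records
# only the data files it adds or removes (a "patch").

class Snapshot:
    def __init__(self, parent, added, removed):
        self.parent = parent          # previous snapshot, or None
        self.added = set(added)       # data files introduced by this commit
        self.removed = set(removed)   # data files logically deleted

    def live_files(self):
        """Walk the chain to compute the current set of data files."""
        base = self.parent.live_files() if self.parent else set()
        return (base | self.added) - self.removed

# Three commits: an initial load, an append, then a delete-and-rewrite.
s1 = Snapshot(None, {"data-001.parquet"}, set())
s2 = Snapshot(s1, {"data-002.parquet"}, set())
s3 = Snapshot(s2, {"data-003.parquet"}, {"data-001.parquet"})

print(sorted(s3.live_files()))  # ['data-002.parquet', 'data-003.parquet']
```

Note that nothing is ever rewritten in place: the "delete" in the third commit is just a marker, which is exactly the property that makes snapshots cheap but lets small files pile up.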
Speaker 1:So the problem with Iceberg is it creates a lot of complexity.
Speaker 2:Exactly. So if you have a workload that's doing a lot of inserts over time, it actually creates, in some cases, thousands or millions of small Parquet files. And now, when you go in with Spark over S3A or whatever, and you start talking to your table, I guess over FileIO and Iceberg, what would have been that single nice read of the column is now a thousand or a million accesses to different objects. Iceberg added this facility called compaction.
Speaker 1:Just to get the complexity straight: where does this add complexity? Probably in terms of management, or maybe even in terms of the amount of storage that you need? Maybe that's even a thing.
Speaker 2:Absolutely. It emerges as a storage overhead, for sure. If I'm replacing data, I need to actually go through and process the deletion, because I don't actually delete, I just create new snapshots. And it emerges as a performance hit, because now, instead of accessing one object, I have to access loads and loads of objects, and my performance is limited by the slowest object that I access. So, to make Iceberg perform well, Iceberg introduced this notion of compaction, which is common in data warehousing, where you go through and eliminate a bunch of the snapshots by folding things together.
Speaker 2:But in Iceberg, because it was sort of an open-source library, customers that were running Iceberg were having to do their own compaction, and so at best that was a whole bunch of operational work that they didn't want to do.
Speaker 1:Sounds like something that open source suffers from a lot anyway, right? Because as long as nobody really takes charge of taking care of that complexity, a lot of open source adds a lot of complexity to existing stacks. And it makes sense, because everybody's doing their own thing with it, right? So I get it.
Speaker 2:That's true. There's an operational burden to doing it, and some customers are happy with that; at a certain scale there's value in being able to run that stuff yourself. But for the average customer, we found that it was challenging. The other thing that several customers called out was that this task of compaction is actually a pretty sensitive, complex task, and because it's managing references inside what's basically a file system, there's a risk of, you know, messing up references or causing durability-related problems. And so they were asking us: this is your bread and butter, you're really, really good at maintaining and running these systems, but now suddenly some of the complexity is lifted above the API boundary; can you step in? So S3 Tables, which we launched at re:Invent, basically introduces a first-class table abstraction that we manage using Iceberg.
Speaker 1:What's the significance of the fact that it's an open-source kind of layer on top of an existing stack that you've built? Is there relevance to it? Did you choose it for its open-sourceness, or was it chosen by your customers, and because it's the most popular one, you went with that one? Does that make sense?
Speaker 2:No, it's a great question. I mean, with Tables we listened to customers, and some of our largest structured data customers were really leaning on Iceberg. I think one of the bits of value that we see is that this table abstraction kind of needs code on the client side to really work well. And so by using an open standard for it, we really make the effort that's required to write that client-side code worth it, because it works just about everywhere.
Speaker 1:Maybe a dumb question, but are you upstreaming any of this, any of the things that you're working on?
Speaker 2:Yeah, we're working on that. Last week I was in San Francisco at the Iceberg Summit and met with a whole bunch of the folks doing work there.
Speaker 1:So you're also making Iceberg in general better because of what you're doing with S3 Tables.
Speaker 2:Absolutely, absolutely. Tonight I'm going for dinner with the DuckDB folks; DuckDB is based in Amsterdam, and we've been working with them on adding Iceberg support to Duck. It's proving to be a really interesting space.
Speaker 1:So, getting back to S3 Tables and the launch: what's been the initial reception of it? What are you hearing from the market?
Speaker 2:It's been really, really positive. A lot of the feature work that we do is driven by customers, and in some cases we launch features where customers are very, very clear that they want it. So we usually go into re:Invent, and this is across all of AWS, knowing that a feature is going to be well-received. I think we were surprised with S3 Tables how well-received it was.
Speaker 1:To be honest, for me it may not have been the sexiest, for want of a better word, announcement of re:Invent, but I think it was one of the most important ones. Even though you had the Aurora announcement and all that, some very fancy stuff, this is so fundamental to what a lot of companies do. It's super interesting to see.
Speaker 2:I think we're still kind of internalizing a lot of the feature and what it means. But one thing that has dawned on us through the launch is the fact that this interface, which sits kind of in the middle of a database, hasn't really existed so much before as a service, right? The idea that the storage half of the database can be a first-class service, and that you're free to bring whatever engine you want to it, is really quite interesting.
Speaker 1:It is also quite modern, right? Multimodal, you know, all these things, they're very hip and happening. Yeah, that's true. Creating layers on top of things that you can connect more things to, that is a very modern thing to do. But this is very niche, very deep down.
Speaker 2:It is. But it's interesting, through things like the DuckDB integration and Daft and a lot of these tools, the interesting thing that we're starting to see happen with S3 Tables and Iceberg. There was this set of applications, like Spark and a lot of the analytics frameworks, that were storing data in Parquet on S3. There were operational databases that were doing change data capture to push data into S3 and then do warehouse-style queries on it. And then there were first-class applications that were storing internal application state in Parquet or, in some cases, in embedded databases like SQLite. And now they're kind of coming together.
Speaker 1:Yeah, and honestly, something needed to be done about all these big, massive data lakes and data warehouses anyway, right? Because there was a time, and luckily I think that's more or less ended now, when most companies just collected lots of data and then decided what to do with it. So if you can streamline even part of that process, that's a big win in itself, right?
Speaker 2:I think we're still at the beginning of this one.
Speaker 1:Oh, yeah, yeah.
Speaker 2:There's a lot of really exciting work left to do in terms of actually making S3 tables and Iceberg excellent at what it does.
Speaker 1:So what are customers asking for in terms of where should you take it next?
Speaker 2:They're pushing us in a few directions. One is that there are a lot of features that exist in S3 for objects that need a little bit of work to bring over to tables. Things like replication across regions, which is a little bit more complex on tables because there are dependencies between the objects, whereas in unstructured S3 it's not a concern. Performance: there are some workloads for Iceberg where performance is not as strong as it needs to be, and we'll see Iceberg improve on that. And then database features, like statistics and stuff like that, for performance as well.
Speaker 1:I think we're almost 20 minutes into this chat and we didn't even mention AI. What's happened? Let's move in that direction a little bit. The impact of AI on S3 in general: what do you see? What does it require from S3?
Speaker 2:So there's actually a great transition from the tables discussion into a lot of the AI discussion, which is that, I mean, you mentioned the case of customers storing huge amounts of data in warehouses and stuff.
Speaker 2:A pattern that we are seeing across all sorts of industries is customers have built enormous data sets in S3.
Speaker 2:They're often documents, case studies or service manuals, or they could be large image or video repositories. They're fit for purpose for one set of applications within the business. But the business now realizes that there's enormous value in the data they've built over years, and it opens up opportunities for new applications and gen AI integration. But often the bridge to get from that data to really working firsthand with a foundation model and generative workloads is that they need the ability to augment that data with additional metadata, and to search it with RAG, retrieval-augmented generation, type tools. So an area where we're investing a lot of work right now: at re:Invent we launched a feature called S3 Metadata. It uses tables as a construct to build a list of metadata about all the objects in a bucket, which you can access over SQL.
Speaker 1:Historically, unstructured data or object storage has always been very good at metadata, right? That's been one of the key differentiators. So can you see this as bringing that kind of quality, the historical quality of object storage or unstructured data or whatever, up higher in the stack towards more of the...
Speaker 2:Okay, yeah. So we basically take that huge, potentially petabytes of, unstructured object data and use the table interface to build an index on top of it. Customers can then customize it, and it acts as a bridge to help search and find the right data.
Speaker 1:That sounds good. And also because all types of files are in there anyway, right? So the multimodality of the underlying storage buckets really helps as well, because AI is, by definition, getting more and more multimodal as we speak, right? Yes, absolutely.
Speaker 2:Well, one of the areas where we've seen a lot of the AI models advance is being able to bridge from natural language to SQL queries. So by storing metadata in a table, you can actually see interactions where the user says something to the model in a prompt, the model knows about the tables, it formulates a SQL query to go and ask the question, and then it comes back and says: here's the object.
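The idea of querying per-object metadata over SQL can be sketched with an ordinary SQL engine. The real S3 Metadata feature maintains an Iceberg table for you; here `sqlite3` stands in, and the object keys, column names, and values are invented purely for illustration:

```python
import sqlite3

# Hypothetical per-object metadata rows; in S3 Metadata, this table is
# an Iceberg table that the service maintains for you.
objects = [
    ("reports/q1.pdf", "application/pdf", 120_331),
    ("videos/demo.mp4", "video/mp4", 88_412_009),
    ("reports/q2.pdf", "application/pdf", 131_008),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (key TEXT, content_type TEXT, size INTEGER)")
con.executemany("INSERT INTO metadata VALUES (?, ?, ?)", objects)

# The kind of query a model might formulate from a natural-language
# prompt like "find the PDF reports":
pdf_keys = [k for (k,) in con.execute(
    "SELECT key FROM metadata WHERE content_type = 'application/pdf' ORDER BY key")]
print(pdf_keys)  # ['reports/q1.pdf', 'reports/q2.pdf']
```

The point is the shape of the interaction: the prompt is translated into SQL against the metadata table, and the answer comes back as object keys to fetch.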
Speaker 1:And when we talk about AI, you also talk a lot about vectorization, data vectorization, and that usually creates a lot of bloat. You increase the volume of your storage, of your data, quite significantly; I've heard some people say up to 10 or 15x, or whatever. It can be very, very expensive. Is that something that you tackle as well? Is there something you can do about this?
Speaker 2:Certainly. The additional storage capacity that you need to store embeddings is significant, and it's interestingly variable by media type. The largest multiplier, from what we've seen so far, is actually on text. There are incredible embedding models for video, but because video is large, the embeddings are relatively low overhead, whereas text is so dense with information that there's quite a bit of space needed. So yeah, we're doing a lot of work right now looking at what opportunities there are for S3 to help out with those types of workloads.
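A back-of-the-envelope calculation shows why text embeds so much more heavily than video. All the numbers here, the embedding width, chunk counts, and media sizes, are illustrative assumptions, not AWS figures:

```python
# Rough arithmetic behind the text-versus-video overhead described above.
# Every constant is an illustrative assumption.

EMBED_DIM = 1024          # assumed embedding width
BYTES_PER_FLOAT = 4       # float32

def overhead(source_bytes: int, chunks: int) -> float:
    """Embedding bytes divided by source bytes."""
    embedding_bytes = chunks * EMBED_DIM * BYTES_PER_FLOAT
    return embedding_bytes / source_bytes

# 100 KB of dense text split into 200 overlapping chunks:
text = overhead(100_000, chunks=200)
# A 1 GB video embedded once per scene, ~600 chunks:
video = overhead(1_000_000_000, chunks=600)

print(f"text ~{text:.1f}x, video ~{video:.4f}x")
```

Under these assumptions the text embeddings dwarf the source (roughly 8x), while the video embeddings are a rounding error against the raw footage, matching the pattern Warfield describes.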
Speaker 1:Because it is very important, right, absolutely, especially if you start using more and more AI.
Speaker 2:It's almost table stakes for being able to do that.
Speaker 1:You don't want to. At least, if I were in an organization, I wouldn't want to pay, I don't know, 10 times the amount of money for storage, because it's still the same data; you just extract more insights from it.
Speaker 2:You're totally right, and it's not just 10 times, because in a lot of cases the vector stores are actually in higher performance storage. The indices that you need to do search on embeddings often carry that data up into DRAM or SSD, and so not only are the vectors larger than the source data, but the storage is considerably more expensive.
Speaker 1:Yeah, and I've also heard that it also matters how you vectorize things, how big the spaces between the vectors are, right? There are so many niche things that you can work on. But it is really a point of attention for you as well to actually….
Speaker 2:Yes, this is a place that we're doing a ton of work on right now.
Speaker 1:Well, that sounds good. So, finally, we already talked a little bit about where you're going, but can you give the listeners a little bit more on what to expect next from S3?
Speaker 2:Sure.
Speaker 2:I mean, if we go back to what you started with, in terms of the history of S3: we had that storage locker across town, and then the pull that we saw through analytics, and now through inference paths and generative AI, where you really want to know something out of the set of data you store in S3, with very low latency. What we're seeing is S3 continue to be pulled closer and closer to application code. So the two areas that's really going to drive us to work on over the next five or ten years are making sure that we have latency and performance that match those needs, so that S3 really becomes a first-class active storage type, and that we fill out all of the ways that you might want to access that data, all of the APIs and ways that you connect into data, so that you don't have to choose up front how to access it. You should be able to access it however you need.
Speaker 1:Do you still see maybe integrations with on-prem? Because a lot of companies are still saying: look, we still want some on-prem storage for some of our workloads. Do you see that vanishing or disappearing, or moving to the cloud, to your environments, as well?
Speaker 2:We certainly have many customers that work across both storage in their own data centers and with us. I think I see those conversations happening less frequently, and a lot of the real drive on storage work seems to be directly toward the large-scale elastic storage services.
Speaker 1:Yeah, and obviously it also depends where you run your applications right If you run your application in the cloud, it makes sense to have your storage there.
Speaker 2:Absolutely.
Speaker 1:Yeah, all right, well, thanks a lot for sitting down.
Speaker 2:Thanks for having me. It was fun to chat.