Building a creative AI community: Extended interview
This interview is part of our Season 3 research sprint on creative AI for the music industry.
Assembling our Season 3 research report was a gargantuan effort combining crowdsourced research, hands-on community management, and custom tech development.
In this special extended interview, Water & Music founder Cherie Hu and tech lead Alexander Flores chat through how we designed our Season 3 research sprint from the ground up, and what the experience taught us about the future of music AI. Along the way, you’ll go behind the scenes on what it takes to build a creative AI community, as well as the roles that AI can play in organizational strategy at large.
You can listen via the SoundCloud link below, and/or read along with the full transcript on this page.
Transcript
Cherie Hu: Hi everyone. Thank you all so much for listening to this very special thread of our Season 3 research sprint on Creative AI. You may have found this by looking at the rest of our report, exploring all the other different themes that we’ve been covering, including legal issues, ethical issues, and commercial opportunities and business models around the latest creative AI tools for music.
What isn’t really covered in any of those threads, but what was a really critical part of this season and what I’m very excited to dive into in this episode, was how we created more hands-on experiences around the very tools that we were covering, in a way such that we could directly engage with our members and introduce a lot of them to creative AI tools for the first time in a friendlier, safer, more focused environment. And also come across some interesting takeaways and learnings in the process around community design and trying to build a very fun collaborative research experience around such a fast-moving, fast-evolving technology as creative AI.
With that, I am very thrilled to be joined by our tech lead at Water & Music, Alexander Flores, who joined the team just over a year ago and who’s been the main technical mind behind everything that’s been happening around the season — from fact-checking all of our AI articles from a technical standpoint before they go out, to helping build lots of different tools around creative AI that we’ll dive into in this episode.
Alex, thank you so much for joining.
Alexander Flores: Yeah, thanks.
Cherie Hu: As a matter of introduction, would love for you to talk through how you first came across Water & Music, and the scope of things you’re working on right now within the community and the org.
Alexander Flores: Oof. I feel like you could fill a whole hour with that, just context of what we’re doing.
How I came across Water & Music — I think this was pretty early on in the pandemic. Everything had just kinda switched online. I remember wanting to aggregate online events because I was just getting really annoyed having to jump between a bunch of different places to keep up. And randomly somebody on my Twitter feed had retweeted your database that was already basically doing this. And I was like, oh, cool. So I started following you then. And then it wasn’t until one of my artist friends was like, oh, hey, how do I like, keep in touch with my fans? What are different ways to engage? And I remember you had posted one infographic that was really good of all the different platforms that people were exploring. And so I was like, all right, I feel like this is the place where all this info is and we should probably just jump in and see what’s going on there. And so I joined with my buddy; he ended up trailing off, but the community for me was interesting enough that I stuck.
I think right now my main focus is on contextualizing all the information that we’re generating inside the community and making it more approachable. We’re definitely leveraging a lot of ML models and experimenting with them to try and extract information from just the raw content that we’ve been generating, but then also realizing that a lot of the value of the information, even if we have it in some like central database, is not easily accessible or at least put into the context where people would most use it. And so mapping out how we can do that — a lot of explorations around language models and natural language processing, and a bunch of really deep niche topics as far as parsing chats and like segregating out different conversation topics that are overlapping. It gets pretty interesting.
What else are we doing? I feel like we’re doing a lot.
Cherie Hu: We are doing a lot. Always. I mean, you were just getting into this now, but a lot of Discord stuff. Just trying to figure out how to make conversations easier to navigate, and synthesizing that with our articles, as you mentioned.
Alexander Flores: Oh yeah. I feel like every DAO’s dealing with this right now, with how many times I’ve heard, “oh, I hate Discord, but like, where are we gonna go?” And so trying to wrangle some sort of interface that makes it more approachable.
There’s a bunch of references for this in academia. So there’s this guy I follow on Twitter called Max [Krieger], and he had this really interesting paper called “Chatting With Glue.” I remember that was a big inspiration for a lot of people in that community, as far as different ways you could have a conversation and then extract information from it — a richer dialogue. And then that paired with some other things I saw through Microsoft Research of how you have communities summarize different segments of conversations so that you can dive in as deep as you have time for, as deep as you want to go, but still have the connections and references that are happening.
I think that specific vision… A lot of things that have been tried kind of fall apart when you need a lot of human input to drive that synthesis. But now with language models, I think it gets a lot more interesting to iterate on those ideas again.
Cherie Hu: Yeah. Yeah — actually, there are a lot of [00:05:00] ways that both of us, and also a lot of people on our research team for Season 3, were already using AI tools and getting hands-on, specifically for information synthesis and not just for music, which I definitely wanna touch upon later.
The Future of Music and Creative AI
Before that, I guess as a way to now segue specifically into thinking about creative AI … I think what’s been really great about having you involved in this season — not just on the technical support or development side, but also on the deep research and kind of context setting side — is that you were super early to pick up on a lot of creative AI developments, especially those that were happening in the visual art and text worlds. The music side, I think that was definitely newer to all of us.
But just thinking with your background going into this season, did you have any personal goals or hypotheses or just things you wanted to learn more about in terms of music specifically, especially comparing it to other domains that you might have been more familiar with?
Alexander Flores: With music specifically, I would say I’m a lot less familiar with the raw information representation side of the AI. I don’t think we ever got into those technical details as part of our output, but… it’s pretty easy to grok how to process an image and how you build a model around, like, streams of pixels. But audio seems a lot different, in a way that I actually didn’t know coming in — was it more complicated or less complicated? Intuitively you wanna say more, because there’s a lot more information. Like, if you’re talking about a pixel, it’s zero to 255, times three for RGB. And so there’s a fixed amount of information that can be captured in a pixel. But for a moment of audio, there’s a lot of frequency content, like hertz frequencies, but that can be compressed and then re-upscaled. And then also there’s temporal consistency. So like, music isn’t gonna change super drastically from one frame to another. There’s something you can compress across time too. So on that side, I was curious to just kind of dive in more in a focused mode to figure out what was going on there.
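As a rough illustration of the representation difference Alex is describing — bounded pixel values versus a dense stream of audio samples — here’s a minimal numpy sketch (the image size, sample rate, and clip length are arbitrary):

```python
import numpy as np

# A 512x512 RGB image: each pixel is three integers, each bounded to [0, 255].
image = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)

# Three seconds of CD-quality mono audio: 44,100 samples per second,
# each a continuous amplitude value rather than a small bounded integer.
sample_rate = 44_100
audio = np.random.uniform(-1.0, 1.0, size=3 * sample_rate).astype(np.float32)

print(image.shape)  # (512, 512, 3) -> ~786k bounded values for the whole image
print(audio.shape)  # (132300,)     -> 44.1k continuous values per second of sound
```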
The ethics conversations on music and the rights, especially the rights, I was really curious to learn how companies are approaching it. ‘Cause I feel like music as an industry also has a lot more infrastructure built up around that. And so the clash is gonna be a bit more drastic.
Cherie Hu: When you say infrastructure, do you mean like legal infrastructure — like copyright detection, claiming, all that stuff?
Alexander Flores: Yeah, but we also have more unified infrastructure like PROs and just like how you collect rights, and like compulsory rights are already getting assigned to different orgs that track this stuff. There’s a whole institution around managing rights in music that I feel like most other industries don’t have.
Cherie Hu: Yeah. Certainly not in visual, which is kind of a free for all. Was there anything that you learned throughout the process that surprised you or was unexpected in terms of how the music AI landscape has been evolving?
Alexander Flores: To be completely honest, not really. I was a little surprised at how long it took audio to pick up compared to visual. But then again, if you really think about it, it’s not that surprising as far as availability of data and desire, I guess.
This gets into something we’ve talked about before, like people’s need to have music versus their need to generate art or need to generate text on their own. Needing to generate your own music is a lot less necessary for the average person. We already have a glut of music content out there that’s very easily accessible. You can find the music that serves your current needs relatively easily. But to generate visuals or texts, it’s a lot harder or there’s a lot more direct need for people.
So in that sense, I understand why music is behind, although it is just starting to catch up with MusicLM and these music models, so that’s gonna be pretty exciting. I feel like a lot of our big revelations are about to come. Once people start bolting different things to these base models, that’s when I think it’ll get interesting.
Cherie Hu: Especially with what you’ve been following in the visual and text/language model worlds, both of those worlds are also going through their own respective revelations right now — whether that’s new paradigms of accessing information or tooling, or lawsuits of course, with the various lawsuits against Stability recently.
Even though big revelations in music AI are still yet to come, are there any kind of critical changes or paradigm shifts you’ve seen from other domains that might be worth considering or preparing for in the music world?
Alexander Flores: What’s the project that just came out yesterday? Open Assistant. I feel like that is a drum I’ve been beating for like the music side of things.
Lemme step back. I think a lot of the existential fear that comes from artists is having their work used to train an AI that generates music instead of them. And [00:10:00] the low-hanging solution is like, all right, well I’ll consent to the data if it’s gonna happen anyway, but give me a portion of the revenue that comes from using this model. And so people are already iterating on the ideas of having licensed models, and remunerating the people that help contribute to that. I dunno how the pooling would work or whatever, but like the concept’s there.
But I was saying for a while that usually what’ll happen from the builders is if there’s any kind of friction, they’ll build around it. Unless you have brand value to the point where you can sell a Splice pack on your own based off of your brand, I feel like that won’t work. There’s a lot of smart people, and making music isn’t exclusive to people that have made it in the music industry. It’s everything else around that path to a professional career that kind of makes it exclusive.
But you have a lot of smart people that are programmers, that are researchers, or that have friends, and they will have a desire that’s like, hey look, we want an open model so that we have no restrictions and we won’t have to pay licensing for all these experiments that we’ll have to run thousands, millions of times. And so we’ll come together as a community to generate these models.
That’s already happening on the tech side. So you have this project called Open Assistant, where they’re basically trying to recreate ChatGPT, and specifically one of the parts that really unlocked ChatGPT was this sub-AI where they got humans to label and generate responses and interactions and rank them, and then they took that and made an AI to like bootstrap itself. And so there’s an open project right now to build that specific part of ChatGPT that’ll help build a bigger model.
So I feel like it happens without fail. Like so many times I’ve seen it, where it’s like, alright, let’s monetize in this way, and then the community will come up together and be like, alright, well can we make an open version that just walks around this? Because our needs to scale and our needs to experiment are gonna be limited by having to go through like this checkpoint.
Cherie Hu: I guess this has happened throughout music tech history. This dynamic is not anything new of a company, say, a rights holder or trade org, imposing restrictions or like ring fencing usage of music or licensing of music online, and people just finding ways to work around them because they’re fans or they’re passionate about an open music culture, and they’re going to find whatever means necessary to promote that.
Is that what you’re saying? Like setting expectations that certain kinds of restrictions are not gonna be as effective as you might think.
Alexander Flores: That last sentence specifically of certain restrictions not being as effective as you think, I think hits it.
There’s two different things we could be talking about. It’s not like old piracy days of like, “oh, I like this song, let me just pirate it because I don’t wanna buy the CD.” It’s more like, in the AI generative space, being able to even make a base model that has the functionality to do a bunch of things — whether it’s input text and generate music, or input a photo and generate music, or style transfer between two different songs. That core piece of code, that core model — that core piece of technology — doesn’t have to be styled for any specific artist. This is different from ripping an artist’s song because you’re a fan. This is more about generating an artist-agnostic model that is foundational to generating anything.
And so I think a lot of people in the music industry were like, oh, we’ve heard that you need a lot of data to train a good model, and we have the data. And so if we can contribute our data and then get a proportion of the rights, then we’ll be fine. But I think the community will eventually — I mean for every piece of software that is expensive, there’ll be like an open-source version of it because people wanna build without that restriction or people taking that cut. Just because it does limit how people can experiment or build off each other. Like the rest of the world doesn’t have the music industry infrastructure. Even I would say the music industry itself doesn’t have the proper infrastructure to manage themselves.
Cherie Hu: Yeah. That leads to a more generalized version of what we’re talking about. Separately we’ve talked about a lot of misconceptions around what’s actually happening with these models. One of the maybe most notorious recent examples of this was with, I believe, the lawsuit that was filed against Stability and a bunch of other visual AI companies in California. There was a section in the lawsuit that had a spiral-shaped diagram that was meant to represent the distribution of various properties of images in the training data behind one of these models, but it was misinterpreted as an actual image of a spiral, and used in a completely off-the-rails way to make an argument that was not related to what the original image was actually trying to convey. And there’s a really good anonymous breakdown of a lot [00:15:00] of the misconceptions or errors that were in that original lawsuit.
As this tech becomes a lot more mainstream, I’m sure — especially because of how powerful it is and how fast it’s moving — there’ll be a lot of misconceptions about what’s actually happening that might lead to mismatched expectations about what the tech actually can or cannot do. In the worst cases, it could lead to harming certain people, like spreading misinformation. Just some things on my mind.
Are there any especially common misconceptions that you’ve seen about how these AI models work? Especially thinking for this season and whoever’s listening to this episode — which is probably someone working in the music industry and just getting their feet wet — like past mistakes in understanding this tech that you would like for us to avoid in the future?
Alexander Flores: Around generative models in general? I think the biggest misconception is that the training data that goes into training a model fits inside the model. Really what’s happening is that there are more abstract representations of that data that get encoded, that sink into the internal representation of the model, and it’s not specifically the image or the song or whatever else that you’re putting into it.
Now, there are caveats to this. There is a concept called overfitting a model, where you really hone the internal representation of the model to where it really only outputs that same piece that you put in. There was actually a recent paper I think that came out like last week that showed some researchers were able to reproduce almost exactly some of the training data, but the percentage is absurdly small. It’s 0.001% or something of all the data that went into the model. And so they’re trying to figure out why it overfit in that specific area or something.
But in general the data that goes into it isn’t encoded inside the model. It’s more like these abstract representations — like for images, like what is an edge or like composition and all that stuff.
And so it’s a lot harder to use your traditional IP arguments. I think one side has an issue of trying to stand on the arguments that have worked before around IP and protecting IP and what’s permissible and what’s not. And the landscape’s completely different as far as how the media’s being generated. But on the other side, I feel like they’re taking that argument at face value instead of understanding what the real problem is.
There is a real conversation to be had about artist data being used — whether it was legal or not, like the consent around that. And then the general implications of what happens when these models become very full-fledged and society in general is gonna have to come to grips of how we treat this and, like, is it better or worse? It could be better if it enables a lot of people to express themselves. But yeah. Do you just leave it up to natural events happening as far as what’s gonna happen to how we value art or how we value making art? And then in music specifically, what kind of models would support the current music industry if this reality comes to fruition? Does everything kinda shift more towards a classical music-like, sponsored subsidized model for people that wanna create art? Or do you accept that anyone can create art, and the path to a career in art is like less of a thing and it centers around brands?
There’s a lot of things that might go in flux, but that’s also speaking more towards pure art generation, which I feel like in the short term, like maybe five years, might not be the biggest way AI disrupts music. I feel like a lot of the disruption is gonna come around things that happen around pure audio generation. Like I know Sony has this thing where you can upscale your audio on one of the new Xperia phones and it sends it to their servers and it processes it. So you don’t need to book studio time if you wanna have basically studio-level quality vocals. So like little things like that, I feel like actually unlock a lot of opportunity, but it doesn’t solve the general supply-versus-demand issue that I feel like is existential to the music industry right now.
Cherie Hu: Follow-up question on overfitting. Just to clarify technically what’s happening for people who are listening. So, fine-tuning models on one’s own back catalog or sample library, as a way for artists to use AI in general, is very much in demand. We ran a fine-tuning workshop internally within Water & Music that was super popular. And I know a lot of producers are interested in building their own AI models.
Are you saying even within that context, it’s not like the data’s like actually incorporated into the model itself? There’s still some abstraction that’s happening there?
Alexander Flores: Yeah. I guess to get a little deeper on what fine-tuning is, if you can imagine … I feel like this analogy is only gonna work for a very small percentage of people, but like in the game show The Price is Right, there’s this game called Plinko. Or I don’t know if you’ve seen those little games where you have beads that fall down and hit a bunch of pegs and then they kind of bounce around, and you’re trying to get it to land in one specific slot at the bottom.
Cherie Hu: Uh-huh.
Alexander Flores: That’s kind of how you can [00:20:00] think of an AI model. You’re dropping in a piece of input data at the top, and then it’s hitting a bunch of pegs, and then you’re kind of trying to guide it towards a specific output. And training the model is just kind of like shifting those pegs around so that it bounces in a way that still generates some random output — because you don’t want it to generate the same thing — but you kind of wanna guide it to a specific place. And so fine-tuning is like taking the last layer, like the thing that touches that ball last, and really shifting those around to really guide it to a specific area. But those pegs are not pieces of data. They’re like concepts. And so there have been artists that fine-tune the models on themselves and mess with it. And I’ve heard that they’ll listen back to the output and it’s interesting what abstractions the AI kind of picks up as like their style. And it’s not until they hear somebody else try and recreate it where they’re like, “oh yeah, I guess that is something I do.”
So in that sense, yeah it’s still very much an abstraction. But like I was mentioning before, you can abuse the model, like really hone it in, to where it is very much outputting the input data. But it’s not because the data is in it directly, but it is encoded very strongly towards it.
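To make the “shift only the last pegs” idea concrete, here’s a minimal, hypothetical PyTorch sketch of the freeze-everything-but-the-final-layer flavor of fine-tuning Alex is describing (real fine-tuning setups vary a lot and often adjust more than one layer; the toy network and loss below stand in for a real pretrained model):

```python
import torch
import torch.nn as nn

# A toy stand-in for a pretrained generative model; in practice this would be
# a diffusion or transformer model loaded from a checkpoint.
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128),  # the final layer -- "the thing that touches the ball last"
)

# Freeze every "peg" in the model...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the final layer, so training nudges just those pegs.
for param in model[-1].parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One illustrative training step on a batch from the artist's own material:
# only the last layer's weights move; the shared concepts underneath stay put.
batch = torch.randn(8, 128)
loss = ((model(batch) - batch) ** 2).mean()  # placeholder reconstruction loss
loss.backward()
optimizer.step()
```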
Cherie Hu: Got it. Thank you for that clarification.
The second question was the existential question of like, where the value’s gonna be in the music industry once this technology becomes more mainstream. We’ve asked this in most of the interviews we’ve done as part of the season, always trying to get people’s perspective on the key macroeconomic issues around generative AI at large — looking at things like content oversaturation, wealth inequality being exacerbated through these tools, skill development, and unemployment or job displacement. Those are a few of the examples that are top-of-mind for a lot of people in music who are monitoring these tools.
You said specifically the value no longer is in the music itself, but is more in the brand or the interactions around the artist’s name, or the community around it. Could you speak to your own mental model for how you’re thinking about value accrual or distribution and like how that will look in the future for creative AI at large? Like in a world where the cost of generating these things will be close to zero, if not already at zero from the user perspective. What has value and why?
Alexander Flores: I feel like it actually hasn’t changed. I guess there are different kinds of music and different uses for music. But the music that we typically think of, it’s having some sort of emotional connection to the listener. And so what pieces make up that formula, I feel like still won’t drastically change — fan-to-artist engagement, any personal connections that people have of like the song or like where they heard it or whatever.
If everyone could have a model that generates the music that they perfectly need to hear in that moment, at least the way we imagined it, it wouldn’t feel great because there is no shared narrative. That’s actually something I have to give the big capital I Music Industry credit for, is its ability to shape a unifying narrative across a culture. And that I feel like might get lost if people lean too much into generative AI. I think because of that, it won’t come to fruition exactly in the way people are fearing. Though I do think the fears are very well warranted — I’ll get to that in a second.
But besides that, we already have an oversupply issue. If you know how to navigate SoundCloud, you can find amazing music that has less than a hundred plays, less than a dozen plays. So the idea that artists are gonna be drowned out, like it’s already — like the pipeline to have a successful career depends on certain things that I feel like AI is not gonna alleviate … Things like brand are still gonna be very key. Things like overall narrative, things like storytelling, things like context … It’s a very complex emotional system, and all you’re really trying to do when you’re building this song is to communicate some level of emotional connection to somebody else.
That side of music hasn’t been very sustainable for a while. And I think a lot of artists have made their careers work through other types of work that are higher margin. So actually I got a lot of these thoughts from someone I know who makes famous podcast music. It’s really good. And he says that pays the bills for him to make the music that he actually wants to make. Not that he doesn’t wanna make podcast music, but like, there’s like different types of music, right? And so the podcast music, because it’s able to pay his bills, means it’s very high-margin, and [00:25:00] so for the people on the other side of that deal, there’s a lot of market opportunity for someone to undercut that space. And so that’s where the danger I think comes from, is that that piece of the pie that is actually supporting the “real” music that gets generated will get smaller. And so the general concept of how “normal” art is gonna be generated is gonna be interesting.
I wouldn’t say that’s exclusive to music too. I feel like that’s a general mental model to have towards AI. Like everybody’s shocked, like why did all the things we said were “AI-proof” become the main things that were tackled by AI? Creative writing, art, like all the big concepts. And I think it comes down to the market opportunity to make that stuff more commodified. There’s already models that like can ingest a lot of medical data and you can talk to it kinda like an advanced version of WebMD to diagnose and give recommendations. And in a way that a single doctor I feel won’t be able to ingest all these new papers that are coming out and all the new information that’s coming out real time.
And so there’s stuff happening there — there’s stuff happening for lawyers, consulting, like all these very expensive, high-margin industries — I think are feeling the most threat because there’s the most demand to have that stuff more commodified and accessible. There’s just like pure capitalism forces at that point. It’s like, where’s the biggest opportunity to focus our efforts now?
Cherie Hu: I think that’s been the biggest misconception or dent in my mental model I’ve had to revisit, or I think a lot of people at large have had to revisit. I think the previous iteration of conversations and public discourse around automation’s impact on jobs was very much like low-level factory work, truck driving, roles that would stereotypically be considered unskilled. But what’s driving my mom to learn about ChatGPT is like reading about people using it to write high school essays and get A’s on them.
So the point of impact that people are focusing a lot on now, or where it is actually driving change in paradigm shifts, it’s like actually at the top layer in terms of perceived value of the work in our current capitalist society, rather than the lowest level. So yeah, I’m still processing that myself in terms of that being unexpected.
Alexander Flores: Yeah, I mean, I guess a basic capitalism model is like, what is valued typically isn’t what is hard. It’s how replaceable you are in that system.
Let’s say you’re a waiter. It’s a very hard job, like one of the hardest jobs you could do if you have a long double shift. I valeted, that shit was brutal, like just running constantly. But it’s a very replaceable job. If you leave, you can find someone else to fill that position. Versus these other jobs, where the “value” that the system assigns to them is because they’re not easily replaceable. You can’t just replace a doctor. You have to do, I dunno how many years of training and education.
So that’s kind of where AI is stepping in, is like these things that typically weren’t easily replaceable, are being replaceable. And that’s where the big opportunity is.
Cherie Hu: Especially those kinds of roles like doctors or lawyers that go through such a long educational pipeline to get there.
Maybe there’s a lack of understanding about the fact that a lot of those kinds of functions are completely replaceable or commodifiable with the right tech. Almost every single lawyer that I’ve spoken to, like they are so excited for this kind of tech to really gain hold because they and their staff and interns are spending so much of their day-to-day doing very grunt busy work and admin work. They are getting paid a lot of money to do that work. So it’ll be interesting to see how the incentives and the day-to-day of being someone like a lawyer or doctor, how those change when you just have the world’s knowledge at your fingertips.
There’s something called BioGPT that’s ingested tons of biology papers, and you can just query it and get to findings from papers in such a short period of time. Just thinking about what I went through in college, in humanities classes — and what it would be like to get to that level of immediate insight, and how that will just change education in general. Yeah. It’s wild to think about.
Alexander Flores: There’s a lot of controversy around that too, right? Like, oh, these models can hallucinate and make up information. And like you can’t just have this model start saying things.
But right now I think there’s real value in these models being accessible to communities where that’s not even an option. Like very rural communities where they don’t have a doctor that knows a lot of stuff. And so even if it’s not the best, it’s like leagues better than what they have access to right now. And so that commodification I think is gonna be very powerful.
And so like music [00:30:00] kind of comes down to the same thing. I feel like there’s a lot of people that would be able to grow the music value pie, but it’s not viable for them to play that game. Especially if we wanna wade into the potential future of the Metaverse, right? So much of those spaces will require music, but I don’t know how viable it’ll be for worlds to be built on the current licensing model. For this new reality to start being built, you’ll need something that’s… yeah, maybe it’s not as good as a real artist, but it’s something that we can start building on and eventually it’ll grow the pie for where music could live, and the amount of opportunities there are for music to be valued.
Cherie Hu: Yeah. I’m thinking of Roblox reaching this huge settlement with the National Music Publishers Association a few years ago. I think they were hit with a $200 million lawsuit. And as part of the settlement, they removed the ability for users to upload their own music to include in Roblox games, with the exception of just a small handful of labels. And like finding information about why that is, is very hard.
I don’t work at Roblox. I’ve never worked at Roblox. But if they can navigate the legal licensing landscape in a shrewd way, they can absolutely come out the other side with some tool that allows users to AI-generate their own music for games. Or like have some sort of AI-generated sound effect framework for whatever games they’re building, and have that music be pretty decent quality. Such that they don’t care if some insert-major-label-artist-here has their music available. Like, no, I’ll just make the music myself. That would be an interesting and very plausible plot twist, I think.
Alexander Flores: That kind of leans into functional music, right?… There’s a limit to the value of that generated music unless you’re creating value around that generative song. Like you make this song and now you start building the experience around that specific song and start accruing value to it by its use, or by its style that you come up with, then that would work. But otherwise it’s just gonna be very functional.
The whole IP argument in music is actually very interesting. In general. I remember I had this idea — I won’t call out who else had a similar idea, but I’m glad I’m not the only one. There’s this obsessive concept of like, “oh, this piece of music is like in my style, and you can’t do anything close to it. I’m gonna pull out a lawsuit against you.” I’ve always wanted to get all the music from, like, all of time and analyze it and just point out that, look, nobody’s being original here. Everyone’s infringing on everybody, if you like, really look at it as a whole. And so let’s all just move on and start thinking of new models to approach this with.
Cherie Hu: Easier said than done. Of course.
Alexander Flores: Oh yeah. I have no idea how you would get that data, or the compute to analyze it. I feel like it might not even exist in the world yet. But it’s just a weird argument that we keep coming back to. This is probably gonna go down not that well, but it feels a lot like the patent trolling system that goes on. I’m more familiar with it because a lot of the courts that they used to run these patent troll cases are in Texas where I’m at right now. Not the best thing for like Texas — I mean we have a lot of not great things for our reputation right now, but that’s something I’m particularly ashamed of.
But yeah, it basically comes down to who has enough money to argue that this little plot of land and the IP landscape is theirs, and that you can’t touch it. And I dunno how much of that is grounded on them being there.
Cherie Hu: So much to dive into with music culture in particular. So already, like the music hit-making machine is a factory of, could be dozens of songwriters and producers that a major-label album goes through by the time it hits the proverbial shelves. Most people will not know about the people behind the scenes, like the songwriters and producers keeping that factory running. They’ll know about the performing artist. There already is a huge gap between the huge crews of creators who are making and writing the music that’s going out, and who’s actually getting at least the public-facing recognition for it. Ideally if publishers and PROs are doing their job, those people behind the scenes are still getting royalties. Like they’re still hopefully getting checks from those songs getting big.
I think what will happen with AI — this may be a cynical way to put it, but I think it’s partially true — is that people won’t know who wrote it, and the people who actually wrote it won’t get their check anyway. Or just, like the fundamental structure of payment and attribution and compensation will just be completely thrown out of the water and will have to be rethought, in terms of how people share in the value, [00:35:00] if there is value around like this music that’s generated. Because I think with these AI tools, it’ll be increasingly hard to police usage of certain kinds of music in the way that you described.
Alexander Flores: Hmm. So are you saying like if I press a button on an AI model, what I’m doing is just self-selecting on what sounds good? Like I am the filter for the actual output of the model, but the output, the model’s the one that’s like generating the music, and I don’t get like a special label for filtering what’s good and like I’m not gonna get compensated as like an “artist” for that. Is that what you’re getting at?
Cherie Hu: I think this goes back to the earlier discussion around training data. A huge trend that’s been happening — the number of interpolations happening in the Billboard Hot 100 has gone up significantly. And it’s because of growing influence from publishers and huge catalog acquisition funds like Hipgnosis and other private equity funds. There’s an increasing push for songs that would hit the Billboard Hot 100 to sample previous songs.
So it’s just like a catalog ouroboros, like a snake eating its tail, like we’re just stuck in this nostalgia cycle of songs interpolating older songs. Like Rodgers and Hammerstein claiming 90% of the publishing on this Ariana Grande song, “7 Rings,” I think. Like, vast, vast majority. So there is this culture of like, okay, if we’re sampling this song that got written before, the songwriters who contributed to that kind of melting pot of references should get compensated.
I think, the way that AI models work — yeah, going back to our earlier discussion about to what extent the training data is actually incorporated into the model — yeah, that is just not gonna work.
Alexander Flores: I’m not gonna say it’s impossible because I know there’s a lot of people experimenting around it, but like, it’s very hard, maybe impossible to get an output and know exactly every part inside the model that helped influence that. And to what percentage.
It’s going back to those Plinko pegs. Like you have all these pegs that you’re shifting around and each piece of training data shifts that peg around a little bit. And sometimes it’s shifting in the wrong direction and you’re having to correct for it. How do you even attribute data to shifting it the wrong direction and having other data shifted more towards the correct direction and by how much, and like how valuable was that shift? What it looks like to interrogate an AI model without actually looking at the weights inside or understanding what’s going on, it’s kind of like trying to give someone an MRI scan by just talking to them.
That’s kinda where we’re at. It’s very hard. And yeah, you can just talk to it a lot and give it like a bunch of inputs and see how those inputs affect the outputs and change the inputs a little bit and see how the output changes a little bit. Yeah, it’s very hard.
But yeah, that gets into artists having a model that they give data to, that then they request to get some sort of compensation for, and this gets back to the original point to where like, yeah, that, that might exist. And there will be certain brands of music, certain artists that have enough brand value to where — this just gets into like the Spotify versus torrenting — like it’ll be easier to just use this licensed model if it’s at a reasonable price. People will be like, oh yeah, if you wanna sound like this person, just use this model. But outside of that, I think within three to five years, the people that are wanting to experiment with the music AI are just gonna come together and build open models where they don’t have to worry about that.
Cherie Hu: I think the value — my hypothesis is that the value of AI-generated music will be determined collaboratively by the person who made it and the person who is consuming it. Assuming those are different people; they could very well be the same.
So for example, a couple years ago I interviewed the CEO of Boomy, Alex Mitchell. Boomy’s one of the quickest-to-use, off-the-shelf music AI generation tools. And he’s very open about how a lot of the songs on the platform are not “good” in the traditional sense. Like they’re not radio- or playlist-ready at all. But people still find meaning in them.
Like one early use case for Boomy was people making diss tracks to use in Discord servers. So like whenever people were in a Discord server and someone lost a game, they sent this very grungy, distorted-sounding track from Boomy and would label it some inside joke or something. And these are a bunch of like 12- to 15-year-olds.
In the traditional industry framework, that has no value, but in this social context it does have a lot of value. Like, these kids are able to express themselves in ways that they really couldn’t before. I don’t think value can be limited to some arbitrary determination around aesthetics, like a song that sounds like this is somehow more [00:40:00] valuable. It’s more like, can the person who made it convince themselves and other people that it is valuable? And do the people who are consuming that music — assuming they’re different from the person who created it — believe that it is valuable?
Season 3 Behind the Scenes: Integrating AI models into Discord
Before diving even deeper into the existential questions about music and AI… I definitely wanna touch upon some of the work that you did designing experiences for our Season 3 research sprint.
For people who are listening, just to context-set a little bit on the scope of things that we did. As I mentioned earlier, we had a legal thread, an ethics thread, and a business model thread. The structure of those was pretty similar to their analogous threads for previous seasons. We set up a thread in our Discord server, with a pretty fluid contribution culture — people could tune in and out, could even lurk on the threads if they wanted to. We would hold weekly calls to manage the project and delegate tasks for the week. And we would also share a ton of resources and links throughout the week asynchronously. Really building this culture of more collaborative learning that Water & Music is all about.
What we did with Season 2 that was different from the previous two reports we’d put out was also get more hands-on with the platforms we were covering. So, in case people didn’t see, we had a whole metaverse meetup thread that led to an output analyzing design choices and critiques around current virtual world platforms, in terms of how people could interact with each other and express themselves socially, with a focus on using music to do that. I think it was just a really cool experience that made our research that much better.
For Season 3, we took that to a whole other level by having several different layers of interactive experiences around AI tools. We held a series of weekly workshops that started out as creative challenges, where every week we’d gather and do a walkthrough tutorial and hands-on office hours around visual AI tools, text AI tools, and then going to music in the last month. For music, because the tools are super new, we collaborated directly with a lot of the companies building these tools in real time. So we worked with Harmonai which is building Dance Diffusion. We worked with Never Before Heard Sounds and Pollinations, and brought them in to do walkthrough tutorials with the community. Those workshops were open to all members, not just to season contributors. So a very open, fluid, interactive experience.
One very critical part early on of getting people excited about AI in general and curious about using these tools was integrating a lot of visual and text AI tools directly into our Discord server, such that people could just call those models up directly with slash commands. We ended up specifically integrating Midjourney, Stable Diffusion, and various wrappers around GPT-3 into our server. We have a whole forum where you can try out a challenge like generating crazy lyrics to a song, or generating an image inspired by an album cover.
I’m curious from the engineering perspective as you were building out these bots, especially the custom ones around GPT-3 — was there any goal that you had in terms of how you wanted to design the experience around the bot or the intended outcomes, especially for people in the Water & Music community who might not have been as familiar with the tech already?
Alexander Flores: Oh yeah, totally. A lot of it’s just making it as simple as possible. Especially like early art-model days, so many of the tutorials were like, oh yeah, go open this Colab notebook. And if you’ve never coded, sure, it’s relatively easy — Colab notebooks are like a Word document where each paragraph is a block of code that you can execute, and so you just kinda hit the play button next to each paragraph to execute the code. And you’re supposed to just run down the page or just execute everything and it should work. But like Python dependency trees, and things like packages being outdated and breaking each other — it’s a nightmare.
And so trying to find a way for people who have no technical experience to have a nice interface to where they can just really just experiment or play around. It’s not even about making mistakes, it’s about poking at what this is and trying to understand what it is. I think that was the biggest goal, just give people time touching the tool to understand what it’s capable of, and maybe what it’s not capable of. And also in a space where they can see other people experimenting.
Actually Midjourney nailed this. I’ve seen Midjourney get so much shit for like, oh, why is it just on Discord? They were able to iterate on their model a lot and not spend a lot of dev work building out this whole other separate platform where you have to go use it. But more than that, their model worked. They have the most populous server on all of Discord, I think 10 million members, which is wild. But what they got for free was public experimentation with people. [00:45:00] Coming in, nobody has any idea of how you even interact with these things. And so being able to see what other people are doing and the results they’re getting helps bootstrap newcomers very quickly into understanding what to do, and like bouncing ideas off of each other and like riffing off each other.
So yeah, trying to simplify that interface as much as possible was the main goal. Midjourney integration was easy enough — you just invite it to your own private server. The other ones took a little bit more integration, because these models aren’t cheap to run. With GPT you just query an API and OpenAI is doing all the compute. We were guided by one of the people from Replicate to integrate their service. What they provide is GPU hosting for running these models.
So I guess, yeah, I should step back a little bit. On a high level, code has to run somewhere, right? If you’re not executing on your machine, someone’s paying to execute it. That’s the whole value prop Google gives, like you give us your data, we’ll give you compute for all your services. AI gets a little trickier because you’re not using your traditional form of compute. You have CPUs and GPUs, and CPUs are very analogous to like the human brain. You throw problems at it and it tries to solve them piece-by-piece.
GPUs — it’s a graphics processing unit. They came out of trying to do video games. Like it’s too expensive to try and sequentially render an image or a screen inside of a game. And so GPU cores are a lot less powerful, but there’s a bunch more of them. So like your average processor right now probably has four cores. Like your average CPU has four pieces of the brain and they can all run things at the same time. Versus your graphics processor, it will have thousands — maybe millions, I don’t wanna say millions, but for sure thousands to tens of thousands — of little tiny processors. But they’re very optimized to only do one thing, which is matrix multiplication, which is to manipulate triangles on a screen and like output things. But there’s a lot of limitations of how GPUs can run and the way you program for them. So those are a more recent thing to be running in the cloud, like to rent a server somewhere. It’s only in the last couple years to where it’s semi-viable to do that.
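As a tiny illustration of why that matters here: nearly everything inside these models boils down to big matrix multiplications, which is exactly the one operation all those small GPU cores are built to do in parallel. A hedged PyTorch sketch (sizes arbitrary):

```python
import torch

# The same large matrix multiplication on CPU and, if one is available, on GPU.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

cpu_result = a @ b  # a handful of general-purpose CPU cores chew through this

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    gpu_result = a_gpu @ b_gpu  # thousands of small cores split the work in parallel
    torch.cuda.synchronize()    # wait for the GPU to actually finish before using the result
```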
But even then, transparently, I run some of our Discord bots on a $5, $10-a-month server that’s just using CPU compute. But if you wanna do anything with AI, you need this separate kind of graphics card, the GPU, that’s very special, and that can easily get to $20 a day on the low end. And so finding people that provide the infrastructure so that you’re not having to rent out a GPU for a month when you’re not using it 24/7 — that’s where Replicate kind of came in. They have the infrastructure where they’re running these models and you can just call them.
We also explored banana.dev, which we were close to getting working, but the quality wasn’t there — like, the timing just didn’t work out. But they also have infrastructure where they spin up GPUs on the fly depending on what your demand is. The cold start times were a little too slow for using on a Discord bot — it would take a minute to start up the GPU — but you could spend way less money than trying to host on your own server.
So that was more specific to the art generative models that required some compute happening somewhere. So like Stable Diffusion specifically, you need to run that model somewhere. OpenAI was just a more direct, straightforward API integration that anybody who has basic programming knowledge could probably figure out how to do by going through the docs.
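For a sense of what that hosted-GPU pattern looks like in practice, here’s a rough sketch using Replicate’s Python client (the model identifier and version hash below are placeholders rather than the exact model we ran, and the prompt is just an example):

```python
import replicate  # pip install replicate; expects REPLICATE_API_TOKEN in the environment

# Replicate spins up the GPU and runs the model; our bot just sends a prompt
# over HTTP and waits for the result.
output = replicate.run(
    "stability-ai/stable-diffusion:VERSION_HASH_PLACEHOLDER",
    input={"prompt": "an album cover for a shoegaze record, film grain, 35mm"},
)

# For image models the client typically returns URLs to the generated files,
# which a Discord bot can then post back into the channel.
print(list(output))
```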
I iterated a lot on the different interfaces for it, because Discord does have a few options as far as how many different fields you could provide it and how much you can abstract away. But it was still very limited. So you could have an open field where you could input any kind of text and then send that to the back end, and that talks to GPT. And so I did have one version, which was just plain, well, hey, you’re gonna send something to GPT and you have to send it with the prompt and everything, kinda like ChatGPT, and then it’ll come back with the response.
And then from there, to make it more relevant to the different challenges we were doing, we basically just took that input and would wrap it behind the scenes with a specific prompt that would guide that input towards a specific task. Like one of those was lyric generation, so let’s have two different input fields in the Discord bot where you could put in what kind of genre you want the song to be, and what topic you want it to be about. And then on the back end, it would take those two inputs and inject them into a prompt that we iterated on until it got decent output, and then it would go make the request and come back.
And just on that basic concept of the two different prompts, the two different inputs, there were several iterations just because there’s limitations on Discord input. So you can have your normal standard [00:50:00] slash-command input, but they also have this like modal input, which I preferred, but the modal input would not allow for a dropdown select menu. In an ideal world, it would’ve been text input for what you want the song to be about, but then a modal dropdown menu for all the different genres that you could have, and having more parameters. If you could specify like, I need the lyrics to be in this pattern, I need this many syllables per line — having those kinds of options would’ve been a lot easier if you could have those kinds of dropdown list interactions.
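As an illustration of that wrap-the-inputs-in-a-hidden-prompt pattern, here’s a minimal, hypothetical sketch (assuming discord.py 2.x and the pre-1.0 openai Python SDK; the command name, prompt wording, and model choice are illustrative, not our production bot):

```python
import os

import discord
from discord import app_commands
import openai  # pre-1.0 SDK, i.e. the openai.Completion interface

openai.api_key = os.environ["OPENAI_API_KEY"]

intents = discord.Intents.default()
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

# The hidden prompt that wraps the user's two simple inputs.
PROMPT_TEMPLATE = (
    "Write song lyrics in the style of {genre} about {topic}. "
    "Use a verse/chorus/verse structure and keep it under 200 words."
)

@tree.command(name="lyrics", description="Generate song lyrics with GPT-3")
@app_commands.describe(genre="Genre of the song", topic="What the song should be about")
async def lyrics(interaction: discord.Interaction, genre: str, topic: str):
    await interaction.response.defer()  # generation can take a few seconds
    completion = openai.Completion.create(
        model="text-davinci-003",
        prompt=PROMPT_TEMPLATE.format(genre=genre, topic=topic),
        max_tokens=400,
    )
    await interaction.followup.send(completion.choices[0].text.strip())

@client.event
async def on_ready():
    await tree.sync()  # register the /lyrics command with Discord

client.run(os.environ["DISCORD_BOT_TOKEN"])
```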
Midjourney gets around this in a more confusing way with their double dash commands, but that’s fine for people that have gotten past the initial experimentation phase. For people that are coming in cold, having more control over how to communicate with these AI models would’ve been nice, but just having them in the Discord in general where people were already at was pretty key.
Cherie Hu: I have a follow-up question for you that’s not season-specific, but it’s a more generalized version of what you were just talking about, which is the interfaces around these models in general.
I was more involved in the business model thread, and a huge takeaway from that thread is that music AI tools are newer in their evolution compared to visual and text AI, such that with the models that are out there, you do still need to know some coding to really be able to work with them effectively. And there’s a huge opportunity to build friendlier UX around audio and music diffusion models.
With visual and text AI tools, there’s a lot of preoccupation around the prompt. There’s a whole cottage industry around prompt engineering. But I’m curious whether it would be friendlier if people didn’t have to write out an entire prompt, but instead could choose from some preset options, for example, and then if they wanted to customize, they could add additional layers on top.
Do you have any general stance on how the interfaces around these AI models and tools could improve to make them easier to use? Whether that’s making prompting easier, or just moving away from that entirely and having some different interface around it. I don’t know if you have any stance on that and where that might go.
Alexander Flores: Yeah, I mean, just having layers of opinions I feel would be helpful. Don’t be afraid to have an opinion. That’s what a dropdown menu is, right? It’s: here’s my opinion on some good defaults that you should experiment with.
But yeah, this gets into good API design, or good abstraction design. Have a high-level version, but then also have layers where you can dig deeper and get more custom on each. Always having that fallback of just raw input if you know what you’re doing. I think we’ll have a pretty fun wave of different interaction modes for the media that we’re consuming as far as what the AI can do with it. But yeah, not being afraid to have strong opinions on how to guide people to at least start using it.
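A toy sketch of that layered design, with made-up preset names, just to show the shape of the idea — opinionated defaults at the top, a raw-input escape hatch underneath:

```python
# Hypothetical presets; the point is the layering, not these particular strings.
GENRE_PRESETS = {
    "lo-fi": "a mellow lo-fi hip hop beat, dusty drums, warm tape saturation",
    "synthwave": "an 80s synthwave track, analog pads, driving arpeggios",
    "ambient": "a slowly evolving ambient piece, long reverb tails, no drums",
}

def build_prompt(genre: str = "lo-fi", mood: str | None = None,
                 raw_prompt: str | None = None) -> str:
    """Top layer: pick an opinionated preset and optionally tweak it.
    Bottom layer: pass raw_prompt to bypass the presets entirely."""
    if raw_prompt is not None:
        return raw_prompt
    prompt = GENRE_PRESETS[genre]
    if mood:
        prompt += f", {mood} mood"
    return prompt

print(build_prompt())                               # sensible default
print(build_prompt("synthwave", mood="nostalgic"))  # one layer deeper
print(build_prompt(raw_prompt="glitchy IDM, 170 bpm, broken drum machine"))  # full control
```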
Cherie Hu: I think that leads to honing in on the specific use case that you’re building for. The opinion around what kind of template or output would be good for, like, copywriting or social marketing copy is very different from like, what would be helpful for longer-form creative writing. And we’re already seeing really interesting differences in tooling around each of those specific use cases. So just keeping that in mind.
Alexander Flores: I’m realizing now this actually isn’t even like original insight. This is very basic — make it easy for your customers to feel good.
This is what Instagram did really well, like pre-Facebook. Early Instagram was like, oh, hey, I took this really shitty two-megapixel image on my really old phone that had a horrible camera — make it easy for those pictures to look good through the act of filtering it. Or like really good drawing applications — turns out that they’re incredibly simple, and a lot of engineering work goes into the dynamics of the brush and how you use it so that it looks fluid and not jittery, so that it’s easy for the user to make good output, optimizing that path to where the user has a delightful experience versus fighting all the different parameters and all the different things that they could change. But also striking a balance where you do allow for that change to happen.
That’s actually like Apple. Apple’s good at the opinionated design part. They’re not good at letting people change what they wanna change about it.
Cherie Hu: Going back to the workshops that we ran. As I mentioned earlier, we sequenced the workshops to start with visual art, and then go to text and then go to music. That was kind of most to least easily accessible in terms of the ability to play around with those models. I believe you led the workshop we did on ChatGPT, and I remember that workshop [00:55:00] definitely ran over. It was a little bit of hands-on experimentation, but a lot more general existential angst. I think one of the prompts was generating a template for a music license or something. Something that we’ve talked a lot about was using ChatGPT in a smart way to generate an actionable artist strategy and plan.
I’m curious whether from that workshop specifically, or like in your subsequent explorations in the following months — were there any takeaways for you about that tool specifically in terms of understanding everyday industry pros’ relationships with that tech, or the range in which it could be applied across the industry? Any takeaways there?
Alexander Flores: Yeah, I feel like the obvious one is that AI is gonna hit the industry not through audio models, but through these other forms of AI. How much of the industry is built through human interaction and negotiation and like cold outreach?
I remember I had planned a bunch of demos. I had pre-cleared these ideas to make sure they would work in ChatGPT, and then we could demonstrate it to the community and ask questions and get feedback and iterate together on these prompts.
Towards the back half of the workshop, I had set up importing an artist contract that you would sign within a label, and asking ChatGPT to highlight any risks associated with this contract for both sides of the party. But someone in the community like 10 minutes into the workshop was like, yo, can we do this? It’s pretty apparent that there’s a strong demand for just understanding things, and having a second opinion on all these different documents that you’ll have to navigate as you enter the music industry.
That seems like the most directly impactful use case right now, because so much of the industry and everything we do is just centered around communication, and that communication can now be augmented through these language models.
Cherie Hu: So fascinating. A big takeaway from our AI survey that we ran, which had 150+ responses from mostly artists and producers, was that the number one use case mentioned was marketing and promotion. It was not music creation … A lot of the jobs that are facing the most existential crisis head-on are in very documentation- and communication-heavy work. Like, artists don't wanna care that much about social media, and any tool that can allow them to spend less time on that and more time doing really interesting creative work and manifesting creative ideas on the music side — that's still like the core of why they're in the industry.
So, yeah, given where large language models are at, it's not just the most in-demand use case, but the most plausible to implement right now.
Alexander Flores: I would say even beyond language models… There’s this trap of, like, the goal that we should all be reaching for right now is audio generation. There’s so much more to explore.
I touched on it briefly as far as upscaling audio, so that anybody anywhere on a phone can record vocals or do things that used to require a lot of infrastructure or require deals so that you could get studio time. That’s one aspect of it, but there are so many other things.
I feel like it's known that most of the money that an artist can make is going on tour or doing live performance. And that aspect of where AI comes into that space hasn't really been talked about as much — which is wild, because if that's where most of the money is for these artists, then why isn't that a stronger point of conversation? Like, it should be easier to put on a live performance. And if AI is able to commodify different aspects of that — especially if you want, like, light shows, for example. In order to have a decent light show, you have to have the guy that's running the board and understanding the venue and the lights that are available there. There should be a way where the board talks to a controller box that talks to a camera. You put it in the middle of the room, have the board sweep every one of the lights across all of its values, like what it can do. The camera can just sense the capabilities of the room, and then you map a show to that room and just lower that barrier. If everything centers around the value of music coming from an emotional experience, then we should be commodifying everything around it so that it's easier to create those emotional experiences. I feel like that should be the core framing, and seeing where that can apply throughout the industry, versus, oh, let's just create sound. Like, that's not super helpful, [01:00:00] or it's a very narrow view of how AI can help the industry.
Season 3 Behind the Scenes: AI News Bulletin
Cherie Hu: I had one more question related to some tooling and experiences around the season. In the later third or so of the season, we launched something called an AI news bulletin, which had nothing to do with integrating AI as a creative tool into the server. It's just a new read-only channel where it's mostly you and myself for now curating critical AI news as it's coming in from Twitter, and acting as a filter, so that when you go to that bulletin, whatever you read is just the critical stuff about the landscape that you need to know now if you work in the music and entertainment industries. The filter is very practical, immediately applicable updates around new tools that can help with a wide range of tasks in the music industry — so we don't just curate audio, we curate text and visual AI tools a lot of the time, legal updates as they happen, and important context on the state of the business.
And I know that you, Alex, had worked a lot to set up that bot. How that channel works is, behind the scenes, Alex built a custom Twitter posting and embed bot, and we use a slash command in a private channel to copy and paste a relevant tweet, decide which channel to point it to (in this case the AI news bulletin), and then we can add our own curator note, which we do a lot of the time. And with the embed itself, I know there was a ton of back-and-forth trying to get the formatting right, let alone making sure it included whatever information or context people in our server would need to get the most out of each tweet.
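As a rough illustration of that workflow, here is a minimal sketch of a curation slash command built with the discord.py library. It is an assumption-laden reconstruction, not Water & Music's actual bot: the command name, channel ID, note field, and token are placeholders, and the real bot may be written in a different language entirely.

```python
# Hypothetical sketch of a /curate slash command with discord.py 2.x.
# Paste a tweet URL, point it at a bulletin channel, add an optional curator note.
import discord
from discord import app_commands

intents = discord.Intents.default()
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

AI_BULLETIN_CHANNEL_ID = 123456789  # placeholder channel ID

@tree.command(name="curate", description="Post a curated tweet to a bulletin channel")
async def curate(interaction: discord.Interaction, tweet_url: str, note: str = ""):
    channel = client.get_channel(AI_BULLETIN_CHANNEL_ID)
    embed = discord.Embed(description=f"[Original tweet]({tweet_url})")
    if note:
        embed.add_field(name="Curator note", value=note, inline=False)
    await channel.send(embed=embed)
    # Acknowledge privately so the slash command doesn't clutter the private channel.
    await interaction.response.send_message("Curated ✅", ephemeral=True)

@client.event
async def on_ready():
    # Register the slash command with Discord on startup.
    await tree.sync()

client.run("BOT_TOKEN")  # placeholder token
```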
I’m curious if there are any general takeaways for you. Well, I guess two questions for you. One is your motivation to build this kind of tool for the community if Twitter already exists and Twitter lists already exist, and then two, any takeaways from the development perspective in building that out and seeing how people have interacted with it.
Alexander Flores: Yeah. I mean, initial takeaway just before we jump into the why is that Twitter’s API is hot garbage.
Cherie Hu: Yeah. We have to pay to use it now. Thanks, Elon.
Alexander Flores: Yeah. My hope is that it’s pay-to-use because they’re trying to shut it down and rework it. But that’s a big dream.
Cherie Hu: Wait, your dream is that they shut it down?
Alexander Flores: I would love for them to just rework it from scratch. That might require them limiting access and then charging enough to where it’s worth it for them to sustain it while they rework it. But whether that reworking plan is on the table, I have no idea.
Cherie Hu: We’ll see.
Alexander Flores: Yeah. For the why, I mean, a lot of these look very similar to your normal Twitter embed, but what I noticed with sharing Twitter links is that it's usually not enough to communicate the value of something. A lot of times people have to click through, or even just simple stuff like how Twitter's link shortener gets linked in Discord instead of the actual URL, so if that tweet gets deleted, then that link is gone. So the main exercise was actually just kind of recreating Discord's Twitter embed technology: you create a bot that makes an embed message, format it the right way, and hyperlink the right text to the right thing, extracting that information from the Twitter API. Right now it's mostly a single call. I probably wanna rework it to do multiple calls so that the code's cleaner, because a lot of the conversations that are worth highlighting usually incorporate a quote tweet or something, and your normal Discord embed's not gonna display that.
And so the bot was structured to where it would also walk the data structure and embed the quoted tweet. Embedding polls was something that had to be figured out. Funnily enough, I was using GitHub Copilot to build this. It's very good at generating code for things that there's a lot of code out there for. Like, a lot of people write Discord bots, so there are a lot of references and a lot of common patterns, and so it was able to infer from my code what I was trying to do. Sometimes I would just be like, let's see what you wanna do, just autofill what it was recommending, and then go back and tweak it.
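Here is a sketch of the quote-tweet part specifically: turning already-fetched tweet data into a Discord embed and walking one level down into a quoted tweet. The `tweet` dictionary shape is invented for illustration; the actual fields would come from however the Twitter API response is mapped.

```python
# Sketch of converting fetched tweet data into a richer Discord embed.
# The dict keys ("text", "url", "author", "quoted_tweet") are hypothetical.
import discord

def tweet_to_embed(tweet: dict) -> discord.Embed:
    """Build an embed with the full text, author, and the real (non-t.co) status URL."""
    embed = discord.Embed(
        title=f'@{tweet["author"]}',
        url=tweet["url"],          # link the actual status URL, not the shortener
        description=tweet["text"],
    )
    # Walk the data structure: if this tweet quotes another one,
    # surface the quoted tweet inline instead of dropping it.
    quoted = tweet.get("quoted_tweet")
    if quoted:
        embed.add_field(
            name=f'Quoting @{quoted["author"]}',
            value=quoted["text"][:1024],  # Discord caps field values at 1024 chars
            inline=False,
        )
    return embed
```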
But yeah, the main motivation was to have a more rich version of the Twitter embed that properly captures the information that we’re trying to share. And then even behind the scenes, the work that went into that was curating a private Twitter list that we referenced that has a bunch of people in the industry, and like walking Twitter and walking the Twitter suggestions and pulling from people I was already following, and like just compiling this massive list that we could sift through. There’s a decent amount of work there.
I do think it'll be interesting if, in the future, we have enough references of the raw list versus what we're curating from it, and we can build an AI that automatically highlights a suggested-to-post feed, and then curators can just go in and click to approve.
But yeah, the general infrastructure’s built so that ideally it’s not just Twitter and we’ll be able to [01:05:00] highlight specific articles. And also be able to curate messages from the community that get shared inside our server where you can properly show a reference of who was saying what.
So yeah, right now it’s pretty simple but still valuable, which is just Twitter with an augmented version of a Discord embed that provides the context necessary to have a proper understanding of why we were trying to share something. Hopefully it turns into something a lot more and gives the community tools to help self-curate information, which I think is very key.
Cherie Hu: Yeah. That’s very much part of the core ethos since three years ago when the Water & Music community and Discord server first launched, which was creating a culture of collaborative curation. I think this AI news bulletin is just the two of us right now, but to your point, people are already sharing Twitter links across a ton of different Discord channels. Not just on the AI side, but #web3 remains the top channel in the server, and Twitter is the main gathering space for that community also, at least publicly.
So I was trying to think through cleaner ways to synthesize and represent that, and also highlight who those curators are in a more systematic way.
Alexander Flores: That’s not an easy problem. Like even between us, we’ve had lengthy discussions on whether something should be curated or not and why, and I feel like we’re usually on the same page about a lot of things as far as what’s important. So even though we work together a lot and have a shared understanding of the world, if it’s hard for us to come to an agreement, that’s a very interesting problem to try and solve at a community level.
It's been a pretty fun exercise, not just at the technical level, but in thinking through how a community can curate and who gets to have that say.
Cherie Hu: This is definitely outside of Season 3 but very relevant to Water & Music, which is the challenge of scaling voice. This gets to a kind of closing set of questions or considerations I wanna touch on. But on your last point, I completely agree that it was very necessary for just the two of us to be the first people to start as the curators, and not have it be completely distributed among everyone who's in the season. It took a good couple weeks to start to establish some shared, consistent criteria that we kind of knew were common knowledge without having to speak them out loud, in terms of what is or is not valuable, or what tone of voice in a given tweet is valuable, like whether something is too technical or not.
And even adding, like, two more curators into the picture will, I think, make that coordination that much more challenging. Coordination around having a shared voice and narrative and perspective on what we're researching has never been easy. I think there's a reason why editing, especially for large-scale collaborative projects, is consistently one of the biggest challenges in our research process. Not just because we don't have as many editors, but also because that work of trying to align 10 different people who may have touched a Google Doc around some shared understanding of who we're speaking to and why, and what stylistic tendencies to lean towards in writing, is just very hard. It's a lot to coordinate, and it has a little bit to do with tech, but a lot to do with just redundant communication and back-and-forth.
So yeah, I kind of see a parallel there, with self-organized curation being a really cool goal and a goal that we do have. It just has to be built over time through these very small-scale experiments that allow you to establish a voice. Like, there's no shortcut.
Alexander Flores: Yeah. Language models kind of play an interesting role in all this. A kind of dystopian future I can definitely see happening is one where a lot of our communication is parsed through language models, to where the information we're trying to communicate gets transformed into the most approachable form for the listener to understand. That gets into the social norms within different communities, and having people be able to step into them fluidly. It sounds useful, but also kind of dystopian in the fact that nobody's actually speaking to each other. They're speaking through this filter. I could see that being useful for unifying voice, but then also… yeah, it's a weird space to be exploring right now.
Cherie Hu: Definitely weird.
Water & Music’s early AI strategy
This gets into general closing questions, and circles back to my first question to you in this episode about what you're working on within Water & Music: just thinking through how to operationalize a generative AI strategy or workflow for a media company or a research organization.
This is kind of related to how AI will have to be implemented at a lot of music companies: it's not gonna start with music creation, it's gonna start with all the activities around it. There's some parallel to Water & Music in terms of how at least you and I use AI tools in our day-to-days. And I think it might be good to [01:10:00] give people a preview of that.
So you mentioned that you use Copilot. Do you use it on a daily basis at this point?
Alexander Flores: Every time I touch code. I wouldn't say that's every day, but a lot throughout the week.
Cherie Hu: Cool. In what specific ways has it changed your workflow? Has it just reduced the time to get to a certain endpoint? Or are there other ways that it’s changed?
Alexander Flores: It depends on the program that you're writing, because it's based on the dataset and what knowledge it has. There are different layers. A lot of writing code, and I don't know how common knowledge this is, is actually not writing code. Famously, one of the early text editors that programmers used was called Vim. And it's notoriously difficult to use the first time you run it; you don't even know how to exit the program. You're just lost. Because its default mode isn't like any other text editor — it's built around navigation and editing, not writing code. You have to enter a different keyboard mode so that you can actually start inputting text. So a lot of programming is not just straight inputting text, it's manipulating code or trying to edit mistakes you've made.
With Copilot, some of it's autocorrecting, like autofilling. It gives you a hint at best practices, as far as the examples it has in its dataset. So if you're starting to learn TypeScript, which is like an augmented version of JavaScript, then you might not know how to structure a specific thing, and you just kind of start typing and it'll give you a pretty decent recommendation, and you can go with it and see if it works. That's versus trying to learn something on your own, where you'd have to go dig through documentation and ask a bunch of questions and find somebody.
It's actually been pretty helpful for me, because we're not part of a big company with a big dev team that has a lot of experience, where I'd have somebody I could just reach out to and ping. Having these other resources of information is pretty useful.
And then ChatGPT is useful in other areas, because you can query it instead of sifting through documentation. You can go to that next level of, all right, I wanna build this specific thing — what would that look like? And it can get you close if you're trying to build something that's pretty commonly built, and then it'll recommend some parts of code. And I'll also ask, like, wait, explain why you're doing this, and it'll give really practical reasons. It's a much quicker learning cycle for programming, because programming can be difficult as far as constantly having to learn new things and integrate new things. It's not a very static thing, unless you're one of those OG Fortran developers that's getting paid buckets 'cause you're the only person who knows how to do it. Any of those super old languages that infrastructure's built off of.
But yeah, so that's as far as how I've integrated it into my workflow. This hasn't been directly relevant to music stuff, but there's also been some personal exploration of understanding shaders. It's actually really good at shaders. Like I mentioned earlier, graphics programming is very different and has a completely different paradigm as far as how data is moved around and how you manage state and memory, and even values that you're trying to reference later in the program. It's a decent learning hurdle to understand how to write shaders well. So being able to go to a shader on Shadertoy, highlight code, and be like, can you explain what's going on here? It's super useful, and it's actually scary good.
I think a lot of people are trying to push ChatGPT like, "oh, can it generate code? Oh no, it doesn't work." No, prompt it better and narrow the scope of what you're trying to do, and it has enough information that it's still very valuable.
Cherie Hu: It’s interesting to compare and contrast that to how I use GPT in my day-to-day. So I can maybe walk through that and then can zoom out and give a preview of how we’re thinking of using it strategically across Water & Music as an organization. It’s still very early on, but I think it’ll be good to talk through ideas for that.
So, I do a lot with words. Lot of writing emails, writing articles, editing articles, reading articles and papers, trying to summarize them. General note-taking. Just like communicating and doing things with words being like the core of so many industries, especially in culture and media.
GPT has really helped me with very specific functions like writing and research, and it has saved some time. Like, as of recording this, we're on deadline to finish copy for our Season 3 report, and AI has been very helpful in getting through the last mile on some things. For example, I am a very verbose and long-winded writer by default. You know this very [01:15:00] well, Alex. If I had no restrictions, an article I wrote would probably be 30,000 words long and involve lots of going down various rabbit holes, coming back, going down whole other tangents, coming back. Obviously, for a time-strapped music industry audience that just wants critical information to be smarter and do their jobs better, that's not the best combination. So I use it every day in whatever writing I do, to see if I can communicate something a lot more clearly with a much stricter limitation on word count or character count.
We have a new Starter Pack series that we launched as part of the relaunch of the free tier for our newsletter. Each issue in that series breaks down a fundamental concept in music and tech, and contextualizes those concepts against specific findings from our research projects. To put those issues together I have to, as succinctly as possible in 200, 300 words, summarize the key findings from very longform research projects that we’ve released previously. Those Starter Pack issues would not exist without ChatGPT.
A more recent thing is that ChatGPT has actually helped me find language to articulate my own style. Something that we're trying to do in the coming months is revamp our early style guidance at Water & Music, because there are now so many people writing for us. Me and the few other editors on staff can verbally express why we made a certain editing decision for a certain article. We can definitely do a better job at generalizing that feedback into a single document that we can give writers as they're working on an article, to better prep them for what's to come from the editing standpoint.
And so I actually copied and pasted a newsletter that I wrote into ChatGPT and asked it to describe the style. I definitely plan on taking some of that output and incorporating it into our style guide. Not to have it as a substitute for how I am thinking about Water & Music’s style at large, but to have it be just like an external brainstorming assistant.
To sum up that ramble and then turn it back to you, talk through some ideas for how we’re thinking of implementing this across the org… There are also a lot of research-specific tools like Explainpaper, Symbiotic, Metaphor, Perplexity. The function for those is often like allowing you to upload a document, usually a PDF, and ask it specific questions and it’ll spit out an answer for you with decent accuracy.
So how I'll use that suite of tools that use GPT, like ChatGPT and all those research-specific GPT-powered tools, is for brainstorming ideas, which you can do much more efficiently with AI than my currently limited brain can. Then information synthesis, especially synthesizing large swaths of research reports, whitepapers, and interview transcripts. And then editing copy for tone. A lot of people in social media marketing already use GPT to pretty powerful effect in this regard, like using GPT to generate the same social copy across different platforms in record time while accounting for the voice differences of each platform. Like, "can you edit this brain dump I wrote at 3:00 AM and rewrite it in paragraph form in a way that has a more professional and analytical tone and is speaking to an industry audience that is not as familiar with AI" — that's an example of a prompt that I've definitely used.
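As a small illustration of that editing-for-tone workflow, here is a hypothetical Python helper that wraps a rough draft in a reusable prompt template. The function name and fields are invented; the point is simply that the audience, tone, and length constraints get stated explicitly every time rather than retyped from scratch.

```python
# Hypothetical helper for the "edit for tone" workflow described above.
def tone_edit_prompt(draft: str, audience: str, tone: str, max_words: int) -> str:
    """Build a rewrite prompt with explicit audience, tone, and length constraints."""
    return (
        f"Rewrite the following draft in a {tone} tone for {audience}. "
        f"Keep it under {max_words} words and in paragraph form.\n\n"
        f"Draft:\n{draft}"
    )

prompt = tone_edit_prompt(
    draft="3am brain dump about why AI marketing tools matter...",
    audience="a music-industry audience not yet familiar with AI",
    tone="professional, analytical",
    max_words=250,
)
# Paste `prompt` into ChatGPT, or send it through whatever API client you use.
```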
So brainstorming, synthesis, and then editing for tone of voice. We’re still early in implementing that at a strategic level across the org. But it’s something I would like to do, to save a lot of time and just allow us to do better research at a lot of different stages of the process, especially at the brainstorming phase and at the editing phase.
I’m curious what else comes to mind specifically for you, Alex, in terms of tools and prototypes that you’re looking into building as well.
Alexander Flores: The tone of voice I think is doable. I'll have to see how doable, as far as building a tool where, hey, we have somebody write this, and then it can map it to the Water & Music voice. I know it's technically possible. I don't know if it's cost-effectively possible, and what the process is to train a model and all that.
So that’s more research I gotta look into, but that’s less existential. I think our main existential problem is more around synthesis, and I would add context. [01:20:00] We have a lot of information within our community, either structured through our past articles or in the conversations that happen in our Discord server, and being able to pull that information out is very critical to do soon. It’s interesting how the same conversations will bubble up and people will be new to the community and say, “Hey, I have this idea.” And it’s not that we wanna shut down the excitement, but it’s like, yeah, we had this conversation a long time ago, and it’d be valuable for everyone if it was easier to surface those conversations and understand what’s been discussed and where the current state of that conversation is.
There’s actually work happening in academia around this by this guy named Joel Chan. He wrote this extension for this tool called Roam which is like a note-taking, mind-mapping thing. His main thesis is that academia is stalling right now because it’s so hard to understand the information that is inside each one of these scientific papers. You get a paper, it’s like a five-page PDF document that’s extremely dry, and the information’s very deep inside of it. You can read the abstract, get an idea of what’s going on, but like the actual key nuggets of insight are somewhere in this two-column, tiny-font thing that you have to dig through. That process of trying to understand what is known in the community at large is adding a lot of friction and is slowing down progress. And so his tool was like trying to find a way where like people are annotating and extracting that information and being able to build a graph to where you can see the specific claims that we’re trying to make in science, and all the supporting evidence referenced across all these different places.
Now with language models, I think that’s a much more viable thing to build out — like these smaller community-level layers where here’s all the conversations we’ve had, we can extract information from those conversations and kind of have this space to make it easier to follow the conversations that are happening, and also surface references to past topics that have been discussed, so that there’s a more unified understanding of what’s going on.
As far as how, it gets interesting, because I've been kind of thinking about it like signal processing, and trying to adapt things that are known to work there. So even just for making the right prompt to summarize a chunk of text: right now there's a limit on how much text you can put into GPT. I think it's 4,000 tokens or something. And that 4,000 has to include the text you're putting in and the text you expect it to spit out. It all has to fit into that window. And so if you just chunk up your document straight up, you'll summarize a piece and it won't have the context of what happened before it. And so I've been thinking through different ways of how you efficiently hold context as you're trying to summarize different chunks. There are different techniques, like using a sliding window to analyze the text versus splitting it up into discrete windows that don't overlap.
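A minimal sketch of that sliding-window idea, assuming tiktoken's cl100k_base encoding for token counting; the chunk and overlap sizes are arbitrary placeholders, and `summarize` stands in for whatever model call you would actually use.

```python
# Overlapping ("sliding window") chunking so each chunk carries some context
# from the previous one, versus cutting the text into disjoint windows.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def sliding_chunks(text: str, chunk_tokens: int = 3000, overlap_tokens: int = 300):
    """Yield overlapping chunks of `text`, each at most `chunk_tokens` long."""
    tokens = enc.encode(text)
    step = chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        yield enc.decode(tokens[start:start + chunk_tokens])
        if start + chunk_tokens >= len(tokens):
            break

# Usage: feed each chunk to the model, optionally prepending the previous
# chunk's summary so context is carried forward between windows.
# for chunk in sliding_chunks(transcript_text):
#     summary = summarize(chunk)   # `summarize` is a placeholder LLM call
```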
I was inspired by how the Hubble telescope works, and how those beautiful images that you see from, like, James Webb and Hubble are made. Those colors aren't actually real. They're artistic interpretations of a bunch of different wavelengths that have been processed and each contain their own amount of information, and then they map that information to the visible color spectrum that we can interpret. And so, understanding that text probably has a lot of information beyond the surface level, like there's a bunch of different layers of information in text. And so you'd have separate, distinct passes over the text that extract things: extract entities, extract sentiment, extract relationships between who's talking about what, extract different terms — having different layers of processing for the same piece of text, and then having some sort of abstract representation of the information in that. And then eventually you get that synthesized down, either into a summary or into a graph form where you can relate all these ideas together.
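A rough sketch of what those separate passes could look like in practice. The pass names, prompts, and the `llm` placeholder are all assumptions for illustration, not a working pipeline.

```python
# Run the same chunk of text through several narrow extraction prompts,
# then synthesize the layers at the end -- like stacking telescope filters.

def llm(prompt: str) -> str:
    """Placeholder for whatever completion call you're actually using."""
    raise NotImplementedError("plug in your model call here")

PASSES = {
    "entities": "List the people, companies, and tools mentioned:\n\n{text}",
    "sentiment": "Describe the overall sentiment and tone:\n\n{text}",
    "claims": "Extract the key claims being made, one per line:\n\n{text}",
    "relationships": "Who is talking about what, and how do the topics relate?\n\n{text}",
}

def layered_extract(text: str) -> dict:
    """One narrow pass per information 'wavelength' over the same piece of text."""
    return {name: llm(prompt.format(text=text)) for name, prompt in PASSES.items()}

def synthesize(layers: dict) -> str:
    """Collapse the layers into a summary (or feed them into a graph instead)."""
    combined = "\n\n".join(f"{name.upper()}:\n{result}" for name, result in layers.items())
    return llm(f"Synthesize these extractions into a short summary:\n\n{combined}")
```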
Cherie Hu: Got it. So fascinating. Yeah. To the point of context being the most existential thing… How do I break this down? Okay I’ll break it down with a Water & Music example and a music-industry example.
With Water & Music, there are so many different forms of context that are just like living in people’s heads. Context on the history of the company mostly lives in my head, in the heads of maybe longtime members, definitely the most active Discord members who are actively shaping that context in the sense of shaping conversation, and then contributors to our seasons. That’s a very tiny percentage of the community as a whole.
And then there’s like [01:25:00] industry context on different ways that people process information or behave or react or do deals in different markets with different kinds of companies or platforms, that just lives in everybody’s heads. Or a lot of it is only communicated through word-of-mouth or negotiation as you mentioned earlier. That’s not on paper.
And so this goes to the industry example that you and I have debated a little bit about, which is: can ChatGPT ever get to the point of writing out a good artist strategy? I think we've tried a bunch of different prompts, and the output was always very generic to me. We had a specific prompt of, okay, I am a Latin American underground experimental electronic artist who built a tight-knit following on SoundCloud, and I am about to release my debut EP or my debut album. And within x number of months of releasing the album, I would like to sign a deal with X label, because I think that's important for my current marketing goals, which are to grow to X audience size, or to reach these markets. But even when we gave that context and asked ChatGPT what to do, the output was very generic, even after iterating or asking it to expand on certain points. It just felt like it was missing industry context that I knew was sitting in my head, and was sitting in the heads of Latin American music-industry professionals, and I don't know if that sits anywhere in a way that an AI would be able to sufficiently train on to output something that's actually helpful or unique in a strategic sense.
Do you think we will get to the point where that changes? Or is there maybe always gonna be some kind of moat where, say, a strategist can come in and be like, oh, I know AI can do X, Y, Z in terms of generating ideas or content for me, but it can’t touch this function of strategy, and that’s where I will still have an advantage or still be the main resource.
Alexander Flores: I remember specifically that example you’re walking through. If you drill down and probe it enough, I think it did get to some interesting suggestions, but it required not just hoping that the first output or the second or third output was sufficient. It’s like, all right, here’s this one point out of 10 suggestions — let’s drill down into that point and start having a very in-depth conversation about how to execute on this one specific point.
And I think it did give you some useful information, especially if we’re going back to what I was talking about with AI doctors. Sure, it might not be the best in America and there’s better care to be had in America, but for people who have no fucking clue and have no access to those resources, it’s way better than nothing. It’s way better than traditionally what you would have access to. And so I thought it did give a decent, broad understanding of even just the things you should be thinking about.
As far as the context inside your head, this gets way more theoretical… Eventually you're gonna start communicating that information over text. And if there's a culture of people wanting to make private databases that are augmented with these natural language models, to where that pairing becomes a tool and a reference and a bot that you can interact with, then I think that will happen, and eventually somebody's gonna decide, hey, I'm in the industry, I have all this knowledge, let me productize that into this thing. That's one level — it's very directly trying to extract that knowledge from the industry professionals into these models.
But then on a more abstract level, these concepts aren’t new. They’re not unique to the music industry, as far as negotiations and how you navigate these different problems. One of the things that Joel Chan was exploring is trying to map these problems and concepts that come up in different industries, where something that’s explored in one fashion here can be mapped into the same problem space in a different area, and understanding what maybe one side has learned.
Yeah, actually this is a general concept that happens in training AI on a meta level — you don’t have enough data in this one domain, but you have a lot of data in this other domain. And so you train up the AI in the one that you have enough data for, and then you fine-tune it for the one that you don’t. And so it’s trying to transfer a lot of its knowledge between the two and it works to varying degrees. So long-term, even if that information doesn’t disseminate, I think there will be systems built up to try and help people navigate it by referencing parallel problems in other industries that are more well-documented.
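As a toy illustration of that transfer idea, here is a minimal PyTorch sketch: train a small network on a data-rich domain, then freeze its body and fine-tune only the head on a scarce domain. The shapes and random data are stand-ins, not a real recipe.

```python
# Toy transfer-learning sketch: pretrain where data is plentiful,
# then reuse the learned body and fine-tune only the head where data is scarce.
import torch
import torch.nn as nn

body = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU())
head = nn.Linear(128, 2)
model = nn.Sequential(body, head)
loss_fn = nn.CrossEntropyLoss()

# --- Phase 1: "pretrain" on the domain with lots of data ---
big_x, big_y = torch.randn(1000, 64), torch.randint(0, 2, (1000,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(model(big_x), big_y)
    loss.backward()
    opt.step()

# --- Phase 2: fine-tune on the scarce domain, keeping the transferred body fixed ---
for p in body.parameters():
    p.requires_grad = False
small_x, small_y = torch.randn(50, 64), torch.randint(0, 2, (50,))
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(model(small_x), small_y)
    loss.backward()
    opt.step()
```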
Cherie Hu: [01:30:00] Even independent of AI, that’s a problem the industry has. Institutional memory is the song I will keep singing until I die. And institutional memory as a high-quality source or input into AI models that can actually help with things like strategy — having that info layer first is critical. We’ll see if the industry gets to a point of recording that information systematically, regardless of what happens with it on the tech side.
Alexander Flores: M e t a d a t a!
Cherie Hu: Exactly.
Dang, okay. We definitely covered a lot of ground, both practical and very existential and theoretical. If you’re still listening, thank you. Hope this gave you a really interesting and helpful window into how we structured our season, and maybe gave you some ideas for how to start playing around with these tools yourselves. Especially if you’re looking to incorporate AI strategically into your brand or your company, hopefully some of our long-winded thoughts have been helpful in terms of offering potential frameworks to think about where it is or is not useful, at least right now.
Alex, do you have any last parting thoughts to people listening? And also if people are interested in diving more into any of the ideas that we talked about, where can they find you?
Alexander Flores: Yeah, I mean, my DMs are open on Twitter. What even is my handle? I think it’s ajflores1604. That’s probably the easiest public way to reach me. My email is alex@waterandmusic.com, I’m also available there.
Parting wisdom? Yeah, I don’t know. Things are changing so fast. I don’t wanna say try and keep up cause that’s a very daunting task for people. I almost wanna say, “good luck.”
Cherie Hu: I mean, that could be encouraging to certain people.
Something that I very much believe is true — I don't wanna say every artist has to get on the bandwagon, cuz yeah, there's definitely a fault in taking that kind of normative stance. But I do think that if you want to have an especially deep level of understanding of the tech, and a say in how it evolves ethically and legally, and to be able to work directly with the developers who are shaping the future of the landscape, there is no better time than now. The design space and opportunities for experimentation are just so open right now, both in music and also in text. Probably six months from now there will be more codified ways of doing things that will have stemmed out of the experimental work happening right now.
So that’s generally the feeling that I have and like why we chose this topic for Season 3, and why I’m excited to build with it myself as a writer, editor, and strategist. I feel like now is an especially ripe time to experiment, because there’s just so much flexibility.
Alexander Flores: Rewording what you said, I think we’re establishing the cultural norms now, and I feel like that’s stronger than going in and trying to codify it in law. I’m thinking about programming and how we’re actively building those up right now in the AI space. And so yeah, just joining that conversation and helping establish those, I think is very valuable.
This isn't even advice, but just so that people aren't surprised: I think everyone's gonna be wrong in two ways. Things are gonna change more than they expected, and also some things are gonna stay the same more than they expected. I feel like I'm not saying much with that, but I think it is worth pointing out that if you can figure out what those are, then that's kind of your edge. What is actually ready to be shifted and disrupted, versus what people are merely saying is gonna be disrupted, but also understanding what won't.
Cherie Hu: Thank you so much for your time. And yeah, feel free to reach out to either of us if you’re listening, if you have any general feedback on the season or want to hear more about Water & Music. We’re definitely all ears. Thanks so much!
Alexander Flores: Thank you. Bye!