Welcome to ‘In-Depth,’ where we explore AI and music topics, covered in great detail. Today, we focus on predictive streaming, the foundation of our company. This article might take you about 11 minutes to read.
Predictive streaming models are AI systems that forecast the popularity and reception of songs on streaming platforms. With the ability to anticipate how musical works will resonate with listeners, these models provide valuable insights for artists, labels, and industry professionals making high-stakes decisions. But what exactly are predictive streaming models, and how are they used?
In this essay, I’ll share the perspective of our company, Unbias, on predictive streaming, based on our research and hands-on development in this space. Specifically, we’ve moved towards more nuanced modeling that goes beyond a single data type to understand audience engagement. After explaining different types of models, I’ll discuss how combining multiple data types, from audio to text to tabular data, helps us understand listener engagement.
Predictive Streaming Models
Predicting streams is a difficult task, but our research and our work with customers and partners in streaming suggest that it is possible to estimate a song’s trajectory with high accuracy. The goal of a prediction is to manage change and inform important decisions related to our art. These decisions could involve marketing, A&R, financial valuation, or even simple curiosity. From an artist’s perspective, predictive streaming can be a proxy for feedback on a rough mix. An A&R or label manager could use it as an analysis tool to compare against their existing forecasting methods. A distributor deciding whether to make an advance to a new artist could estimate the recoupment date and then advance more or less accordingly.
Diverse perspectives exist within predictive streaming depending on the platform and the intended use case. Spotify, for example, although primarily known for its recommendation engine, utilizes similar underlying data to inform predictions related to playlist placements or identifying trending songs. TikTok might employ predictive analytics to gauge the virality potential of a particular track, based on various engagement metrics and user behavior. In both cases, the end goal might differ—be it user engagement or more accurate song placements—but the underlying imperative remains the same: to anticipate how an audience will react to a given piece of music. Being prepared in this way enables us to serve the audience better.
Architecturally, predictive streaming models differ from AI music generation and ChatGPT. They focus narrowly on predicting streaming from different types of data inputs. This makes their behavior more understandable compared to emerging generative AI. But effectively combining different data modalities remains an active challenge when building these models. The end objective is accurate and fair systems that provide value to the music community. It is with this foundational understanding of predictive streaming that we venture into the technological nuts and bolts of these models, beginning with how they learn from data.
Learning
We’ll begin by explaining how computers learn from data, how data is represented, and how a model can predict streams. After that, we’ll delve into three designs of stream prediction: binary classification, regression, and forecasting. Finally, we’ll discuss the different types of data on which these models are trained, and why strong models don’t necessarily require a large dataset, but rather the intelligent amalgamation of various types of data and models.
First, predictive streaming models need a way to learn from available data. We can give computers many types of data related to the song or artist, like audio, text, images, or tabular data (CSV files). The computer needs to transform the data into numbers that it can understand. Each type of data is transformed into a format that is efficient for computation and contributes to the overall model performance.
You can imagine a model as a student learning by observing. You start by teaching it different types of learning tasks. A common example is learning to distinguish between Jazz and Pop songs. The way you teach it is by sharing many audio files of both genres. Ideally, you want to share the same number of examples of each. The student learns by observing various features of the audio and assigning a weight to each learned feature, indicating its importance in distinguishing between Jazz and Pop. When a student learns to tell two categories apart, that’s called a binary classification task.
After it has learned from many examples, we test the student by giving it a new song it has never seen before. It multiplies each of the song’s features by the corresponding learned weight and sums the results. This sum goes through a special formula called the “sigmoid function,” which transforms the sum into a value between 0 and 1.
This value is the student’s confidence level that the song belongs to a particular category, say Jazz. For instance, a value of 0.8 would indicate 80% confidence that the song is a Jazz song. This method of learning is called Supervised Learning, and it is “supervised” because we shared with the student prior examples to learn from. The algorithm used in this example is a Logistic Regression.
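To make the arithmetic concrete, here is a minimal sketch of that prediction step. The feature names and weight values are invented for illustration; a real model would learn its weights from many labeled songs and would use far more features.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def jazz_confidence(features: list[float], weights: list[float]) -> float:
    """Multiply each feature by its weight, sum the products, apply the sigmoid."""
    weighted_sum = sum(f * w for f, w in zip(features, weights))
    return sigmoid(weighted_sum)


# Hypothetical features of a new song: [swing_feel, brass_presence, vocal_hook]
new_song = [0.9, 0.8, 0.3]
# Hypothetical learned weights: positive pushes toward Jazz, negative toward Pop
learned_weights = [1.5, 1.2, -0.8]

confidence = jazz_confidence(new_song, learned_weights)
print(f"Confidence that the song is Jazz: {confidence:.2f}")
```

A weighted sum of exactly zero maps to 0.5, which is the student saying it cannot tell the two genres apart.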
The same Logistic Regression model can also learn to identify different types of songs grouped by their streaming performance. For example, giving it examples of songs that reached 500K streams in 30 days versus ones that didn’t. This type of classification model is beneficial because it’s an easy and quick prediction for understanding if the song has potential.
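As a rough sketch of how such a classifier could be trained, the example below labels songs by whether they reached 500K streams in 30 days and fits the weights with plain gradient descent. The two features, the six data points, the learning rate, and the number of passes are all made up for illustration; this is not a description of our production training code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias by gradient descent on the logistic loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # positive when the model is too optimistic
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def hit_probability(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical features per song: [playlist_adds_scaled, presaves_scaled]
songs = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.3, 0.2], [0.1, 0.4]]
# Label 1 = reached 500K streams in 30 days, 0 = did not (invented labels)
reached_500k = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(songs, reached_500k)
print(f"Strong song: {hit_probability([0.85, 0.7], weights, bias):.2f}")
print(f"Weak song:   {hit_probability([0.15, 0.2], weights, bias):.2f}")
```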
Another type of learning task is to predict a number by observing many other numbers seen in the data. For example, predict the number of streams of a Jazz song within the first month after release. Similar to logistic regression, each audio feature is assigned a weight. These weights are learned during training, and they indicate how much each feature contributes to the final predicted number.
To make a prediction for a new song, the model takes the song’s features and multiplies them by these learned weights. The results are summed up, and a constant number known as a “bias” is usually added to this sum. Unlike logistic regression, which outputs a probability between 0 and 1, this model gives you a direct numerical estimate. This type of algorithm is called Linear Regression.
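The prediction step of a linear regression can be sketched in a few lines. The features, weights, and bias below are hypothetical values chosen so the arithmetic is easy to follow, not numbers from a trained model.

```python
def predict_streams(features, weights, bias):
    """Weighted sum of the features plus the bias: a direct numerical estimate."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Hypothetical features: [energy, playlist_adds, followers_in_thousands]
features = [0.5, 12, 40]
# Hypothetical learned weights, one per feature, and a learned bias
weights = [5000.0, 800.0, 150.0]
bias = 2000.0

# 0.5 * 5000 + 12 * 800 + 40 * 150 + 2000 = 2500 + 9600 + 6000 + 2000
first_month = predict_streams(features, weights, bias)
print(f"Predicted first-month streams: {first_month:,.0f}")
```

Note that nothing squashes the result here: the output can be any number, which is what separates this from the classification case.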
The final type of model I want to share is Forecasting.
In the context of predictive streaming, forecasting is like looking into a crystal ball, but backed by data and algorithms rather than mysticism. It’s the model’s way of saying, “based on everything I’ve seen so far, here’s what I think will happen next.” But instead of making a single prediction for the next moment, it makes a series of predictions for multiple future steps.
So, let’s say we have data on how a song has been streaming for the past three months. Forecasting would involve the model taking this past data to predict the streaming numbers for the next week, month, or even longer. It’s not just saying, “Tomorrow will look like this,” but also adding, “And here’s what the trend will probably look like for the next few weeks.”
The model works by identifying trends within the data. This data is structured differently from the data used to tell Jazz from Pop: it is organized as a series of daily events. For instance, day 1: 300 streams, day 2: 500 streams, and so on, for many songs and artists. The model might notice that a song gains more streams on weekends or sees a boost after being featured in a popular playlist. The model then uses mathematical equations to best capture these observed patterns. This fitted equation serves as the model’s “rulebook” for making future predictions.
When it’s time to forecast future streams, the model uses this rulebook. It takes into account the most recent data and combines it with the identified patterns to project what the next set of data points (in our case, the number of future streams) might look like. So unlike Logistic Regression, which outputs probabilities, and Linear Regression, which outputs one number at a time, forecasting models can be designed to output a sequence of numbers. The 10-day weather outlook in your app is an example of what a forecasting model produces.
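Here is a deliberately simple sketch of that idea: fit a straight-line trend to the daily stream counts, learn the average bump for each day of the week, and project both forward. The synthetic series, the seven-day season, and the function names are assumptions for illustration; real forecasting models are considerably more sophisticated.

```python
from statistics import mean


def fit_trend(series):
    """Least-squares straight line through (day_index, streams)."""
    n = len(series)
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(series)) / sum(
        (x - x_bar) ** 2 for x in range(n)
    )
    return slope, y_bar - slope * x_bar


def forecast(series, horizon=7, season=7):
    """Project the trend forward and add the average effect of each weekday."""
    slope, intercept = fit_trend(series)
    leftovers = [y - (intercept + slope * day) for day, y in enumerate(series)]
    weekday_effect = [mean(leftovers[k::season]) for k in range(season)]
    start = len(series)
    return [
        intercept + slope * day + weekday_effect[day % season]
        for day in range(start, start + horizon)
    ]


# Synthetic history: steady growth of 10 streams a day plus a weekend bump of 100
history = [300 + 10 * day + (100 if day % 7 in (5, 6) else 0) for day in range(28)]

next_week = forecast(history)
for offset, value in enumerate(next_week, start=1):
    print(f"day +{offset}: {value:.0f} streams")
```

The output is a sequence of seven numbers rather than a single one, and the weekend days in the forecast sit visibly above the weekdays because the “rulebook” has captured that pattern.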
So, if your song has been steadily gaining streams for the last 30 days and suddenly experiences a jump in numbers due to a playlist feature, the model will incorporate both the steady growth and the sudden increase into its future predictions. However, it’s good to remember that these are still “educated guesses,” influenced by the quality and quantity of past data the model has seen.
In the world of predictive streaming, this is useful for planning out longer-term activities. For example, if the model forecasts that a song will see a dip in streams, that could be a cue to ramp up marketing efforts. Or if it sees a potential spike, maybe it’s a good time to release an accompanying music video to capitalize on the song’s growing popularity.
Forecasting is inherently about dealing with uncertainty. Even the best models can’t predict the future perfectly. External factors like a song suddenly going viral on social media can always throw off predictions. So, while forecasting gives us a likely script of future events based on past data, it’s important to use it as a guide rather than an absolute certainty. Forecasting is a hard problem to solve because the data itself is a long series of events that is difficult to analyze by inspection alone.
But how do we quantify the effectiveness of these predictive models? This leads us to the concept of accuracy.
Accuracy
To gauge the effectiveness of a linear regression model, we usually examine the “error” in predictions made for songs where we already know the actual outcome. The error is the discrepancy between the model’s prediction and the actual number of streams. For example, if the model predicts 1,000 streams for a song that actually garnered 1,200, the error is 200.
We square these errors and then average them across all songs in our sample dataset. This average is known as the Mean Squared Error (MSE), which measures the model’s average squared prediction error.
Squaring the errors serves two purposes: it eliminates negative values, ensuring that underestimates and overestimates don’t cancel each other out, and emphasizes larger errors, making substantial discrepancies more noticeable.
A lower MSE indicates a more accurate model, while a higher MSE suggests room for improvement. It helps us assess the model’s performance in terms of how close or far it is from spot-on predictions.
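The calculation is short enough to write out. The three songs below are invented; the first one reuses the 1,000 predicted versus 1,200 actual example from above.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual_streams = [1200, 800, 1500]
predicted_streams = [1000, 900, 1500]

# Errors are 200, -100 and 0; squared they become 40000, 10000 and 0
mse = mean_squared_error(actual_streams, predicted_streams)
print(f"MSE: {mse:.2f}")
```

The underestimate (200) and the overestimate (-100) both add to the total once squared, and the larger miss counts four times as much as the smaller one.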
Model training often involves a tradeoff between overfitting and generalizability, both of which impact accuracy. Overfitting occurs when a model becomes too specialized, memorizing specific patterns in the training data that don’t generalize well to new data.
For example, a model trained on 1,000 songs might memorize specific audio patterns linked to individual song nuances. While this leads to accurate predictions for known songs, the model will struggle to generalize these learnings to new songs.
In contrast, a generalizable model finds broader patterns, like “more energetic songs tend to have higher stream counts,” enabling it to make decent predictions even for new songs.
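The contrast can be illustrated with a toy experiment. Below, a “memorizer” simply returns the stream count of the most similar training song, while a straight-line fit captures the broad pattern that more energetic songs stream more. The energy values and stream counts are synthetic, and both models are stand-ins chosen for clarity rather than models we actually use.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_line(xs, ys):
    """Least-squares line: the 'broad pattern' model."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def memorizer(xs, ys):
    """Returns the answer of the closest training song: pure memorization."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


# Synthetic training songs: energy score and noisy stream counts
train_energy = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]
train_streams = [250, 50, 550, 350, 850, 750]
# Synthetic new songs the models have never seen
test_energy = [0.25, 0.55, 0.85]
test_streams = [250, 550, 850]

results = {}
for name, model in [
    ("memorizer", memorizer(train_energy, train_streams)),
    ("line", fit_line(train_energy, train_streams)),
]:
    train_error = mse(train_streams, [model(x) for x in train_energy])
    test_error = mse(test_streams, [model(x) for x in test_energy])
    results[name] = (train_error, test_error)
    print(f"{name:9s} train MSE: {train_error:9.1f}  test MSE: {test_error:9.1f}")
```

The memorizer is perfect on songs it has already seen and noticeably worse on new ones, which is the signature of overfitting.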
The ultimate goal for most practitioners in predictive streaming is to minimize error rates. This turns the machine learning field into an empirical, scientific process involving ongoing data collection, model training, error analysis, and refinement.
Many other industries, such as healthcare, real estate, and finance, benefit from open-source datasets and shared academic research, while the music industry often lacks such resources. Currently, data is either hidden or purchased from third-party vendors, limiting the field’s advancement. Therefore, we often depend on internal benchmarking methods to assess a model’s effectiveness.
This reliance on limited data points to a broader question: How can we improve the predictive accuracy of streaming models? One promising approach involves incorporating multiple types of data, or “modalities,” into our models.
Multimodal
In predictive streaming, underperforming models often indicate incomplete or insufficient data. To improve predictive accuracy, models should account for as many variables affecting streaming as possible. One approach to this is the inclusion of multiple types of data that reflect the diverse digital experiences of listeners.
These modalities can include audio signals, news articles, PR, social media comments, previous streaming data, genre, and market information. Think of each modality as a “brick” in a structure. Each brick needs an encoder to transform its respective data type into rich features that help improve the model’s performance.
For audio data, encoders transform fractions of audio signals into numerical values that computers can understand. For example, Meta’s EnCodec, used in their AI music generation model MusicGen, serves this purpose. Unbias is also developing its own open-source audio encoder.
Text-based data, such as news articles, lyrics, and comments, are fed into a text encoder. This encoder aims to understand the meaning and relationships between words. Before processing, the text is broken down into smaller “tokens,” much like slicing a loaf of bread. This process is known as tokenization.
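As a simplified illustration of tokenization, the sketch below splits a sentence into lowercase word tokens and then maps each token to an integer ID. Production text encoders typically use subword tokenizers with much larger vocabularies, so treat the splitting rule and the tiny vocabulary here as assumptions made for readability.

```python
import re


def tokenize(text):
    """Lowercase the text and slice it into word-level tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("Miles Davis plays the trumpet.")
print(tokens)  # ['miles', 'davis', 'plays', 'the', 'trumpet']

# Give every distinct token an integer ID so the encoder can work with numbers
vocabulary = {token: index for index, token in enumerate(sorted(set(tokens)))}
token_ids = [vocabulary[token] for token in tokens]
print(token_ids)  # [1, 0, 2, 3, 4]
```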
The text encoder doesn’t just skim the content; it dives deep into understanding word relationships. For instance, if it reads a book on Jazz music and frequently sees “Miles Davis” associated with the word “trumpet,” it’s more likely to link the two in future contexts.
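A very crude stand-in for this kind of learned association is counting which words appear in the same sentence as a given word. Real text encoders learn far richer relationships than raw counts, and the three sample sentences below are invented, so this is only meant to show where the signal comes from.

```python
import re
from collections import Counter


def cooccurrence(sentences, anchor):
    """Count the words that share a sentence with the anchor word."""
    counts = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if anchor in words:
            counts.update(words - {anchor})
    return counts


sample_sentences = [
    "Miles Davis played the trumpet.",
    "The trumpet of Miles Davis shaped modern Jazz.",
    "Bill Evans played the piano.",
]

neighbors = cooccurrence(sample_sentences, "davis")
print("trumpet:", neighbors["trumpet"])
print("piano:", neighbors["piano"])
```

Because “trumpet” keeps showing up next to “davis” while “piano” never does, even this simple count links the first pair more strongly than the second.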
Other types of data, such as those found on Chartmetric or SoundCharts (playlisting, social media engagement, TikTok plays, chart positions, and so on), are also transformed by encoders into numerical values.
After transforming different types of data through their respective encoders, the next step is combining these encoded features for predictive modeling. In the context of a linear regression model, this is more straightforward than it might initially seem.
Each encoder outputs numerical values, called vectors, that represent the essence of the original data, whether it’s audio signals, textual relationships, or playlist metrics. These vectors become the feature set for our Linear Regression model. Importantly, they aren’t arranged arbitrarily; it takes experimentation to find the combination that works.
Imagine these feature vectors as ingredients in a recipe. Alone, each has its unique flavor and texture, but combined in precise proportions, they create a dish greater than the sum of its parts. In our model, each encoded feature set—whether it’s from audio, text, or charts—serves as an ingredient. The Linear Regression model learns the best “recipe” to predict future streaming numbers, leveraging the diversity and richness of the multimodal data.
The key here is to remember that each modality is like a unique lens through which to understand the music’s appeal. Combining these lenses gives us a much richer, more nuanced view. The linear regression model serves as the framework that holds these lenses in place, allowing us to see clearly what drives a song’s streaming success.
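One simple way to combine modalities, sketched below, is to concatenate the vectors from each encoder into a single feature list and feed that to the linear model. The vector values, their lengths, the weights, and the bias are all hypothetical; real encoder outputs are much longer, and concatenation is only one of several possible fusion strategies.

```python
# Hypothetical encoder outputs for one song (real vectors are far longer)
audio_vector = [0.12, -0.40, 0.88]  # from an audio encoder
text_vector = [0.05, 0.61]          # from a text encoder (lyrics, press, comments)
chart_vector = [0.30, 0.75, 0.10]   # from playlist and chart metrics

# Concatenate the modalities into one feature set, in a fixed order
song_features = audio_vector + text_vector + chart_vector

# Hypothetical learned weights, one per feature, plus a bias
weights = [900.0, -200.0, 1500.0, 400.0, 2200.0, 3000.0, 5000.0, 800.0]
bias = 1000.0

predicted_streams = sum(f * w for f, w in zip(song_features, weights)) + bias
print(f"{len(song_features)} features, predicted streams: {predicted_streams:,.0f}")
```

The order of concatenation must stay the same between training and prediction, because each weight is tied to a position in the combined vector.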
After combining different types of data through their respective encoders, the next challenge lies in understanding how each modality, or ‘ingredient,’ affects the model’s predictions. This is where concepts like Feature Importance come into play. These techniques help in quantifying the impact of each feature on the prediction. For example, feature importance can tell us how significant ‘social media engagement’ or ‘previous streaming data’ are in predicting a song’s future success. Similarly, other advanced methods provide more granular insights by explaining the effect of a single feature value in the context of a particular prediction.
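One common way to estimate feature importance is permutation importance: shuffle a single feature’s values across songs and measure how much the model’s error grows. The sketch below applies it to a hypothetical two-feature linear model; the feature names, weights, data, and number of repeats are assumptions for illustration, and it is not necessarily the technique used in our own models.

```python
import random


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Average increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(row) for row in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            increases.append(mse(targets, [predict(row) for row in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def predict(row):
    """Hypothetical fitted model: [playlist_adds, social_engagement] -> streams."""
    return 900.0 * row[0] + 50.0 * row[1] + 100.0


songs = [[1, 0.2], [2, 0.9], [3, 0.1], [4, 0.7], [5, 0.3], [6, 0.8]]
noise = [20, -30, 10, -10, 30, -20]  # made-up gap between model and reality
actual_streams = [predict(row) + e for row, e in zip(songs, noise)]

importance = permutation_importance(predict, songs, actual_streams)
for name, score in zip(["playlist_adds", "social_engagement"], importance):
    print(f"{name:18s} importance: {score:12.1f}")
```

Scrambling the playlist column hurts the predictions far more than scrambling the social column, so this particular model leans mostly on playlisting.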
Understanding the importance of individual features enables artists, labels, and marketers to take targeted actions. If ‘playlisting’ turns out to be a highly influential feature, then focusing on strategies that get a song featured in popular playlists could be beneficial. Or if ‘genre’ is highly impactful, artists could consider cross-genre collaborations to tap into different audience bases. In essence, these techniques not only enhance our predictive models but also offer actionable insights that can be directly applied to real-world strategies.
As it stands, predictive streaming models and audio encoders are relatively obscure in the marketplace, and few are available as open-source tools. This lack of availability may be due to either technical challenges in building these models or their insufficient accuracy in replacing current methods. However, the advantages of developing a robust, open-source model could be game-changing for artists. The key criteria for success in predictive streaming models include not only accuracy but also safety, ethical considerations, and interpretability.
Takeaways
Predictive streaming is a rapidly evolving field with significant implications for the music industry. It sparks important questions: How can predictive models amplify an artist’s reach without diluting their craft? Do we risk creating a formulaic music landscape by relying too much on data? How can we democratize access to data for upcoming artists?
In our current ecosystem, streaming numbers hold immense sway, affecting everything from an artist’s market value to their ability to book gigs. Yet, my perspective is that predictive streaming models won’t replace the intuition and expertise of industry professionals; instead, they will serve as valuable tools to guide decision-making. For instance, a record label might use predictive streaming data to identify under-the-radar talent, while a distributor could use it to help market the artist the right way.
The future I see involves more transparent sharing of AI tools between artists, labels, and technology providers. The openness of this data could bring about a more equitable music industry, where both established and emerging artists can leverage insights to better connect with their audience.
The artists poised to benefit the most from predictive streaming are those who understand how to fuse data with creativity. They will use predictive models as a sort of “digital A&R,” identifying trends and preferences, but not letting the numbers dictate their art. They might continuously feed the models different versions of their songs and test the variations quickly on TikTok to accelerate their traction. For these artists, predictive streaming could serve as a creative catalyst, enriching their understanding of audience preferences without stifling their artistic voice.
However, there are ethical concerns to consider, such as how predictive algorithms might unintentionally reinforce existing biases in the music industry. For instance, algorithms trained on historical data may perpetuate gender, racial, or genre-based disparities in exposure and opportunities.
If the majority of popular songs in the training set are by male artists from certain genres, the model might be more likely to predict higher success rates for similar tracks, thereby sidelining female artists or those from underrepresented genres. Additionally, algorithms could also be susceptible to commercial biases, favoring artists backed by larger labels over independent or emerging talents.
Therefore, as we advance in this space, transparency in how these models work and are trained will be vital to ensure fairness.
In conclusion, predictive streaming offers an exciting layer of possibility for the music industry, but it’s not a one-size-fits-all solution. The real magic happens when data insights align with artistic intuition, allowing the music to resonate on both a commercial and emotional level. As technology continues to evolve, so too will the symbiosis between data science and artistic creation, leading to an enriched musical landscape that benefits both artists and audiences.