Stephen A.I. Smith


2025-05-27 edit: I did this project in a week back when GPT-2 was OpenAI's latest model and Twitter was still Twitter. The code is messy and not maintained, and the Twitter account @real_stephen_ai no longer exists due to inactivity. I migrated this post anyway because I like coming back to read the tweets and watch the clips. Such a project is sadly meaningless now with the advent of modern AI, but I loved working on it nonetheless, and will *never* forgive the r/NBA mods for deleting my post just as it - and the Twitter account - were gaining traction. Consider me a hater 4 lyfe.

This is just so DISRESPECTFUL

Sometimes a series of events comes together so nicely for a project that I just have to drop every other thing I’m doing. This week, that resulted in Stephen A.I. Smith, a model that generates Stephen A. Smith-style tweets, trained using the “small” 124 million parameter version of OpenAI’s state-of-the-art GPT-2. Twitter accounts like these are nothing new, but it was my *duty* to make this particular one.

I want to make it very clear that I am in no way mocking Stephen A. Smith; I consider him one of the greatest television personalities out there, and he has been a perpetual source of entertainment over the past decade (and will continue to be). This project started out as an exercise comparing RNNs to statistical time series models, but here we are.

Here are some hand-picked tweets that were generated. I replaced ‘@’ with ‘<at>’ to avoid spamming random accounts, and links in tweets were replaced with the <url> tag for data sanitation purposes. For the full effect, try to read these in his voice:
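The sanitation step can be sketched with a couple of regexes. The `<at>` and `<url>` tokens come from the post; the function name and exact patterns are my own illustration:

```python
import re

def sanitize_tweet(text: str) -> str:
    """Replace mentions and links so generated text can't ping real accounts."""
    text = re.sub(r"https?://\S+", "<url>", text)  # links -> <url>
    text = re.sub(r"@\w+", "<at>", text)           # @mentions -> <at>
    return text
```

Running the sanitizer over the scraped tweets before training also means the model learns to emit the placeholder tokens rather than real handles and URLs.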

Some of the tweets are very coherent, others are complete junk, and quite a few are entertaining. Even though the GPT-2 model should produce good generated text, re-training on the Twitter data introduces noise, since tweets typically contain poor spelling/grammar and unknown tokens such as emojis and hashtags (not to mention the cornucopia of equivocal words that is Stephen A’s lexicon).

Another feature of the model is sentence completion, where you can input leading sentences and have the model finish the tweet. For example, leading with “The best player in the NBA is”, we get:

Leading with “The Lakers should trade”:

Leading with “Drake is without a doubt”:

Who is Stephen A. Smith?

This is not the first time Stephen A. Smith has been the subject of applied machine learning, but it is shocking that a Google search of “Stephen A. Smith text generator” yielded no similar projects. The man is an icon in the sports television community. He has about 100 jobs, most notably as the co-host of ESPN’s First Take, and is perhaps best known for his hilarious catchphrases, his insightful commentary, and his starring role on ABC’s General Hospital. If you have the right sense of humor, a few clips are all it takes to appreciate him. Here is a collection of some of his greatest hits (I take no responsibility for any changes to your YouTube recommendations):

How are the tweets generated?

Due to the prohibitive size of the released GPT-2 models, it was difficult to have a streamlined end-to-end process on my local machine (let alone my tiny t2.micro AWS EC2 instance), as even the smallest model is not easily re-trainable on my 1080 Ti. As a result, the process is a bit more nuanced, as illustrated below.

it's pronounced BLEU

Tweets are scraped using a modified version of this repo. Since Stephen A has many short tweets like “thanks” or “appreciate it” or “haa”, I filtered for the most popular tweets (100 or more favs), so that I can re-train the GPT-2 model on his greatest hits. The training itself is facilitated by this useful wrapper and the use of a Google Colab notebook. Once the model is done re-training, I can use it to generate new tweets, or to complete leading prompts. I typically generate 1,000 tweets at a time.
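The filter-and-dump step before fine-tuning might look something like this. The `favs`/`text` field names and function name are my own assumptions about the scraped data; the 100-fav threshold is from the post, and the output is a plain text file of the kind gpt-2-simple fine-tunes on:

```python
def build_training_file(tweets, path, min_favs=100):
    """Keep only the 'greatest hits' and write them out, one tweet per line."""
    popular = [t["text"] for t in tweets if t["favs"] >= min_favs]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(popular))
    return len(popular)
```

This drops the throwaway replies (“thanks”, “appreciate it”, “haa”) so the fine-tuned model only ever sees tweets that landed.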

Of course, it would be very time-consuming to manually sort through the sea of output, so I built a separate scoring model (using essentially all the scraped tweets) to classify tweets with 100+ favs. The generated tweets are fed through this scoring model, and I keep the top 100 to manually review. Finally, to remove any overfitted predictions, I compute the BLEU score of each of the 100 tweets (using the original tweets as reference), and filter out those that are too similar to existing tweets.

This project really made me appreciate the open source community; 10 years ago, doing a comparable project would have taken me months due to the need to code and learn each major step from scratch. To try it out yourself, the GPT-2 training data I used can be found here, and you can use the re-training template in this Colab notebook by Max Woolf.

Additional technical details