Exploring Goodreads Data: An Analysis of 10 Million Books

Goodreads is one of the largest book websites on the internet. It has data about millions of books across genres and languages. It's hard to find a book that isn't on Goodreads, whether it was published hundreds of years ago or just a few days ago.

Today, I present the analysis results of more than 10 million books on Goodreads. In fact, the original dataset that I used had 50+ million books but I excluded 40 million of them for data quality reasons mentioned later in this article.


Goodreads allows you to search for any book and view its info, but there is no way to see all the available books and interact with them. Using the data in this analysis, however, I was able to do just that with millions of titles. Below, I’ll share some interesting findings and provide a method for further exploration at the end.

Continue reading to learn more about the analysis and the data, or jump directly to the results section. Either way, don't forget to read about how to get the most out of this analysis.


Data Used

I used an obscure dataset published on Kaggle a few years ago. Uncompressed, it is 90 GB, which made it hard to work with on my personal computer, so I used cloud services (on AWS) instead. More on that below.

The dataset covers 50+ million books published throughout the years up to 2021. For each book it includes the title, author, publisher, publication date, average rating, number of ratings, number of reviews, number of pages, categories assigned by users, format, and more.

Method and Tools

As mentioned above, the dataset is too large (90 GB) to handle comfortably on a personal computer, so I used AWS to process the data and extract what I needed for the analysis. I uploaded the data to S3, then used Glue to discover the data on S3 and define a structured table on top of it. From there, I used Athena to query the data, explore it, and extract the subset I needed.

The original dataset contains 50+ million books but most of them are not useful for the analysis. For example, some "books" are actually journals, notebooks, planners, calendars, etc. so I excluded them. I also excluded books with no ratings at all, meaning no one had rated them on Goodreads. After that, I ended up with around 23 million books.

Some of these 23 million books were actually the same book but in different editions. For example, a book can have an English edition, a French edition, a Spanish edition, etc. It also can have multiple editions in the same language. I wanted to analyze books, not editions, so I selected one edition for each book. I ended up with around 9 million unique books.
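As a rough illustration, the two filtering steps above could look like this in plain Python. The actual cleaning was done with SQL in Athena and its exact rules are not given here, so everything in this sketch is an assumption: the field names (`work_id`, `ratings_count`, and so on), the title keyword filter, and the choice to keep the most-rated edition of each book.

```python
# Toy records; the field names are hypothetical, not the dataset's real columns.
editions = [
    {"edition_id": 1, "work_id": "w1", "title": "Example Novel", "ratings_count": 900},
    {"edition_id": 2, "work_id": "w1", "title": "Example Novel (French edition)", "ratings_count": 40},
    {"edition_id": 3, "work_id": "w2", "title": "Daily Planner 2020", "ratings_count": 3},
    {"edition_id": 4, "work_id": "w3", "title": "Unrated Novel", "ratings_count": 0},
]

# Assumed keyword filter for non-book items (journals, planners, ...).
NON_BOOK_KEYWORDS = ("journal", "notebook", "planner", "calendar")


def is_real_book(record):
    """Return False when the title looks like a journal, planner, etc."""
    title = record["title"].lower()
    return not any(word in title for word in NON_BOOK_KEYWORDS)


def pick_one_edition_per_work(records):
    """Drop unrated and non-book records, then keep the most-rated edition per work."""
    best = {}
    for rec in records:
        if rec["ratings_count"] <= 0 or not is_real_book(rec):
            continue
        current = best.get(rec["work_id"])
        if current is None or rec["ratings_count"] > current["ratings_count"]:
            best[rec["work_id"]] = rec
    return list(best.values())


books = pick_one_edition_per_work(editions)
print([b["title"] for b in books])
```

A simple keyword filter like this would also catch genuine books with "journal" in the title, so a real pipeline would need more careful rules.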

I used SQL in Athena to clean the data and make it more structured and ready for analysis. Then I downloaded the data to my computer and started the analysis. To perform the analysis, I used Python, Pandas, NumPy, scikit-learn, OpenCV, and other libraries for data processing and analysis inside a Jupyter Notebook.

For the interactive data visualization that you will see below, I used D3.js and in a few cases Observable Plot. This is the first time I have used D3.js extensively for data visualization, and I found it very powerful and flexible. It's now one of my favorite libraries, although it has a steep learning curve. I've used Matplotlib, Seaborn, Plotly, and other libraries for data visualization before, but D3.js is different and more powerful in many ways.

How to Use this Analysis

Here are some quick, important notes to help you get the most out of this analysis.

Filtering: Fiction, Non-Fiction, or All

The analysis covers millions of books. You can filter the analysis results to see only fiction books, only non-fiction books, or all books (the default option). You can do that by clicking on the filter buttons on the right side of the screen. When you select an option, all charts and numbers on the page change to reflect the selected option.

You can also use keyboard shortcuts: Shift + F for fiction, Shift + N for non-fiction, and Shift + A for all.

Interactive Charts

This analysis is interactive, meaning you can hover over the charts to see more info about the data points. For example, you can hover over a book to see its title, author, and other info.


Now let’s begin with the analysis results...

Top Books

Let’s start with the top books on Goodreads. We will look at the most rated books, most reviewed books, and top rated books.

Each book on Goodreads can be rated by users between 1 and 5 stars. The average rating of a book is a good indicator of its quality. The number of ratings a book receives is also a good indicator of its popularity.

Most Popular Books

The following chart shows books that received the largest number of ratings, indicating that they are the most popular on the platform. Hover over any bar to see more info about the book. Click on a book cover to go to its page on Goodreads.

The histogram below shows the distribution of book rating counts. Note that the y-axis is log scaled (i.e. it goes like 1, 10, 100, 1000, etc. instead of 1, 2, 3, 4, etc.) Hover over any bar to see the number of books with that rating count.

Most Reviewed Books

In addition to rating a book between 1 and 5 stars, users can leave a written review where they share their opinion of the book, their notes, recommendations, etc.

Like the number of ratings above, the number of reviews indicates popularity, but it also points to the impact of the book. Rating a book is easy; writing a review usually means you have more to say about the book, whether positive or negative.

The following chart shows books that received the highest number of reviews. Hover over any bar to see more info about the book. Click on a book cover to go to its page on Goodreads.

Top Rated Books

When searching for the top-rated books on Goodreads, it's important to consider not only the average rating but also the number of ratings each book has received. While there are some books with a perfect 5/5 rating, they often have a low number of ratings, which can make the rating less reliable.

To find the most highly-rated books on Goodreads, I used two methods that you can choose between:

Weighted Rating

The weighted rating takes into account both the average rating (out of 5) and the number of ratings a book has received. Average rating is a crucial metric, but relying solely on it can be misleading for books with very few ratings. So we add the number of ratings to the formula to make the result more reliable: an average backed by many ratings is more trustworthy than one backed by a handful.

The weighted rating formula combines these two metrics as follows:

\[ \text{Weighted Rating}_i = \frac{n_i \times r_i}{\sum_{j=1}^{m} n_j} \]

Where:

  • \(i\) is the index of the current book
  • \(n_i\) is the number of ratings for book \(i\)
  • \(r_i\) is the average rating for book \(i\)
  • \(m\) is the total number of books
  • \(\sum_{j=1}^{m} n_j\) is the sum of the number of ratings for all books

This formula ensures that books with a higher number of ratings and a high average rating will rank higher than those with fewer ratings, even if they have a perfect 5/5 average rating.
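To make the formula concrete, here is a minimal sketch in plain Python that applies it to three made-up books. The field names `avg_rating` and `ratings_count` are assumptions for illustration, not the dataset's actual column names.

```python
def weighted_ratings(books):
    """Return {title: n_i * r_i / sum_j n_j} for every book in the list."""
    total_ratings = sum(b["ratings_count"] for b in books)
    return {
        b["title"]: b["ratings_count"] * b["avg_rating"] / total_ratings
        for b in books
    }


# Made-up numbers: book A has a perfect average but only 10 ratings.
books = [
    {"title": "A", "avg_rating": 5.0, "ratings_count": 10},
    {"title": "B", "avg_rating": 4.2, "ratings_count": 90_000},
    {"title": "C", "avg_rating": 3.9, "ratings_count": 9_990},
]

scores = weighted_ratings(books)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the denominator is the same for every book, the ranking is determined by \(n_i \times r_i\) alone. With ratings limited to the 1 to 5 range, the number of ratings has by far the larger influence on the result.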

Percentage of 5-Star Ratings

Another approach to finding top-rated books is to look at the percentage of 5-star (or 5/5) ratings the book has received. For instance, if a book has a 5-star percentage of 40%, it means that 40% of the ratings given to the book are the highest possible rating of 5/5. The remaining 60% of the ratings are split among lower ratings.

By focusing on books with a high percentage of 5-star ratings, we can identify those that have not only received high ratings but have also maintained that high level of satisfaction among a significant number of reviewers.

Note: For the two methods mentioned above, we excluded books with fewer than 500 ratings.
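The 5-star percentage and the 500-rating cutoff can be sketched as follows. This assumes that per-star rating counts are available for each book, which the dataset description above does not explicitly confirm; the dictionary layout is hypothetical.

```python
MIN_RATINGS = 500  # books with fewer ratings than this are excluded


def five_star_share(star_counts):
    """star_counts maps a star value (1..5) to its number of ratings.

    Returns the percentage of all ratings that are 5 stars.
    """
    total = sum(star_counts.values())
    return 100.0 * star_counts.get(5, 0) / total


# Made-up books: A has 500 ratings, B has only 10 and is filtered out.
books = {
    "A": {1: 10, 2: 20, 3: 70, 4: 200, 5: 200},
    "B": {1: 0, 2: 0, 3: 0, 4: 1, 5: 9},
}

eligible = {
    title: five_star_share(counts)
    for title, counts in books.items()
    if sum(counts.values()) >= MIN_RATINGS
}
print(eligible)
```

Book B has a 90% share of 5-star ratings but only 10 ratings in total, which is exactly the kind of case the cutoff is meant to remove.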

Use the filter below to switch between methods. Hover over any book cover in the chart to see more info. Click on a book cover (if you're not on a mobile device) to go to its page on Goodreads.


Average Rating Over Time

The following chart shows the average rating of books published each decade from the 1700s to the 2010s. It allows us to see how book ratings tend to change over time. Hover over the line to see the average rating for each decade.

Note: Books with fewer than 50 ratings were excluded from this analysis.

The following chart shows the "top" book over the years since 2000 in four different categories:

  • Most Popular Books: the book with the highest number of ratings each year.
  • Highest Rated: the book with the highest average rating (out of 5) each year.
  • Popular, Yet Not So Loved: the book with the highest number of ratings in a year, but whose average rating (out of 5) is below average that year.
  • Hidden Gem: the book with the highest average rating (out of 5) in a year, but whose popularity (number of ratings) is below average that year.

Note: Books with 500 ratings or fewer were excluded from this section.
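The four categories above can be expressed as a small per-year selection. The sketch below is one possible reading: "below average that year" is taken to mean below the mean over the eligible books published in that year, which is an assumption, as are the field names.

```python
from collections import defaultdict
from operator import itemgetter
from statistics import mean

MIN_RATINGS = 500  # books with 500 ratings or fewer are excluded


def top_books_by_year(books):
    """Pick one book per category for every publication year."""
    by_year = defaultdict(list)
    for b in books:
        if b["ratings_count"] > MIN_RATINGS:
            by_year[b["year"]].append(b)

    count, rating = itemgetter("ratings_count"), itemgetter("avg_rating")
    result = {}
    for year, group in by_year.items():
        avg_rating = mean(rating(b) for b in group)
        avg_count = mean(count(b) for b in group)
        not_loved = [b for b in group if rating(b) < avg_rating]
        little_known = [b for b in group if count(b) < avg_count]
        result[year] = {
            "most_popular": max(group, key=count)["title"],
            "highest_rated": max(group, key=rating)["title"],
            "popular_not_loved": max(not_loved, key=count)["title"] if not_loved else None,
            "hidden_gem": max(little_known, key=rating)["title"] if little_known else None,
        }
    return result


# Made-up books, all published in the same year.
books = [
    {"title": "P", "year": 2001, "avg_rating": 4.5, "ratings_count": 10_000},
    {"title": "Q", "year": 2001, "avg_rating": 3.0, "ratings_count": 8_000},
    {"title": "R", "year": 2001, "avg_rating": 4.8, "ratings_count": 600},
    {"title": "S", "year": 2001, "avg_rating": 5.0, "ratings_count": 300},  # excluded
    {"title": "T", "year": 2001, "avg_rating": 4.9, "ratings_count": 9_000},
]

tops = top_books_by_year(books)
print(tops[2001])
```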

Hover over any book to see more info about it.

Book Topics and Genres

The following word cloud shows the most common words in book titles over time. This can give us an idea of the most popular topics in books over the years.

Select a decade below to see the most common words in book titles for that decade. 1950s is selected by default.

In the word cloud, the size of a word reflects its frequency in book titles: the bigger the word, the more often it was used. Also, when you go from one decade to another, words that weren't common in the previous decade appear in a different color. This helps you see how the most common words in book titles have changed over time.

Hover over any word to see the exact number of times it was used in book titles and some of the book titles where it was used.
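The counting behind a word cloud like this can be sketched with `collections.Counter`. The stopword list, the tokenization rule, and the definition of a "new" word (in the current decade's top words but not in the previous decade's) are all assumptions made for illustration; the titles are toy examples.

```python
import re
from collections import Counter

# Assumed stopword list; the one used for the chart is not given.
STOPWORDS = {"the", "of", "a", "and", "in", "to"}


def title_word_counts(titles):
    """Count how often each non-stopword appears across all titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


def new_common_words(current, previous, top_n):
    """Top words of the current decade that were not top words one decade earlier."""
    previous_top = {word for word, _ in previous.most_common(top_n)}
    return [word for word, _ in current.most_common(top_n) if word not in previous_top]


# Toy titles for two consecutive decades.
titles_1940s = ["War Stories", "The War", "Love and War", "Love Letters"]
titles_1950s = ["Space Patrol", "Love in Space", "Space War", "Love Notes"]

counts_1940s = title_word_counts(titles_1940s)
counts_1950s = title_word_counts(titles_1950s)
highlighted = new_common_words(counts_1950s, counts_1940s, top_n=2)
print(counts_1950s.most_common(2), highlighted)
```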


Genre Popularity Over Time

The following chart shows the popularity of different genres over the years between 1901 and 2013. It allows us to see how the popularity of different genres has changed over time. We focus on 10 of the most popular genres on Goodreads. Popularity is measured by the number of books published each year under each of the genres.

Use the checkboxes below to remove/add genres to the chart. Hover over a line to see more info about the genre popularity over the years.


Do you want to look at the data* and interact with it? Click the button below to get access to a dashboard with hundreds of thousands of books to explore.

*The dashboard shows books used in this analysis that have more than 500 ratings.

Authors

Let’s now explore book authors. We will look at the most prolific authors, the most popular authors, and the top-rated authors.

Most Prolific Authors

The following graph shows the authors who produced the largest number of books. Hover over any bar to see more info about the author and their most popular books.

Only books with more than 50 ratings were considered to determine the most prolific authors.

Most Popular Authors

The following graph shows the authors whose books received the largest number of ratings. This is done by combining the number of ratings for all books by the author. Hover over any bar to see more info about the author and their most popular books.

Top Rated Authors

The following graph shows the authors whose books received the highest average ratings. This is done by calculating the average rating for all books by the author. Hover over any author name to see more info about the author and their most popular books.

Author rating is calculated by averaging the ratings of all books by the author weighted by the number of ratings each book received. The following formula was used to calculate each author rating:

\[ \text{Author Rating} = \frac{\sum_{i=1}^{n} w_i r_i}{\sum_{i=1}^{n} w_i} \]

where \( r_i \) is the average rating of the author's \( i \)-th book, \( w_i \) is the number of ratings (the weight) that book received, and \( n \) is the total number of books by the author.

Only authors who authored more books than the average number of books per author and who received more than 50 ratings per book on average were considered.
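The weighted average and the two eligibility rules can be sketched as below. Field names are assumptions. The `key` parameter is a convenience added for this sketch so that the same function can group by any field, such as a publisher name.

```python
from collections import defaultdict

MIN_AVG_RATINGS_PER_BOOK = 50


def weighted_group_ratings(books, key="author"):
    """Weighted average rating per group, after applying both eligibility rules."""
    groups = defaultdict(list)
    for b in books:
        groups[b[key]].append(b)

    avg_books_per_group = len(books) / len(groups)
    ratings = {}
    for name, group in groups.items():
        total_weight = sum(b["ratings_count"] for b in group)
        if len(group) <= avg_books_per_group:
            continue  # not more books than the average author
        if total_weight / len(group) <= MIN_AVG_RATINGS_PER_BOOK:
            continue  # not more than 50 ratings per book on average
        weighted_sum = sum(b["ratings_count"] * b["avg_rating"] for b in group)
        ratings[name] = weighted_sum / total_weight
    return ratings


# Made-up data: X qualifies, Y has too few books, Z has too few ratings.
books = [
    {"author": "X", "avg_rating": 4.0, "ratings_count": 100},
    {"author": "X", "avg_rating": 5.0, "ratings_count": 300},
    {"author": "X", "avg_rating": 3.0, "ratings_count": 100},
    {"author": "Y", "avg_rating": 4.9, "ratings_count": 1000},
    {"author": "Z", "avg_rating": 5.0, "ratings_count": 10},
    {"author": "Z", "avg_rating": 5.0, "ratings_count": 20},
    {"author": "Z", "avg_rating": 5.0, "ratings_count": 30},
]

author_scores = weighted_group_ratings(books)
print(author_scores)
```

For author X the unweighted mean of the three book ratings would be 4.0, while the weighted version is pulled toward the 5.0 book because it has three times as many ratings as either of the others.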

Publishers

Now let’s head to the publishers. Similar to authors, we will look at the most prolific publishers, the most popular publishers, and the top-rated publishers.

Most Prolific Publishers

The following graph shows the publishers who published the largest number of books. Hover over any bar to see more info about the publisher and the most popular books they published.

Only books with more than 50 ratings were considered to determine the most prolific publishers.

Most Popular Publishers

The following graph shows the publishers whose books received the largest number of ratings. This is done by combining the number of ratings for all books published by the publisher. Hover over any bar to see more info about the publisher and their most popular books.

Top Rated Publishers

The following graph shows the publishers whose books received the highest average ratings. This is done by calculating the average rating for all books published by the publisher. Hover over any publisher name to see more info about the publisher and their most popular books.

Publisher rating is calculated by averaging the ratings of all books published by the publisher weighted by the number of ratings each book received. The following formula was used to calculate each publisher rating:

\[ \text{Publisher Rating} = \frac{\sum_{i=1}^{n} w_i r_i}{\sum_{i=1}^{n} w_i} \]

where \( r_i \) is the average rating of the publisher's \( i \)-th book, \( w_i \) is the number of ratings (the weight) that book received, and \( n \) is the total number of books published by the publisher.

Only publishers who published more books than the average number of books per publisher and who received more than 50 ratings per book on average were considered.

Book Series

A series is a collection of books that are connected by a shared theme or story. Some of the books we are looking at are part of a series. In this section, we will look at the longest series and the most popular series.

Longest Book Series

The following chart shows the top 20 series with the largest number of books. Each small rectangle represents a book in the series. Hover over a series title to see more info about it including its most popular books.

Only books with more than 50 ratings were considered to determine the longest series.

Most Popular Series

The following chart shows the series that received the largest number of ratings across all their books. Hover over any bar to see the most popular book in the series along with other info.

Number of Pages

Let’s now talk about the number of pages. We will look at the distribution of the number of pages for the books, then see how the typical number of pages changes over time.

Number of Pages Distribution

The following chart shows the distribution of the number of pages for the books; it allows us to see the variety in book lengths. The x-axis represents the number of pages and the y-axis represents the number of books with that number of pages. Hover over any bar to see the number of books with that number of pages.

Note that the y-axis is log scaled.

Books that don't have an actual number of pages (like books in audio formats) and books with more than 10,000 pages are excluded from this analysis.

Number of Pages Over Time

The following chart shows the median number of pages for books published each decade between the 1700s and the 2020s. This is a good indicator of the typical length of books over time. We use the median, not the average, because a few books with an extremely high number of pages would skew the average, while the median is robust to such outliers.

Books that don't have an actual number of pages (like books in audio formats) are excluded from this analysis.
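A small sketch of the per-decade median, using Python's `statistics.median`. The field names and page counts are made up, and treating a missing or zero page count as "no actual number of pages" is an assumption.

```python
from collections import defaultdict
from statistics import mean, median


def median_pages_by_decade(books):
    """Median page count per publication decade, skipping books without pages."""
    pages = defaultdict(list)
    for b in books:
        if not b.get("num_pages"):  # missing or zero, e.g. audio formats
            continue
        decade = b["year"] // 10 * 10
        pages[decade].append(b["num_pages"])
    return {decade: median(values) for decade, values in sorted(pages.items())}


# Made-up books; the 5,000-page one is an outlier.
books = [
    {"year": 1991, "num_pages": 200},
    {"year": 1995, "num_pages": 300},
    {"year": 1999, "num_pages": 5000},
    {"year": 2003, "num_pages": 250},
    {"year": 2004, "num_pages": 350},
    {"year": 2008, "num_pages": None},  # audiobook, excluded
]

medians = median_pages_by_decade(books)
mean_1990s = mean([200, 300, 5000])
print(medians, round(mean_1990s))
```

In the toy 1990s group the single 5,000-page book drags the mean above 1,800 pages, while the median stays at 300.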

Book Covers

Popular Colors in Book Covers

The following chart shows the most popular colors in book covers. You can see that we show colors for three groups of books:

  • Most rated 1,000 books
  • Top rated 1,000 books
  • Random sample of 1,000 books that are not in the above two groups

The size of each color reflects its popularity, meaning colors with wider areas were used more frequently in book covers.

To get the most popular colors in each of the three groups mentioned above, I downloaded the images of the 1,000 book covers in the group and extracted the colors of all pixels in the images. I then converted the colors to the CIELAB color space, which is designed to approximate human vision, and ran a clustering algorithm on all the colors to group them into 15 clusters. Finally, I took the color at the center of each cluster and the number of pixels that belong to it. The results are what you're looking at below.

Books with no cover image or with 50 ratings or less were excluded from this analysis.
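The color pipeline described above can be illustrated with a dependency-free sketch. The clustering algorithm used for the chart is not named, so k-means is an assumption here, and the conversion and clustering are written out by hand for illustration rather than calling OpenCV or scikit-learn. The toy input is two solid colors and 2 clusters instead of real cover pixels and 15 clusters.

```python
import random

D65_WHITE = (0.95047, 1.0, 1.08883)  # reference white for XYZ -> Lab


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 white point)."""
    def to_linear(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    xyz = (
        0.4124564 * r + 0.3575761 * g + 0.1804375 * b,
        0.2126729 * r + 0.7151522 * g + 0.0721750 * b,
        0.0193339 * r + 0.1191920 * g + 0.9503041 * b,
    )

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = (f(v / w) for v, w in zip(xyz, D65_WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's algorithm; returns (centers, number of points per center)."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(points)), k)  # start from k distinct colors
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    sizes = [0] * k
    for p in points:
        sizes[nearest(p, centers)] += 1
    return centers, sizes


# Toy "cover": 30 red pixels and 10 blue pixels.
pixels = [(255, 0, 0)] * 30 + [(0, 0, 255)] * 10
lab_pixels = [srgb_to_lab(p) for p in pixels]
centers, sizes = kmeans(lab_pixels, k=2)
print(sorted(sizes))
```

The cluster sizes correspond to the widths of the color areas in the chart: a color that covers more pixels gets a wider area.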

If you've enjoyed this analysis and you want to look at the data* and interact with it, click the button below to get access to a dashboard with hundreds of thousands of books to explore.

*The dashboard shows books used in this analysis that have more than 500 ratings.

👋
If you have any questions, feedback, or suggestions, you can email me at ammar5656 at gmail dot com, or you can reach me on X (link below)