Book Store Data Set Analyzed


Published in Dev Genius · 7 min read · Dec 14, 2021

Extra: Using R to Analyze a bookstore data set.


I was scrolling through Kaggle once again, looking for an interesting dataset on which to practice data cleansing, exploratory analysis, and predictive analytics. Within a few minutes I discovered the Bulk BookStore Dataset and decided it would be interesting to work with. I set out to answer the following questions:

  • What is the distribution of the numerical variables (Prices, Pages, Weights & Cases)?
  • What is the distribution of the numerical variables by format?
  • What is the frequency of the formats?
  • What is the frequency of books by language?
  • How many genres are represented?
  • In which years were the books published?
  • What is the count of books by language and format?
  • What does text mining of the book descriptions reveal?
  • What is the relationship between Price and the other numerical variables?
  • Can the price of books be predicted from numerical and categorical explanatory variables using basic regression?
  • Basic model assessment and selection

Where is the data sourced from?

The data can be accessed via this URL: https://www.kaggle.com/yamqwe/bulk-bookstore-dataset

How was it collected?

The data was collected by crawlfeeds.com through web scraping.

What is the size of the data set?

The data set contains 1000 observations (books).

What are the number and types of variables in the data set?

There are 20 variables in the data set, consisting of quantitative and categorical data relating to each book's description, publisher, price, weight, and format.

R Programming

For the more technical readers who would like to look at the source code for this analysis, feel free to access it via my Kaggle notebook: https://www.kaggle.com/selvynallotey/bulk-bookstore-eda-basic-regression

I engineered two new variables to help with my analysis of this dataset. These include:

  • Genre
  • Date_Published

The cleaning of the data is also included in the Kaggle notebook link I referenced above.
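The full cleaning script lives in the Kaggle notebook, but the idea behind the two engineered variables can be sketched in a few lines of base R. The column names (`breadcrumbs`, `published`) and string formats below are illustrative assumptions, not the dataset's actual schema:

```r
# Hypothetical sketch: derive Genre and Date_Published from raw string columns.
books <- data.frame(
  breadcrumbs = c("Books > Fiction > Thrillers", "Books > Juvenile Fiction"),
  published   = c("Published on 03/15/2010", "Published on 07/01/2015"),
  stringsAsFactors = FALSE
)

# Take the second breadcrumb level as the genre
books$Genre <- sapply(strsplit(books$breadcrumbs, " > ", fixed = TRUE), `[`, 2)

# Pull out the date string and parse it into a Date
books$Date_Published <- as.Date(sub(".*on ", "", books$published),
                                format = "%m/%d/%Y")
```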

What is the distribution of Numerical Variables (Prices, Pages, Weights & Cases)?

Figure 1.1


Looking at the histograms, most of the numerical variables appear to be positively skewed, i.e. the distributions are asymmetrical. Most book prices fall between roughly $7 and $20, and most books run between 250 and 500 pages.
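The histogram check in Figure 1.1 can be reproduced on toy data; a right-skewed variable like price shows its mean pulled above its median. The lognormal prices below are simulated, not the dataset's values:

```r
# Toy sketch of the distribution check: draw right-skewed prices and plot them.
set.seed(1)
prices <- rlnorm(1000, meanlog = log(12), sdlog = 0.4)  # simulated prices
hist(prices, breaks = 30, main = "Price distribution", xlab = "Price ($)")

# Positive skew shows up as mean > median
mean(prices) > median(prices)
```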

Figure 1.2


The board book and hardcover sample sizes are small compared with the paperback format. The paperback box plot for price also shows some left skewness, and the hardcover box plot shows more price outliers than the other formats. (See Figure 1.2)

Figure 1.3


These are the remaining box plots of the numerical variables by format. The weight variable shows a significant number of outliers within the hardcover format. The medians of the formats differ noticeably, and the plot also highlights the spread within each format. (See Figure 1.3)
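A by-format box plot like those in Figures 1.2 and 1.3 takes one line in base R. The format proportions below roughly mirror the dataset, but the price values are simulated:

```r
# Toy sketch of price-by-format box plots.
set.seed(2)
fmt   <- rep(c("Paperback", "Hardcover", "Board Book"),
             times = c(600, 300, 100))
price <- c(rlnorm(600, log(11), 0.3),   # paperbacks: cheaper, tighter spread
           rlnorm(300, log(20), 0.4),   # hardcovers: pricier, more outliers
           rlnorm(100, log(8),  0.2))   # board books: small sample

boxplot(price ~ fmt, ylab = "Price ($)", main = "Price by format")
tapply(price, fmt, median)   # compare medians across formats
```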

Figure 1.4


Paperback books dominate the book store, accounting for over 600 of the books stocked, with hardcover books coming in second. I would assume the store has chosen to focus on paperbacks because they are much cheaper than hardcover or board books, so it can sell at high volume rather than optimize the price of hardcovers. (See Figure 1.4)

Figure 1.5


According to the dataset, only two languages appear: English and Spanish, with English books making up nearly the entire dataset and Spanish books a negligible share. (See Figure 1.5)

Figure 1.6


The Genre variable is one I engineered from the breadcrumbs variable. There were 47 genres in total after engineering the variable; however, I sliced that down to the top 10 genres most represented in the dataset. Fiction and Juvenile Fiction came first and second, together accounting for more than half of the books in the dataset. It is safe to say this book store caters heavily to a middle-school audience. (See Figure 1.6)
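Slicing a frequency table down to a top-10 is a one-liner in base R; the toy genre vector below is illustrative, not the dataset:

```r
# Count genres and keep the most frequent ten.
genres <- c(rep("Fiction", 5), rep("Juvenile Fiction", 4),
            rep("History", 2), "Cooking")

top <- head(sort(table(genres), decreasing = TRUE), 10)
top   # named counts, largest first
```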

Figure 1.7


The raw description variable was also provided in the dataset, and I cleaned it to eliminate the HTML tags from the column. What I did not attempt was removing the generic words that dominated the word cloud, words that provide no insight into the themes or story lines of the books. As a result, most of the dominant words were generic terms relating to the sale of the books rather than their stories. I intend to eliminate these in later versions of the code to provide a more insightful text-mining analysis.
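Both clean-up steps, stripping tags and filtering generic words, can be sketched in base R. The description string and the word list here are made up for illustration; my actual stop-word list would need to be built from the dataset:

```r
# Strip HTML tags with a regex, then drop generic sales words before counting.
desc  <- "<p>This <b>bestselling</b> book ships fast. A gripping story of survival.</p>"
clean <- gsub("<[^>]+>", "", desc)                      # remove HTML tags

words <- tolower(unlist(strsplit(clean, "[^A-Za-z]+")))
words <- words[nchar(words) > 0]

generic <- c("this", "book", "ships", "fast", "a", "of", "bestselling")
words[!words %in% generic]    # only the story-related words remain
```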

It was important for me to try to discover some relationship between variables. I decided to test for a linear relationship between Price and the other numerical variables as that appeared to be the only relevant analysis which would be useful to the book store owner.

Figure 1.8


I plotted scatter plots of Weight vs Price, Cases vs Price, and Pages vs Price. Pages showed only a weak relationship with Price, with a correlation of 0.28. Cases vs Price showed a moderate negative relationship, with a correlation of -0.61. Finally, Weight vs Price showed a strong positive relationship, with a correlation of 0.73; the points sit close to a straight line, which suggests the Pearson correlation is an appropriate measure of the strength of the relationship between Weight and Price.
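The scatter-plot-plus-correlation check is simple to reproduce; the data below is simulated with a built-in positive relationship, so the correlation it produces is not the dataset's 0.73:

```r
# Toy sketch of the Weight vs Price relationship check.
set.seed(42)
weight <- runif(200, 0.2, 2)
price  <- 5 + 8 * weight + rnorm(200, sd = 2)   # simulated linear relationship

plot(weight, price, xlab = "Weight", ylab = "Price ($)")
cor(weight, price)   # Pearson correlation; strong and positive here
```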

Using the moderndive package in R, I built a very basic regression model to predict book prices using weight as an explanatory variable.

Table 1.1


The weight variable was right skewed so I decided to use log10(Weight) for my linear regression model to have a much more symmetrical variable and did the same for price by transforming it to log10(Price).

Model 1: lm(log10(Price) ~ log10(Weight))

Interpretation

The table shows that for every one-unit increase in log10(Weight) there is, on average, a 0.604 increase in log10(Price); roughly speaking, a 1% increase in weight is associated with about a 0.6% increase in price.

Thus the equation of the regression line in table 1.1 follows:

log10_Price_hat = 0.52 + 0.604 * log10_Weight

The intercept 0.52 is the mean log10(Price) when log10(Weight) = 0, i.e. for a book weighing exactly one unit. It has no interpretation at a weight of 0, since log10(0) is undefined.
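The fitted equation can be turned into a small prediction function, and refitting the same model shape on simulated data shows the slope being recovered. The coefficients 0.52 and 0.604 are the article's; the data below is simulated:

```r
# Predict price on the dollar scale by undoing the log10 transform.
predict_price <- function(weight) 10^(0.52 + 0.604 * log10(weight))
predict_price(1)   # at weight 1, log10(weight) = 0, so price = 10^0.52

# Simulate data obeying the fitted equation and refit Model 1.
set.seed(7)
w <- rlnorm(300, meanlog = 0, sdlog = 0.5)
p <- 10^(0.52 + 0.604 * log10(w) + rnorm(300, sd = 0.1))

m <- lm(log10(p) ~ log10(w))
coef(m)   # slope lands near 0.604
```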

Furthermore, I also decided to make a different model using the Format variable as a categorical explanatory variable.

Table 1.2


Model 2: lm(log10_Price ~ Format)

Interpretation

  • The intercept corresponds to the mean log10_Price of board books, 0.853.
  • Format: Hardcover corresponds to hardcover books; the value +0.438 is the difference in mean log10_Price relative to board books. The mean log10_Price of hardcover books is therefore 0.853 + 0.438 = 1.291.
  • Format: Paperback corresponds to paperback books; the value +0.247 is the difference in mean log10_Price relative to board books. The mean log10_Price of paperback books is therefore 0.853 + 0.247 = 1.1.
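The arithmetic in the interpretation above is just baseline-plus-offset, which is easy to verify. The values are the article's coefficients from Table 1.2:

```r
# Baseline (board book) intercept plus each offset gives that format's
# mean log10 price.
intercept <- 0.853
offsets   <- c(Hardcover = 0.438, Paperback = 0.247)

intercept + offsets   # Hardcover 1.291, Paperback 1.100
```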

Lastly, to assess the models, I used the sum of squared residuals and R-squared to compare them and make a selection.

I run these assessments in the R source code for my analysis which is publicly available on Kaggle.

Sum of Squared Residuals (Model 1)

The sum of squared residuals for Model 1 was 15.8.

Sum of Squared Residuals (Model 2)

The sum of squared residuals for Model 2 was 26.3.

R-Squared (Model 1)

The R-squared for Model 1 was 0.558.

R-Squared (Model 2)

The R-squared for Model 2 was 0.262.

Model Selection

Based on the sum of squared residuals, the first model is more appropriate for predictions, as it has the lower sum of squared residuals.

Based on the R-squared values, Model 1 also accounts for more of the variation than Model 2, explaining more than 50% of it.
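Both assessment metrics come straight out of a fitted `lm` object; the sketch below shows where they live, on simulated data rather than the bookstore dataset:

```r
# How the two assessment metrics are computed from a fitted model.
set.seed(3)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
m <- lm(y ~ x)

ssr <- sum(residuals(m)^2)    # sum of squared residuals (lower is better)
rsq <- summary(m)$r.squared   # proportion of variance explained (higher is better)

c(SSR = ssr, R2 = rsq)
```

For a model with an intercept, R-squared is exactly 1 minus the SSR divided by the total sum of squares, which is why the two metrics agreed on the same model here.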

The Fiction genre was the most popular in the dataset, and it is likely that book prices can be predicted from book weight.
