Why I tried scraping prices of books I read — a non-technical sharing
I love reading books, especially the free books that I can borrow from community libraries! Beyond cost efficiencies, I like how the library gives me plenty of books and topics to explore. In fact, I even wrote a Python script to scrape the books I bookmarked on the Singapore National Library Board's (NLB) website, to get a single view of their availability (see my Medium posts about this here and here). Recently, I also applied for the NLB API service, and I have a few ideas on how I can use their API.
While waiting for my API access, I thought of a small data project with this list of books that I have read. The aim of this post is to show a broader non-technical audience the thought processes behind a proof-of-concept (POC) data project, so that they can appreciate the nuances that go into executing a small-scale data project. For those interested, I will also be sharing my Github repo link below.
The Project
Since 2011, I have been recording the books that I have read.
My list currently has more than 200 books, and while updating it recently, I got curious about how much all these books actually cost, and indirectly, how much value I am gaining from NLB! Putting it into a more structured problem statement,
How much dollar savings have I gained from the library books that I have read?
So, I just need to get book retail prices, right? Here is where the fun begins.
The list
200+ books is neither too large nor too small a list. I could just copy and paste retail prices from a bookseller's website into a spreadsheet and do my calculations from there. Assuming it takes 30 seconds to get the information for one book, doing so for about 240 books would take me around 120 minutes (2 hours) to complete. Tedious, doable, but also extremely boring and definitely not worthy of a Medium post. Later in this post, I will share that I took a total of 3 days to complete this project using a scripting method instead. However, I did want to test a more scalable approach that I could potentially structure as a "service", to allow others to understand their dollar savings if they had such a book list too. And honestly, writing a post about copying and pasting data would be quite dull.
Fetching book prices — API
When necessary and possible, I prefer to get my external data from APIs rather than by web scraping. Platforms take the time to set up the infrastructure to provide their APIs, so APIs usually have better data availability. APIs are a way to allow useful programmatic access to information that platforms are willing to share, without choking up the user experience for everyone else using the site like a normal human. Some platforms even have defensive mechanisms to stop bots from scraping data off their site.
Unfortunately, I couldn't find any API on book prices that I could use (I tried Amazon and AbeBooks). This led me to the other method: web scraping.
Fetching book prices — Web scraping
There are many paid web scraping services I could have turned to, since I couldn't land an API and still wanted a consistent source of data to update my book list. Of course, for a quick weekend side project, I wasn't willing to fork out any money for a paid web scraping service.
This left me with writing my own web scraper for this project. While this isn't my first time writing a web scraper, writing web scrapers is always a pain. You have to figure out how the web elements of the site work, and test the scraper's interactions with the site if the site uses JavaScript. Sometimes the web elements (XPath / CSS selectors) are not consistent across different products or different parts of the site.
If this eventually becomes a longer-term project, the site's user interface may change as well, which means the scraper has to be updated accordingly. As mentioned, some sites even have active web scraping blockers, with mechanisms like reCAPTCHA to block bot behaviour. In short, web scrapers are troublesome to write and maintain, and are always my tool of last resort.
Unfortunately, this project seemed to require me to go down that rabbit hole of writing my own web scraper. After some quick searching, I chose to investigate Amazon further as a potential site to scrape, based solely on the wide range of books available on its website. Some sites, like Book Depository, didn't have data on the more obscure books that I have read. amazon.sg also gave me book prices in Singapore dollars, which was what I wanted. Some quick testing of my scraper also suggested that Amazon had no direct web scraping blockers, as I was able to get the data for my first few titles easily. I did include time.sleep() to reduce the speed of my scraper, and at least I didn't hit any speed roadblocks from the site.
Eventually, I decided that to move the project forward, I had to write my own little scraper after all.
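For the curious, here is a minimal sketch of what such a scraper can look like. This is not my exact scraper: the search URL, headers, and CSS selectors are illustrative assumptions and may not match Amazon's current markup (or what my repo actually uses).

```python
import time
import requests
from bs4 import BeautifulSoup

# Illustrative only: Amazon's markup changes often, and these selectors
# may not match what my actual scraper used.
SEARCH_URL = "https://www.amazon.sg/s"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # look like a normal browser

def fetch_first_result(title):
    """Search for a book title and return the (scraped title, price) of the first result."""
    resp = requests.get(SEARCH_URL, params={"k": title}, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    result = soup.select_one("div[data-component-type='s-search-result']")
    if result is None:
        return None
    scraped_title = result.select_one("h2").get_text(strip=True)
    price_tag = result.select_one("span.a-price span.a-offscreen")
    price = price_tag.get_text(strip=True) if price_tag else None
    return scraped_title, price

for book in ["The Emperor's Handbook", "Thinking, Fast and Slow"]:
    print(book, "->", fetch_first_result(book))
    time.sleep(5)  # slow the scraper down and be polite to the site
```

Even a toy version like this needs care around missing results and inconsistent page layouts, which is exactly the pain described above.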
Cleaning Up After My Scraper
As expected, my web scraper presented a number of problems, some of which I had not anticipated to begin with. Firstly, data quality: I didn't accurately record some of my book titles. It wasn't as if I had the intention of doing this project years ago. This meant some of my scraped results didn't match the book titles that I fed my scraper with.
To overcome this, I did a text similarity comparison between my input title and the scraped title, and filtered for comparisons that gave low scores. Unfortunately, this still required me to spend quite a bit of time reviewing the scores, many of which were low but still reflected the correct books after all.
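As a rough illustration, Python's built-in difflib can produce this kind of score; my actual code may differ, and the 0.6 threshold below is just an assumption for deciding which matches to review.

```python
from difflib import SequenceMatcher

def title_similarity(input_title, scraped_title):
    """Return a rough 0-1 similarity score between the title I recorded and the scraped one."""
    return SequenceMatcher(None, input_title.lower(), scraped_title.lower()).ratio()

score = title_similarity("Emperors Handbook", "The Emperor's Handbook: A New Translation")
if score < 0.6:  # illustrative threshold: flag low-scoring matches for manual review
    print(f"Low similarity ({score:.2f}), review this match manually")
```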
Secondly, Amazon also periodically gave me weird results. For example, when my bot searched for the book "The Emperor's Handbook", I was fed an SGD 0 price for the audiobook, even though I would rather have taken the hardcover price instead. However, to my knowledge, only this book faced this issue, and I wasn't sure why.
One possible reason is that Amazon knew I was scraping their site from my digital signature and was purposefully obfuscating my data collection process. When I did the same search for the same book the next day (to print an example screenshot), Amazon gave me some really reasonable results. Nonetheless, this is just a hunch, and I didn't want to spend too much time investigating it. I decided to just drop this data point from my original input list and move on.
In the middle of my project, I wondered if I could get more consistent results by adding "paperback" to all my book searches on Amazon, so that I would get the paperback prices of all my books. It sounded ingenious when I was thinking about it, and my manual random sampling threw up very promising results. However, when I tried to scale this into my web scraper for all books, certain titles returned prices from the book "The Boy in the Striped Pajamas".
Although I thought of a few remedies for this issue, all of them felt too ad hoc, and I decided to drop the term "paperback" from my web scraper.
Finally, the Analysis!
Eventually, after 3 days of hard work, I finally fetched the retail prices of 230 books, and they added up to SGD 7,182.50, or an average of SGD 31.23 per book. This also translates to an average of SGD 652.95 worth of books per year from my NLB reads since 2011!
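The sums themselves are simple. Assuming the cleaned results sit in a CSV with a price column (the file and column names here are hypothetical), the figures above come from something like this:

```python
import pandas as pd

prices = pd.read_csv("cleaned_book_prices.csv")  # hypothetical file of the 230 cleaned results

total = prices["price_sgd"].sum()      # ~SGD 7,182.50
per_book = prices["price_sgd"].mean()  # ~SGD 31.23
per_year = total / 11                  # 2011 to 2021 inclusive, ~SGD 652.95

print(f"Total: {total:.2f}, per book: {per_book:.2f}, per year: {per_year:.2f}")
```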
Below is an interactive graph created using Plotly and uploaded to Datapane to embed into this post.
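For those wondering what that involves, here is a minimal sketch, again assuming a hypothetical CSV of cleaned prices; I am not reproducing my exact chart, and the final upload call depends on the Datapane version installed.

```python
import pandas as pd
import plotly.express as px

prices = pd.read_csv("cleaned_book_prices.csv")  # hypothetical cleaned results

# An interactive histogram of the scraped prices; my actual chart may differ.
fig = px.histogram(prices, x="price_sgd", nbins=30,
                   title="Distribution of scraped book prices (SGD)")
fig.show()

# The figure can then be uploaded to Datapane (datapane.com) to get an
# embeddable link for Medium; the exact upload call varies by library version.
```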
Some thoughts
- Looking at the prices of some of these books, I question the accuracy of my analysis too! Nonetheless, I do have a ballpark figure now with my POC, and this could potentially be used to spark some conversations with my stakeholders on possible next steps. If my stakeholders do find this project valuable, we can then brainstorm ways to improve it. As the saying goes in tech, "Done is Better Than Perfect".
- Improving my analysis requires improving my data quality, which means improving my data collection process. Firstly, I would choose to use an API over a web scraper if possible. But since such a setup takes time, a quick project like this can help validate whether I should invest any more effort into it. If my stakeholders are interested in pushing the project into a longer-term service, I will explain to them that I need more time to make the service more stable. I wouldn't mind applying for and setting up a premium seller account to get an Amazon or AbeBooks API. If that isn't feasible, I would consider paying for a premium web scraping service, just for that extra level of tech support to maintain the code. And if financial costs are involved, I may want to ensure that I have a revenue model that covers my web hosting and scraping costs.
- Domain knowledge is important. I didn't go into whether my book prices were for hardcovers, paperbacks, or audiobooks. Nor did I consider any seasonality in book prices. And I didn't care about the prices of different book editions, if there were any. It just felt like too much work for a weekend side project.
- I would love to have more data about my book list. For example, if I had noted when I read those books, I could run analyses across time periods and see if there was any seasonality in my data. It would also be interesting to add things like book genre, to see the types of books that I have read and how much these genres differ in cost. I am still waiting for access to an API that provides book classifications.
- The entire project took ~3 days, from conceptualisation to writing the code to doing up the analysis and this Medium post. Writing this post and adding some light code documentation took a day by itself! A huge bulk of the work (~2 days) went into checking and cleaning the scraped data, and despite my best efforts, I still had to remove some data points that were obviously erroneous. However, as I wanted to communicate my project's workflow and findings to a more non-technical audience, I knew I had to invest some time into writing everything out in prose that is as clear as possible.
Oh, and Yes, my Python Code!
Here is my Github repo. I littered my code with comments and random musings, so those keen to know my actual thought processes behind the code can head over there.
Final Words
The aim of this post was to try to bring more technical context to non-technical managers and stakeholders, so that they can understand the pains and issues their technical counterparts face. I believe we can have better societies if people are willing to understand other parties a little better. I only hope to do my best to bridge the gaps between technical and non-technical people, data and non-data people.
Thanks again for reading all the way to the end! I am happy to hear any feedback, comments, or questions that anyone wants to share with me.