Iterations for my NLB scraper (GitHub code provided)

Cliff Chew
Oct 17, 2021


It has been a long while since I wrote anything here, as I had a few things to attend to in my life. I did manage to make some improvements to my NLB flow that felt interesting enough to share, and large enough to deserve a separate post. The original post is here, where I wrote about the user problem I faced as a user of the NLB app, and the approach I used to solve it.

In my original post, I mentioned a few things lacking in my original solution. Firstly, as I tend to bookmark quite a lot of books (current count is slightly under 70), it took my scraper a good 30 minutes to capture all the data. This was because I had to slow down my scraper to ensure all the information I needed was loaded before I could scrape it and move on to the next book.

The slow speed was still workable (I could run the script in the background while I did other work), but I realised it didn't encourage me to run the scraper and update the information as often. This led to suboptimal situations where I was heading to a library without having updated my scraper first.

Improved scraper stability and code maintainability

The first change came when I found a different set of URLs to scrape that were more stable, required fewer interactions with the site, and provided more information about the books. From a speed perspective, the new URLs required longer time.sleep() durations, while the original script used shorter time.sleep() durations but had more of them littered throughout the code, because of the multiple interactions the original code needed to perform. In the end, the new code didn't feel obviously faster (I didn't benchmark both versions, because each felt like it took 30 minutes, so a proper comparison would have cost me an hour). However, needing fewer Selenium interactions meant a leaner codebase that was easier to debug and maintain! I felt the reduction in code complexity alone was a valid reason for the migration. I also cannot deny some sunk cost fallacy from not wanting to throw away code I had just written.
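To give a feel for the new flow, here is a minimal sketch of the per-book loop. The URL pattern, the CSS selector, and the book IDs are all placeholders I made up for illustration; the real values are in my GitHub code.

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholder URL pattern and IDs; the real ones are in the GitHub repo
BOOK_URL = "https://www.nlb.gov.sg/example-book-page/{book_id}"
bookmarked_ids = ["B123", "B456"]

driver = webdriver.Chrome()
records = []

for book_id in bookmarked_ids:
    driver.get(BOOK_URL.format(book_id=book_id))
    time.sleep(5)  # one longer wait per page, instead of many short ones
    # Each matching element holds one branch's availability for this book
    for row in driver.find_elements(By.CSS_SELECTOR, ".availability-row"):
        records.append((book_id, row.text))

driver.quit()
```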

Improved scraper’s speed

Secondly, I finally decided to add concurrency to my scraper. Not coming from a CS background, I have always been puzzled by multi-threading and concurrency techniques. It took a lot of Googling (a few hours of one night) and tinkering before I finally got a "workable" concurrency flow that makes multiple calls at once to fetch my bookmarked books. The concurrency approach brought my scraper's run time down to around 10 minutes with 3 workers (any more would make the flow wonky).
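For those curious, the pattern I ended up with looks roughly like the sketch below, using Python's ThreadPoolExecutor. Here scrape_book is a stand-in stub for the actual Selenium logic, so treat this as an illustration rather than my exact code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_book(book_id):
    # Stand-in stub for the actual Selenium scraping of one book's page
    return book_id, "Available at Bedok Public Library"

book_ids = ["B123", "B456", "B789"]  # placeholder bookmarked IDs
results = []

# Three workers was the sweet spot; any more made the flow unstable
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(scrape_book, b) for b in book_ids]
    for future in as_completed(futures):
        results.append(future.result())
```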

There are still a few bugs in my concurrency code that I couldn't figure out. Firstly, on Jupyter notebooks, once the concurrency cells are done, the notebook gets stuck in that cell. The code doesn't progress to the other cells, but the program doesn't get killed either; it just seems stuck in some limbo. When I first ran the script, I was totally caught off guard (I had left the code running in the background while I was coding other stuff), because the code seemed to both work and not work. Now, when I run this script, I still have to execute the entire notebook, let it "finish" its concurrency portion, kill off the script, and then re-run the parts after that cell. The script works, although some human intervention is needed. I suspect this problem may be resolved by running the code as a Python file instead of a notebook, but that is tech debt I can live with for now. My script technically "works"!

Streamlit — A more workable frontend for on-the-go

The other pain point of my NLB flow was that having the data in Google Sheets made it difficult to navigate the information on my phone when I was at the library looking for books. Basically, I needed a better frontend, and this is where Streamlit came to my rescue!

I had a dead project where I was serving data in table form using Streamlit, and I refactored the code from that project to serve my NLB data. That is why the project is called "j_learn", but I shall not explain further what my dead project was. By including a menu to filter the table by library, plus a title keyword search bar, I got an output table showing the books available at a given library that match a particular title keyword, roughly as sketched below. Piecing the web app together took less than two hours!
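A minimal sketch of that frontend, assuming a CSV of scraped availability data with made-up file and column names (the real schema is in my GitHub code):

```python
import pandas as pd
import streamlit as st

# Assumed file name and column names; the real schema is in the GitHub repo
df = pd.read_csv("nlb_books.csv")

library = st.selectbox("Library", ["All"] + sorted(df["library"].unique()))
keyword = st.text_input("Title keyword")

filtered = df
if library != "All":
    filtered = filtered[filtered["library"] == library]
if keyword:
    filtered = filtered[filtered["title"].str.contains(keyword, case=False)]

st.dataframe(filtered)  # the table I scroll through at the library
```

Running this with streamlit run serves it as a phone-friendly web page, which was exactly the point.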

As someone who is barely proficient with frontend code and JavaScript, Streamlit is really a godsend! For those interested, Streamlit recently raised its latest funding round and officially launched version 1.0. This isn't a sponsored article (I don't think any company would sponsor someone who averages fewer than 50 reads per article). I am a huge fan of Streamlit, and I feel it can benefit a lot of non-frontend Python coders who need to build basic, usable frontends with their Python skills!

This isn’t a perfect web app

My improvements gave me a faster and more stable scraper, and also a more usable web app to show my bookmarked books on the go. For a couple of hours of Python work, I was quite pleased with the results I got.

But this is far from a perfect app. The information shown on Streamlit still isn't real-time. I cannot press a button on my phone to update the information, as that would require doing the scraping on my phone (which didn't seem feasible). Neither could I figure out any backend API call to use (and even if I could, I wouldn't want to). This lag means someone could borrow a book I want between the time I run my web scraper and the time I am at the library searching for it, leaving me hunting for a non-existent book.

One possible solution is to have my web app serve each book with its corresponding NLB URL. So if my Streamlit app says a book is available (the app only shows available books) but I can't find it in the library, I can click the link to check the book's most up-to-date availability.

This would require capturing the URL data that I don't yet store. Some frontend work on Streamlit would also be needed to surface each book with its appropriate link, perhaps along the lines of the sketch below. Potentially a new feature that I want to include.
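Something like this could work, with a made-up url column standing in for the link data I have yet to capture:

```python
import pandas as pd
import streamlit as st

# Hypothetical rows with the yet-to-be-captured `url` column
available = pd.DataFrame({
    "title": ["Example Book"],
    "library": ["Bedok Public Library"],
    "url": ["https://www.nlb.gov.sg/example-book-page"],
})

# Render each available book as a clickable link to its NLB page
for _, row in available.iterrows():
    st.markdown(f"[{row['title']}]({row['url']}) at {row['library']}")
```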

Concluding statements

Originally, I had a bad user experience with an app. I wanted an easy way to source books from the public libraries in Singapore, but I couldn't find one. Hence, I pieced together some scripts to solve the user problem I was facing. While I shared my code online, I decided not to go with a line-by-line explanation of it; those keen can just read through my GitHub. Any questions or suggestions are welcome (I am NOT a CS major), but my response may be slow as I am running a few other projects.

I felt it was more important to share my thought process and how I went about solving a problem for myself. I am not sure I used the best approach, but I tried to adopt a lean product management approach: focusing on the user problem, quickly getting a workable prototype to the user, and iterating on the product as I got more feedback. For those confused, the user here is myself. I have no intention of monetising this "product"; I just wanted to solve my own problem, so I skipped the whole "product-market fit" analysis.

I do like my own product, and I am particularly happy that it solves a huge pain point for me, as I visit the public libraries on a weekly basis. But I am mindful that everyone loves their own babies a little too much.

Lastly, to NLB: if any of you guys stumble onto this, you are doing a great job. I suspect I could be your number one fan! For example, I really love the app feature that lets us borrow NLB books using our phone cameras. I really hope you can provide a feature to filter our bookmarked books by library location. If the NLB app had such a location filter, I wouldn't need to scrape you guys anymore.

I really love you NLB guys and gals!!!

Links for more content

  1. [GitHub] Updated NLB web scraping script
  2. [GitHub] Streamlit for NLB web app
  3. [Article] Streamlit Series B funding round
  4. [Article] Streamlit version 1.0
  5. [Medium Post] Original post on NLB web scraping flow
