Using Python Polars — My (Late) 2024 Review
About a year ago, I heard about Polars, a new data processing Python package written in Rust that is much faster than Pandas. Back then, Polars still felt like it was in beta (this was before version 1.x), so I waited. I finally started using Polars a couple of weeks ago, and here are some reflections as of late September 2024.
Different syntax
Polars feels both similar to and different from Pandas. It definitely forces me to frame my processes differently, and in some ways that slows me down for now. Polars seems to want many things written out explicitly, although I have also seen people who find Pandas' syntax confusing, so that explicitness may suit them. Either way, switching from Pandas to Polars requires some adjustment in syntax.
Faster runtime
Polars' code execution feels faster than Pandas'. Perhaps because of a mix of being unfamiliar with Polars and its faster speed, I am doing more "unit tests" that re-run my entire process on every change I make. For example, for a set of housing price scripts, I always recompute the maximum resale price after any code change to ensure that I am not adding unknown side effects to my Polars code. Because my Polars code runs faster, I am more willing to re-run it often, and this may help me write more robust analytics code in general.
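As a rough illustration, this is the kind of quick sanity check I mean. It is a minimal sketch, and the file and column names ("resale_prices.csv", "resale_price") are made up for the example, not my actual dataset.

```python
import polars as pl

# Quick sanity check re-run after every code change.
# File and column names are illustrative only.
df = pl.read_csv("resale_prices.csv")

max_price = df.select(pl.col("resale_price").max()).item()
print(f"Max resale price: {max_price}")

# Fail fast if a change quietly breaks the pipeline.
assert max_price > 0, "Max resale price looks wrong - check the upstream steps"
```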
I am starting to think about splitting my data extraction and processing steps, especially since I make API calls to get the latest data for my work, and making unnecessary API calls does slow down my workflow. I am also starting to shift my working IDE from JupyterLab to Quarto-Neovim, which helps with my continued investment in vim motions!
PS: I did have a use case where I was dealing with an API and performing some database backfill, and I decided to use Jupyter notebooks because it was just easier to check that my backfills were correct while keeping my connection state up.
More differences with Pandas
Polars cannot mix data types in a single column, while Pandas can. Relying on this may be a bad habit, but it allows some simple logic treatments. For example, in Pandas I can combine "lease_left" and "freehold" into a single column, mixing a numeric and a string type together. This cannot be done in Polars, which forces me to separate my filters for freehold and lease_left, as in the sketch below.
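A minimal sketch of the difference, using made-up values: in Pandas an object column happily holds both numbers and strings, while in Polars the same information has to live in a consistently typed column or be split into separate columns.

```python
import pandas as pd
import polars as pl

# Pandas: an object column can hold numbers and strings side by side.
pd_df = pd.DataFrame({"tenure": [70, "freehold", 55]})  # mixed int/str is allowed

# Polars: one dtype per column, so the same data is either cast to strings...
pl_as_str = pl.DataFrame({"tenure": ["70", "freehold", "55"]})

# ...or split into two columns, with filters written separately for each.
pl_split = pl.DataFrame({
    "lease_left": [70, None, 55],
    "is_freehold": [False, True, False],
})
leasehold = pl_split.filter(pl.col("lease_left") > 60)  # nulls drop out of the filter
freehold = pl_split.filter(pl.col("is_freehold"))
```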
To my understanding, I also cannot use list comprehensions in Polars to process columns. I fell in love with them in Pandas; I felt list comprehensions were much more readable than lambda functions and .apply() methods.
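For example, here is a minimal sketch with made-up column names: the Pandas version uses a list comprehension, while the Polars equivalent is written as a when/then/otherwise expression instead.

```python
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"resale_price": [450_000, 1_200_000]})
# Pandas: a list comprehension reads naturally for row-wise logic.
pd_df["million_dollar_flag"] = [
    "yes" if price >= 1_000_000 else "no" for price in pd_df["resale_price"]
]

pl_df = pl.DataFrame({"resale_price": [450_000, 1_200_000]})
# Polars: the same logic becomes a column expression.
pl_df = pl_df.with_columns(
    pl.when(pl.col("resale_price") >= 1_000_000)
    .then(pl.lit("yes"))
    .otherwise(pl.lit("no"))
    .alias("million_dollar_flag")
)
```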
Polars has been in rapid development up to version 1.4.0, so quite a bit of Stack Overflow information is outdated, and I have to constantly go to the official Polars documentation, which at times isn't as clear as I would like.
I just started using LazyFrames, so I am still evaluating their speed. However, LazyFrames definitely add more programming overhead for me, and their errors are harder to debug because they only surface when an entire chunk of code is collected. I am using ChatGPT to optimise my Polars code, but this has been a hit-and-miss experience so far. Knowing the right questions to ask ChatGPT is definitely important.
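For context, here is a minimal LazyFrame sketch (the file and column names are hypothetical): nothing is executed, and no errors are raised, until .collect() runs the whole query at once.

```python
import polars as pl

# Lazy pipeline: nothing runs (and no errors surface) until .collect() is called.
lazy = (
    pl.scan_csv("resale_prices.csv")
    .filter(pl.col("flat_type") == "4 ROOM")
    .group_by("town")
    .agg(pl.col("resale_price").max().alias("max_resale_price"))
)

df = lazy.collect()  # the query is optimised and executed here, in one go
```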
Installation issues
I had some initial issues installing Polars on my MacBook with an M1 chip, although I eventually solved them. The fix was to install "polars-lts-cpu" (pip install polars-lts-cpu) instead of the standard package. This took me quite a bit of time, and I almost gave up trying to use Polars.
My Situational Takes
- Would I use Polars? Strong YES! As mentioned, I am using it with my new workflow that uses Neovim-Quarto. The runtime speed is really amazing, and I cannot wait to deal with datasets that run into the gigabytes when they come.
- Would I revamp a large Pandas codebase into Polars? Soft no, especially if Pandas has been working well. But if Pandas is reaching its limit for you, I can see why one would suggest shifting from Pandas to Polars. The migration may not be that intuitive, and it will take some time. There are options to convert Pandas dataframes to Polars dataframes and back (see the sketch after this list), so piecewise refactoring is possible. However, I am not sure it makes sense to have two packages running concurrently in a production codebase.
- If I am a newbie, should I learn Pandas or Polars first? Depends! With tools like GenAI, I feel the importance of writing analytics code may diminish, but the important concepts about creating impact from data will remain. I reckon there are still more analytics codebases in companies that exist in Pandas, so learning Pandas first may help you land a job (a very strong maybe, because I still feel SQL is the more important factor). At the end of the day, Pandas and Polars are both useful tools in an analytics professional's toolbox, and knowing both of them, even at a rudimentary level, would be helpful in many ways.
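On the conversion point above, here is a minimal sketch of piecewise refactoring with made-up data: pl.from_pandas() brings a Pandas dataframe into Polars, and .to_pandas() hands the result back to downstream Pandas code (both conversions expect pyarrow to be installed).

```python
import pandas as pd
import polars as pl

pd_df = pd.DataFrame({"town": ["Bedok", "Punggol"], "resale_price": [550_000, 620_000]})

# Pandas -> Polars: refactor one step of the pipeline at a time.
pl_df = pl.from_pandas(pd_df)
pl_df = pl_df.with_columns((pl.col("resale_price") / 1_000).alias("price_in_thousands"))

# Polars -> Pandas: hand the result back to untouched Pandas code downstream.
back_to_pd = pl_df.to_pandas()
```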
Thanks to everyone who has read this post. If you are interested in analytics side projects with a social science spin, follow me on Medium or LinkedIn. Some topics I have explored include (1) Singapore housing prices and the updated Singapore million dollar public home analysis, (2) accessibility of Singapore hotels, (3) Taiwan housing prices, and (4) I even built a small web app for Singaporeans to track the library books they want to borrow. I also share less technical topics, like (5) how I learned to deal with uncertainty and (6) how I ended up being a freelance analytics consultant.
Lastly, I also have a Substack (it is still alive), where I share ideas on data concepts and strategies targeted at busy business people.