Python design pattern for writing scalable data-wrangling pipelines
Some of the beauties of Python are its flexibility and simplicity. However, these abilities are a double-edged sword. If you do not put the work in early on, to design reusable, manageable, and testable code, you will run into progress issues as your codebase scales.
When using Python’s Pandas module, it’s easy to move away from an object-oriented style of coding. A common pitfall is to write quick code, that becomes hard to test and messy to scale.
This post shows a design pattern for reusable and low-maintenance code when data-wrangling with Pandas.
🪖 Objectives of Post
- Build a metrics pipeline with Pandas
- Refactor pipeline to be easily extendable and testable
🐍Codebase found here.
🦄 Dataset
We’ll be using a free data set from Kaggle containing: “A complete list of unicorn companies in the world.”
Context
“A unicorn company, or unicorn startup, is a private company with a valuation over $1 billion. As of March 2022, there are 1,000 unicorns around the world. Popular former unicorns include Airbnb, Facebook and Google. Variants include a decacorn, valued at over $10 billion, and a hectocorn, valued at over $100 billion.”
📊Kaggle Unicorn Companies dataset found here.
🧱 Build Metrics Pipeline
We will build 3 tables of Unicorn statistics using Pandas.
- Country-level metrics
- Country-level time series metrics
- Investor metrics
🏆 Winning Combination of Pandas Methods
We will be using a combination of Pandas methods that breeze data manipulation and keep our pipeline looking clean.
pandas.DataFrame.explode
orpandas.Dataframe.melt
pandas.DataFrame.groupby
pandas.DataFrame.reset_index
pandas.merge
🌍 Country Level Stats
Firstly, we will calculate the count of Unicorns per country and the average Unicorn valuation (in billions) per country.
country company mean_valuations_billion_usd max_valuations_billion_usd
0 Argentina 1 2.450000 2.45
1 Australia 6 8.433333 40.00
2 Austria 2 3.805000 4.11
3 Bahamas 1 32.000000 32.00
4 Belgium 3 2.983333 5.25
output
📈 Example Plot using country_stats Output Table
The plot below is one example of a plot made with our country_stats
table. We can quickly see the US leading the world in the total number of Unicorn companies.
⏳ Country Level Time Series
For these metrics, we group by country and date_joined columns, to count the number of Unicorns over time and sum the valuations.
💡
Note I previously sorted the dataframe by date_joined.
country founded_year company valuation_billion_usd
0 Argentina 2017-01-01 1 2.45
1 Australia 2004-01-01 1 1.60
2 Australia 2011-01-01 2 2.50
3 Australia 2012-01-01 1 40.00
4 Australia 2015-01-01 2 6.50
output
➕ Cumulative Time Series
So far we have only generated time-series metrics at the point in time. However, it’s easier on the eye, to view the cumulative sum over time.
These steps take the generated time_series
table and use an expanding window to calculate a cumulative sum.
country founded_year company valuation_billion_usd company_cumsum valuation_cumsum
0 Argentina 2017-01-01 1 2.45 1.0 2.45
1 Australia 2004-01-01 1 1.60 1.0 1.60
2 Australia 2011-01-01 2 2.50 3.0 4.10
3 Australia 2012-01-01 1 40.00 4.0 44.10
4 Australia 2015-01-01 2 6.50 6.0 50.60
output
📈 Example Plot Using time_series Output Table
The time series plot below is made with our cumulative results, for the number of Unicorns per country. We can see since 2020–2021 the US has reached a trajectory for producing Unicorns that could not be matched by China. Whilst, the India and UK may be just beginning their growth stages.
🧑💼 Investor Stats
Generating investor metrics is more complex. Each company’s investors are stored as a comma-separated string.
For example, “Avant” have select_investors “RRE Ventures, Tiger Global, August Capital”.
We want to reuse the same code format, as with the country-level metrics, to make use of the pandas.DataFrame.groupby
method. This will help us refactor later on.
company select_investor
215 Otto Bock HealthCare EQT Partners
390 Avant RRE Ventures, Tiger Global, August Capital
520 Promasidor Holdings IFC, Ajinomoto
609 Five Star Business Finance Sequoia Capital India, Tiger Global Management...
774 Radius Payment Solutions Inflexion Private Equity
example output
🏗 Un-pivoting Investors
Un-pivoting is key to this design pattern, as we want to make use of the groupby
method on individual investors.
Using pandas.DataFrame.explode
, we generate an additional column for individual investors. Note that we now have multiple rows per company in our table.
💡Here I have use explode
to un-pivot. Another method to check out is melt
.
company select_investors_single
0 Bytedance Sequoia Capital China
1 Bytedance SIG Asia Investments
2 Bytedance Sina Weibo
3 Bytedance Softbank Group
4 SpaceX Founders Fund
output
The next step is to generate simple investor stats, company count, and valuation of the companies, in each investor’s portfolio.
📈 Example Plot using investors_stats
The histogram below shows the distribution of investors by the total number of Unicorns in their portfolio. We see a power-law type distribution where most investors have only 1 unicorn, whilst few have invested in many. This type of distribution can also be found in populational monetary wealth and social networks.
🔧 Pipeline so far
So far we have a metrics pipeline that looks fairly neat, but we are only generating a total of 8 metrics. If we were to extend this to 20–30 metrics, our script would start to see a lot of repetition.
The format of our code up to now is a simple python script. Thus our code cannot be isolated and unit-tested.
Our only testing option is to run the entire script and assess the output, in an end-to-end style test. This is not great as it could take a long time to run.
🚧 Pipeline refactor
🚪Open-close Principle
“Open for extension but closed for modification.”
Let's refactor our code to follow the open-close principle as best as possible.
- Move metrics functions to a Metrics class and enlist the use
pandas.DataFrame.apply
method. - Remove the repeat calls to
pandas.DataFrame.groupby
, with agenerate_metrics
function, using Python’s built-ingetattr
function. - Create a metrics config file, which is passed to our
generate_metrics
function, with the meta-data required to generate our metrics.
💡Check out SOLID design principles here.
🧑🎓 Metrics Class
By shifting our metrics to a class, we can isolate the metrics, and build unit tests for each metric. By using the Pandas.DataFrame.apply method, we can add personalized metrics, and leverage other python packages that are not included in Pandas.
Following the open-close principle, if we wanted to add metrics we would create a new class that inherits our class Metrics.
📂 Config File
The config file has a list of metrics for each table we want to generate. If we want to add or remove metrics or change naming etc, we simply change the config file. This way we are not editing our codebase itself.
⚙️ Generate Metrics Functions
This function takes in our Unicorn data, an instance of our Metrics class, and our metrics config, and returns a metrics data frame.
Steps:
- Uses
getattr
to create apandas.DataFrame.groupby.apply
object - Uses
getattr
to create a Metrics class method object (e.g.Metrics.count
) - Calls the
pandas.DataFrame.groupby.apply
an object passing the Metrics method object
🪄 Refactored Pipeline
Finally, we arrive at our refactored pipeline. We can easily add metrics to existing tables by defining new metrics classes and adding them to our config file.
✅ 🔀 Benefits and Tradeoffs
Benefits
- The code is modular and more manageable for scaling up our metrics.
- A config file gives us more flexibility to add and takeout metrics without touching co
- Easier to test the functionality of our code as it has been isolated.
Tradeoffs
- More code to maintain for a small pipeline.
- The readability of our code has been reduced.
- Harder to debug for a small pipeline.
📚 Resources
🐍 Codebase found here.
📊 Kaggle Unicorn Companies dataset found here.
Third-party libraries:
pandas.DataFrame.explode
pandas.DataFrame.groupby
pandas.DataFrame.reset_index
pandas.merge
- Python’s built-in
getattr
#physics #nasa #space #facts #universe #knowledge #dailyfacts #biology #factz #chemistry #astronomy #education #earth #memes #cosmos #amazing #nature #allfacts #tech #innovation #astrophysics #code #ui #html #design #programming #userinterface #appdesign #digitaldesign #dribbble #uxdesign #webdesign #websitedesign #developer #celafarai #frasedelgiorno #webdeveloper #graphicdesign #marketing #instagram #dev #aforisma #frasi #softwareengineering #pensierodelgiorno #frontend #programmer #radio #pensiero #wireframe #clubdelcomic
#userinterface #appdesign #digitaldesign #dribbble #uxdesign #candivit #demetozdemir #canem