Python design pattern for writing scalable data-wrangling pipelines

Some of the beauties of Python are its flexibility and simplicity. However, these abilities are a double-edged sword. If you do not put the work in early on, to design reusable, manageable, and testable code, you will run into progress issues as your codebase scales.

When using Python’s Pandas module, it’s easy to move away from an object-oriented style of coding. A common pitfall is to write quick code, that becomes hard to test and messy to scale.

This post shows a design pattern for reusable and low-maintenance code when data-wrangling with Pandas.

🪖 Objectives of Post

  1. Build a metrics pipeline with Pandas
  2. Refactor pipeline to be easily extendable and testable

 

🐍Codebase found here.

🦄 Dataset

We’ll be using a free data set from Kaggle containing: “A complete list of unicorn companies in the world.”

 

 

image

 

 

Context

“A unicorn company, or unicorn startup, is a private company with a valuation over $1 billion. As of March 2022, there are 1,000 unicorns around the world. Popular former unicorns include Airbnb, Facebook and Google. Variants include a decacorn, valued at over $10 billion, and a hectocorn, valued at over $100 billion.”

 

📊Kaggle Unicorn Companies dataset found here.

🧱 Build Metrics Pipeline

We will build 3 tables of Unicorn statistics using Pandas.

  1. Country-level metrics
  2. Country-level time series metrics
  3. Investor metrics

🏆 Winning Combination of Pandas Methods

We will be using a combination of Pandas methods that breeze data manipulation and keep our pipeline looking clean.

  • pandas.DataFrame.explode or pandas.Dataframe.melt
  • pandas.DataFrame.groupby
  • pandas.DataFrame.reset_index
  • pandas.merge

🌍 Country Level Stats

Firstly, we will calculate the count of Unicorns per country and the average Unicorn valuation (in billions) per country.

     country  company  mean_valuations_billion_usd  max_valuations_billion_usd
0  Argentina        1                     2.450000                        2.45
1  Australia        6                     8.433333                       40.00
2    Austria        2                     3.805000                        4.11
3    Bahamas        1                    32.000000                       32.00
4    Belgium        3                     2.983333                        5.25

output

📈 Example Plot using country_stats Output Table

 

image

 

The plot below is one example of a plot made with our country_stats table. We can quickly see the US leading the world in the total number of Unicorn companies.

 

⏳ Country Level Time Series

For these metrics, we group by country and date_joined columns, to count the number of Unicorns over time and sum the valuations.

💡

Note I previously sorted the dataframe by date_joined.

		 country founded_year  company  valuation_billion_usd
0  Argentina   2017-01-01        1                   2.45
1  Australia   2004-01-01        1                   1.60
2  Australia   2011-01-01        2                   2.50
3  Australia   2012-01-01        1                  40.00
4  Australia   2015-01-01        2                   6.50

output

➕ Cumulative Time Series

So far we have only generated time-series metrics at the point in time. However, it’s easier on the eye, to view the cumulative sum over time.

These steps take the generated time_series table and use an expanding window to calculate a cumulative sum.

 

		 country founded_year  company  valuation_billion_usd  company_cumsum  valuation_cumsum 
0  Argentina   2017-01-01        1                   2.45             1.0              2.45   
1  Australia   2004-01-01        1                   1.60             1.0              1.60 
2  Australia   2011-01-01        2                   2.50             3.0              4.10 
3  Australia   2012-01-01        1                  40.00             4.0              44.10 
4  Australia   2015-01-01        2                   6.50             6.0              50.60            

output

📈 Example Plot Using time_series Output Table

The time series plot below is made with our cumulative results, for the number of Unicorns per country. We can see since 2020–2021 the US has reached a trajectory for producing Unicorns that could not be matched by China. Whilst, the India and UK may be just beginning their growth stages.

 

 

image

 

🧑‍💼 Investor Stats

Generating investor metrics is more complex. Each company’s investors are stored as a comma-separated string.

For example, “Avant” have select_investors “RRE Ventures, Tiger Global, August Capital”.

We want to reuse the same code format, as with the country-level metrics, to make use of the pandas.DataFrame.groupby method. This will help us refactor later on.

 

	company                      select_investor
215 Otto Bock HealthCare         EQT Partners
390 Avant                        RRE Ventures, Tiger Global, August Capital  
520 Promasidor Holdings          IFC, Ajinomoto 
609 Five Star Business Finance   Sequoia Capital India, Tiger Global Management...
774 Radius Payment Solutions     Inflexion Private Equity                         

example output

🏗 Un-pivoting Investors

Un-pivoting is key to this design pattern, as we want to make use of the groupbymethod on individual investors.

Using pandas.DataFrame.explode, we generate an additional column for individual investors. Note that we now have multiple rows per company in our table.

 

💡Here I have use explode to un-pivot. Another method to check out is melt.

 

	 company select_investors_single  
0  Bytedance   Sequoia Capital China
1  Bytedance    SIG Asia Investments
2  Bytedance              Sina Weibo
3  Bytedance          Softbank Group  
4     SpaceX           Founders Fund  

output

 

The next step is to generate simple investor stats, company count, and valuation of the companies, in each investor’s portfolio.

📈 Example Plot using investors_stats

The histogram below shows the distribution of investors by the total number of Unicorns in their portfolio. We see a power-law type distribution where most investors have only 1 unicorn, whilst few have invested in many. This type of distribution can also be found in populational monetary wealth and social networks.

 

🔧 Pipeline so far

So far we have a metrics pipeline that looks fairly neat, but we are only generating a total of 8 metrics. If we were to extend this to 20–30 metrics, our script would start to see a lot of repetition.

The format of our code up to now is a simple python script. Thus our code cannot be isolated and unit-tested.

Our only testing option is to run the entire script and assess the output, in an end-to-end style test. This is not great as it could take a long time to run.

🚧 Pipeline refactor

🚪Open-close Principle

 

“Open for extension but closed for modification.”

 

Let's refactor our code to follow the open-close principle as best as possible.

 

  1. Move metrics functions to a Metrics class and enlist the use pandas.DataFrame.apply method.
  2. Remove the repeat calls to pandas.DataFrame.groupby, with a generate_metrics function, using Python’s built-in getattr function.
  3. Create a metrics config file, which is passed to our generate_metricsfunction, with the meta-data required to generate our metrics.

 

💡Check out SOLID design principles here.

🧑‍🎓 Metrics Class

By shifting our metrics to a class, we can isolate the metrics, and build unit tests for each metric. By using the Pandas.DataFrame.apply method, we can add personalized metrics, and leverage other python packages that are not included in Pandas.

 

Following the open-close principle, if we wanted to add metrics we would create a new class that inherits our class Metrics.

📂 Config File

The config file has a list of metrics for each table we want to generate. If we want to add or remove metrics or change naming etc, we simply change the config file. This way we are not editing our codebase itself.

⚙️ Generate Metrics Functions

This function takes in our Unicorn data, an instance of our Metrics class, and our metrics config, and returns a metrics data frame.

Steps:

  1. Uses getattr to create a pandas.DataFrame.groupby.apply object
  2. Uses getattr to create a Metrics class method object (e.g. Metrics.count)
  3. Calls the pandas.DataFrame.groupby.apply an object passing the Metrics method object

🪄 Refactored Pipeline

Finally, we arrive at our refactored pipeline. We can easily add metrics to existing tables by defining new metrics classes and adding them to our config file.

✅ 🔀 Benefits and Tradeoffs

Benefits

  • The code is modular and more manageable for scaling up our metrics.
  • A config file gives us more flexibility to add and takeout metrics without touching co
  • Easier to test the functionality of our code as it has been isolated.

Tradeoffs

  • More code to maintain for a small pipeline.
  • The readability of our code has been reduced.
  • Harder to debug for a small pipeline.

📚 Resources

🐍 Codebase found here.

📊 Kaggle Unicorn Companies dataset found here.

Third-party libraries:

  • pandas.DataFrame.explode
  • pandas.DataFrame.groupby
  • pandas.DataFrame.reset_index
  • pandas.merge
  • Python’s built-in getattr

#physics #nasa #space #facts #universe #knowledge #dailyfacts #biology #factz #chemistry #astronomy #education #earth #memes #cosmos #amazing #nature #allfacts #tech #innovation #astrophysics #code #ui #html #design #programming #userinterface #appdesign #digitaldesign #dribbble #uxdesign #webdesign #websitedesign #developer #celafarai #frasedelgiorno #webdeveloper #graphicdesign #marketing #instagram #dev #aforisma #frasi #softwareengineering #pensierodelgiorno #frontend #programmer #radio #pensiero #wireframe #clubdelcomic
#userinterface #appdesign #digitaldesign #dribbble #uxdesign #candivit #demetozdemir #canem