Python Bytes 167
Sponsored by Datadog: pythonbytes.fm/datadog
Special guest: Vicki Boykis: @vboykis
Michael #1: clize: Turn functions into command-line interfaces
- via Marcelo
- Follow-up to Typer from episode 164.
- Features
- Create command-line interfaces by creating functions and passing them to clize.run.
- Enjoy a CLI automatically created from your functions’ parameters.
- Bring your users familiar --help messages generated from your docstrings.
- Reuse functionality across multiple commands using decorators.
- Extend Clize with new parameter behavior.
- I love how this is pure Python without its own API for the default case
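To make the "pure Python" point concrete, here is a minimal sketch of the default case. The function and its docstring are modeled on the style of the clize documentation; `hello_world`, its parameters, and `main` are illustrative names, and clize itself is a third-party package that has to be installed separately.

```python
def hello_world(name=None, *, no_capitalize=False):
    """Greets the world or the given name.

    :param name: If specified, only greet this person.
    :param no_capitalize: Don't capitalize the given name.
    """
    if name:
        if not no_capitalize:
            name = name.title()
        return "Hello {0}!".format(name)
    return "Hello world!"


def main():
    # clize is a third-party package (pip install clize). It is imported
    # here so hello_world stays importable and testable without it.
    from clize import run

    # Positional parameters become CLI arguments; keyword-only
    # parameters become --options. The docstring feeds --help.
    run(hello_world)
```

Calling `main()` at the bottom of a script (say, a hypothetical `hello.py`) should let you run something like `python hello.py alice --no-capitalize`. The function stays an ordinary Python function, so it can be called and tested directly.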
Vicki #2: How to cheat at Kaggle AI contests
- Kaggle is a platform, now owned by Google, that allows data scientists to find data sets, learn data science, and participate in competitions
- Many people participate in Kaggle competitions to sharpen their data science/modeling skills
- Recently, a competition that was related to analyzing pet shelter data resulted in a huge controversy
- Petfinder.my is a platform that helps people in Malaysia find pets to rescue from shelters. In 2019, they announced a collaboration with Kaggle to build a machine learning model that predicts which pets (worldwide) are more likely to be adopted, based on the metadata of the descriptions on the site.
- The total prize offered was $25,000
- After several months, a contestant won. He was previously a Kaggle grandmaster, and won $10k.
- A volunteer, Benjamin Minixhofer, offered to put the algorithm in production, and when he did, he found that there was a huge discrepancy between first and second place
- Technical Aspects of the controversy:
- The competition asked contestants to predict the speed at which a pet would be adopted, on a scale from 1-5. The data included input features like type of animal, breed, coloration, whether the animal was vaccinated, and adoption fee
- The initial training set had 15k animals and the teams, after a couple months, were then given 4k animals that their algorithms had not seen before as a test of how accurate they were (common machine learning best practice).
- In a Jupyter notebook Kernel on Kaggle, Minixhofer explains how the winning team cheated
- First, they individually scraped Petfinder.my to find the answers for the 4k test data
- Using MD5, they created a hash for each unique pet and looked up the score for each hash in the external dataset; there were 3500 overlaps
- They used Pandas column manipulation to get at the hidden prediction variable for every 10th pet, replacing the prediction that should have been generated by the algorithm with the actual value
- The tools were mostly obfuscated functions, Pandas, and dictionaries, as well as MD5 hashes
- Fallout:
- He was fired from H2O.ai
- Kaggle issued an apology
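The hash lookup described under the technical aspects above is also how an organizer could audit for this kind of leakage. The following is a rough sketch under stated assumptions, not the code from Minixhofer's notebook: the field names, the normalization, the `adoption_speed` label, and the toy rows are all hypothetical, and it uses plain dictionaries rather than Pandas.

```python
import hashlib


def fingerprint(record, fields):
    """Return an MD5 hex digest over the chosen fields of one record."""
    # Normalize case and whitespace so trivially different copies still match.
    joined = "|".join(str(record[f]).strip().lower() for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def find_leaked_labels(test_rows, external_rows, fields, label="adoption_speed"):
    """Map test-row index -> label found externally under the same fingerprint."""
    known = {fingerprint(r, fields): r[label] for r in external_rows}
    leaked = {}
    for i, row in enumerate(test_rows):
        key = fingerprint(row, fields)
        if key in known:
            leaked[i] = known[key]
    return leaked


# Hypothetical toy data: the test rows carry no label, the external rows do.
FIELDS = ("type", "breed", "color", "fee")
test_rows = [
    {"type": "dog", "breed": "Mixed", "color": "brown", "fee": 0},
    {"type": "cat", "breed": "Siamese", "color": "cream", "fee": 50},
    {"type": "dog", "breed": "Poodle", "color": "white", "fee": 100},
]
external_rows = [
    {"type": "Dog", "breed": "mixed", "color": "Brown", "fee": 0, "adoption_speed": 2},
    {"type": "cat", "breed": "Persian", "color": "grey", "fee": 20, "adoption_speed": 4},
]

leaked = find_leaked_labels(test_rows, external_rows, FIELDS)
print(f"{len(leaked)} of {len(test_rows)} test rows have labels available externally")
```

Any nonzero overlap means the held-out set is no longer unseen, since a submission could simply look the answers up instead of predicting them.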
Sponsor section:
Today’s episode is sponsored by Datadog, a cloud-scale monitoring platform that unifies metrics, logs, and traces. Monitor your Python applications in real time; pinpoint bottlenecks with detailed flame graphs; and trace requests as they travel across service boundaries. Plus, their tracing client auto-instruments popular frameworks like Django, asyncio, and Flask so you can quickly start visualizing the health and performance of your Python applications.
Get started today with a free 14-day trial and Datadog will send you a complimentary t-shirt.
pythonbytes.fm/datadog
Michael #3: Configuring uWSGI for Production Deployment
- We run a lot of uWSGI-backed services. I spoke about this in depth on Talk Python 215: The software powering Talk Python courses and podcast.
- This is guidance from Bloomberg Engineering’s Structured Products Applications group
- We chose uWSGI as our host because of its performance and feature set. But, while powerful, uWSGI’s defaults are driven by backward compatibility and are not ideal for new deployments.
- There is also an official Things to Know doc.
- Unbit, the developer of uWSGI, has “decided to fix all of the bad defaults (especially for the Python plugin) in the 2.1 branch.” The 2.1 branch is not released yet.