Clearbit: Adding Machine Learning to Your Lead Qualification Process
At Clearbit, we are always looking for ways to optimize our sales process through analytics. Recently, we asked ourselves a question: “How can we predict which of our free customers will become paid ones?” To answer it, we decided to enhance our sales process with Amazon Machine Learning (Amazon ML).

Amazon ML allows us to analyze our signups and, based on historical data, determine which ones are most likely to become paying customers. This makes our revenue more predictable and tells our sales team which leads to prioritize. Today, you will learn how to implement the same technique in your own sales funnel.

Getting our data into Amazon ML

Amazon ML is great because it allows anyone to use machine learning without having to master complex ML algorithms. You can create various ML models with guided wizards and obtain predictions using simple APIs. You still need to do a little bit of initial work, but done correctly, the payoff can be well worth it.

The first step in answering our question was to get our existing data into Amazon ML. At Clearbit, all of our analytics data runs through the Enrichment API integration with Segment. Thanks to this integration, we are able to append person and company data to all of our customers.
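
For illustration, here is roughly what reporting a signup to Segment looks like with their Python library; once the event reaches Segment, the Clearbit integration handles the enrichment server-side. The write key, user ID, and traits below are placeholders, not our production setup:

```
# Sketch: report a signup to Segment with its Python library.
# Clearbit's Segment integration then enriches the user with
# person and company data on Segment's side.
import analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"  # placeholder

# An email address is all Clearbit needs to enrich a signup.
analytics.identify("user_123", {
    "email": "alex@example.com",
    "plan": "free",
})
analytics.flush()  # make sure the event is sent before the script exits
```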

We also use Segment’s integration with Amazon Redshift to upload our data to Amazon’s servers. If you do not have a Segment account, you can upload your data manually to an Amazon S3 bucket by exporting it as a .csv file. Our data contains the enriched customer information along with the list of won and lost deals - this is the column we will be trying to predict.
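
If you go the manual route, uploading the exported .csv to S3 is a one-liner with boto3. The bucket and key names here are hypothetical:

```
# Sketch: upload the exported signup data to an S3 bucket with boto3
import boto3

s3 = boto3.client("s3")
s3.upload_file("signups.csv", "my-ml-bucket", "clearbit/signups.csv")
```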

Before we move on, we need to mention two things about your data. Machine learning can at times feel like magic, but it is far from it. In fact, a model is only as good as the data you feed it. It is therefore crucial to have enough data for your ML model; otherwise, your results will be wildly inaccurate. Similarly, if your data is of poor quality, do not expect your results to be any better (garbage in, garbage out).

Creating metadata

Once we have uploaded our data to the servers, we need to reference it correctly so that Amazon ML can perform its magic on it. We need to tell Amazon where our data is located, give the datasource we are creating a name, and grant Amazon permission to read our data.
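
If you prefer scripting over the console wizard, the same step can be done through the boto3 machinelearning client. This is a minimal sketch; the datasource ID and S3 paths are made up for the example:

```
# Sketch: create an Amazon ML datasource pointing at the CSV in S3
import boto3

ml = boto3.client("machinelearning")
ml.create_data_source_from_s3(
    DataSourceId="ds-signups-full",  # hypothetical ID
    DataSourceName="Clearbit signups",
    DataSpec={
        "DataLocationS3": "s3://my-ml-bucket/clearbit/signups.csv",
        "DataSchemaLocationS3": "s3://my-ml-bucket/clearbit/signups.csv.schema",
    },
    ComputeStatistics=True,  # required for datasources used in training
)
```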

After granting Amazon permission, we need to make sure that Amazon ML knows the schema of our data. You can either provide a schema file during the upload, or you can let Amazon ML infer the schema.
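
A schema file is a small JSON document describing each column. A trimmed, hypothetical example for data like ours might look like this (the attribute names are illustrative); the targetAttributeName field is how Amazon ML knows which column to predict:

```
{
  "version": "1.0",
  "dataFormat": "CSV",
  "dataFileContainsHeader": true,
  "targetAttributeName": "paying_customer",
  "attributes": [
    { "attributeName": "email",             "attributeType": "TEXT" },
    { "attributeName": "company_employees", "attributeType": "NUMERIC" },
    { "attributeName": "company_industry",  "attributeType": "CATEGORICAL" },
    { "attributeName": "paying_customer",   "attributeType": "BINARY" }
  ]
}
```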

Now we need to mark the target attribute we will be trying to predict - in our case, whether a user became a paying customer. At this point, Amazon ML creates two datasources from the existing information: one will be used for training and the other for evaluating our model.

Training the ML model

Now that we have the data uploaded and marked up correctly, we can start playing with it!

You can create ML models by going to the Amazon ML Console → Amazon Machine Learning → ML Models → Create a new ML model. To create a default model, we name it and simply point the input data to the datasource we created in the previous step. Amazon ML then adjusts its default settings depending on the type of data we selected.

In our case, we will be training the model to predict the value of paying_customer. This field can be either 0 or 1, so Amazon ML automatically applied the Binary Classification model. At the time of writing, Amazon ML offers two other model types - Multiclass Classification and Regression.
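
The console does all of this for you, but the equivalent API call shows what is happening under the hood. A minimal sketch with the hypothetical IDs from earlier:

```
# Sketch: train a binary classification model on the training datasource
import boto3

ml = boto3.client("machinelearning")
ml.create_ml_model(
    MLModelId="ml-paying-customer-v1",        # hypothetical ID
    MLModelName="Predict paying_customer",
    MLModelType="BINARY",                     # 0/1 target -> binary classification
    TrainingDataSourceId="ds-signups-train",  # the training split
)
```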

If you need to create a custom model, you can set the training parameters yourself. Amazon does a pretty good job of choosing sensible defaults (as it did in our case), but sometimes you might want to have a human in charge. In that case, you can tweak these parameters:
  • Maximum model size
  • Maximum number of passes over training data
  • Shuffle type
  • Regularization type
  • Regularization amount
Describing each parameter would be beyond the scope of this article, but you can read more about training parameters in the Amazon ML documentation.
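
For reference, these settings map to string key/value pairs passed as the Parameters argument of the model-creation call. The values below are arbitrary examples, not recommendations:

```
# Sketch: custom training parameters (Amazon ML expects string values)
custom_params = {
    "sgd.maxMLModelSizeInBytes": "33554432",  # 32 MiB cap on model size
    "sgd.maxPasses": "10",                    # passes over the training data
    "sgd.shuffleType": "auto",                # shuffle rows between passes
    "sgd.l2RegularizationAmount": "1e-6",     # L2 regularization strength
}
# Passed as Parameters=custom_params in the create_ml_model call above.
```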

When you are happy with the model settings, you can click Review to check your settings, and then Finish. Your model will then be placed into the processing queue. By default, the datasource is split into two sections at a ratio of 7:3: 70% of the input data is used for training and the remaining 30% for evaluating the model.
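
If you build the datasources yourself rather than letting the console split them, the same 70/30 split is expressed with a DataRearrangement string. Again a hedged sketch with hypothetical IDs and paths:

```
# Sketch: create complementary 70/30 training and evaluation datasources
import boto3

ml = boto3.client("machinelearning")
common = {
    "DataLocationS3": "s3://my-ml-bucket/clearbit/signups.csv",
    "DataSchemaLocationS3": "s3://my-ml-bucket/clearbit/signups.csv.schema",
}
ml.create_data_source_from_s3(
    DataSourceId="ds-signups-train",  # first 70% of the rows
    DataSpec=dict(common, DataRearrangement='{"splitting":{"percentBegin":0,"percentEnd":70}}'),
    ComputeStatistics=True,
)
ml.create_data_source_from_s3(
    DataSourceId="ds-signups-eval",   # remaining 30%, held out for evaluation
    DataSpec=dict(common, DataRearrangement='{"splitting":{"percentBegin":70,"percentEnd":100}}'),
    ComputeStatistics=True,
)
```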

Evaluating the ML model

Before we deploy our model in the wild, we need to check its accuracy. We have already trained it on part of the input data (70% by default). Now comes the test, where we show it the remaining data from the evaluation datasource (30% by default).

When evaluating a model, it is important to know the correct answers in advance, and the model must never have seen the data before. This way, we avoid rewarding our model for memorizing the input data - we want it to make predictions based on generalization. One thing to keep in mind is that both datasources need to have the same schema.
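
Kicking off an evaluation against the held-out datasource is a single call; the IDs below are the hypothetical ones from earlier:

```
# Sketch: evaluate the trained model on the 30% it has never seen
import boto3

ml = boto3.client("machinelearning")
ml.create_evaluation(
    EvaluationId="ev-paying-customer-v1",
    MLModelId="ml-paying-customer-v1",
    EvaluationDataSourceId="ds-signups-eval",
)
```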

As our target value is binary, Amazon offers a standard evaluation metric called Area Under the Curve (AUC). In short, AUC ranges from 0 to 1, with 0.5 as the baseline: a model scoring 0.5 performs no better than a series of random guesses, and anything below 0.5 is actually worse than random guessing. We therefore want our model to score well above 0.5.
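
Once the evaluation finishes, the AUC is available from the evaluation’s performance metrics. A sketch of reading it back:

```
# Sketch: fetch the AUC of the finished evaluation
import boto3

ml = boto3.client("machinelearning")
evaluation = ml.get_evaluation(EvaluationId="ev-paying-customer-v1")
auc = float(evaluation["PerformanceMetrics"]["Properties"]["BinaryAUC"])
print(f"AUC: {auc:.3f}")  # anything meaningfully above 0.5 beats guessing
```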

The last thing we had to do was select the cut-off point. Anything at or above the cut-off score will be predicted as 1 (paying customer) and anything below it as 0 (free tier). The choice of threshold depends heavily on your application, as it controls how many input rows are falsely identified in each direction.
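
The cut-off can be set in the console, or programmatically on the model itself. The 0.5 below is just a starting point for illustration, not our actual threshold:

```
# Sketch: set the score threshold; scores >= 0.5 are predicted as 1
import boto3

ml = boto3.client("machinelearning")
ml.update_ml_model(
    MLModelId="ml-paying-customer-v1",
    ScoreThreshold=0.5,
)
```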

Taking action!

With the model up and running, we can start improving our lead qualification process! At Clearbit, we get about 150 signups every day. Out of these, our ML model identifies approximately 5–10 leads that are destined to become paying customers. We don’t want to leave everything up to destiny though, so we decided to set up a little process.

When our model identifies a promising new lead, we pass this information through Segment and Customer.io to prevent our automated welcome email from going out to them. Instead, we let our team know about these signups with a Slack notification to our sales channel and an internal email. Our sales team then personally reaches out to the qualified signups and tries to convert them into paying customers as soon as possible.
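
To give a feel for the moving parts, here is a hedged sketch of scoring a fresh signup in real time and pinging a Slack channel when it qualifies. The endpoint, record fields, and webhook URL are placeholders; our actual pipeline runs through Segment and Customer.io as described above:

```
# Sketch: score a new signup and notify sales if it looks promising
import boto3
import requests

ml = boto3.client("machinelearning")

# A real-time endpoint is created once per model; in practice you would
# wait until its status is READY before sending predictions.
endpoint = ml.create_realtime_endpoint(MLModelId="ml-paying-customer-v1")
endpoint_url = endpoint["RealtimeEndpointInfo"]["EndpointUrl"]

# All record values are passed as strings.
result = ml.predict(
    MLModelId="ml-paying-customer-v1",
    Record={"email": "alex@example.com", "company_employees": "250"},
    PredictEndpoint=endpoint_url,
)

if result["Prediction"]["predictedLabel"] == "1":  # predicted paying customer
    requests.post(
        "https://hooks.slack.com/services/XXX/YYY/ZZZ",  # placeholder webhook
        json={"text": "Qualified signup: alex@example.com"},
    )
```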