Amazon ML first impressions

I’m certainly not the only person blogging like this; see, for example, Julien Simon’s test drive.

About two weeks ago,[1] the Amazon Machine Learning service was launched. The official blog post announcing the launch contains what I think is the core premise of the product:

You can benefit from machine learning even if you don’t have an advanced degree in statistics or the desire to setup, run, and maintain your own processing and storage infrastructure.

In other words, it aims to allow a team to do data science without a data scientist. Good results still require domain knowledge, statistical awareness, and software engineering ability to be present collectively, but they don’t need to be possessed by a single developer. In the degenerate \(n = 1\) case of an individual building a product, it means that these faculties don’t need to be exercised simultaneously.

The description of feature selection and feature extraction for the service supports this interpretation:

You should plan to spend some time enriching your data in order to ensure that it is a good match for the training process. As a simple example, you might start out with location data that is based on zip or postal codes. After some analysis, you could very well discover that you can improve the quality of the results by using a different location representation that contains greater or lesser resolution. The ML training process is iterative; you should definitely plan to spend some time understanding and evaluating your initial results and then using them to enrich your data.
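As a tiny, hypothetical illustration of that kind of enrichment, coarsening a five-digit zip code down to its three-digit prefix with pandas might look like this (the column names and values are invented):

import pandas as pd

# Invented customer data with a five-digit zip code attribute.
df = pd.DataFrame({"zip": ["98101", "98052", "10001"], "subscribed": [1, 0, 0]})

# Lesser resolution: keep only the three-digit prefix, which pools nearby
# zip codes and may generalize better when there is little data per zip.
df["zip3"] = df["zip"].str[:3]

Greater resolution would go the other way, say by joining in attributes at a finer geographic level.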

While this does not strictly require an “advanced degree in statistics”, it’s difficult to do this with confidence and facility without exposure to the rules and tricks and folklore conveyed through such training. Where the advanced training might be most redundant is in cases where computational limits come into play. There are situations where variable selection may not play an important role in reducing model overfit but might be essential in shrinking the data so that it fits into main memory. One of the promises of ML as a service is that this can be solved transparently at the cost of cents of extra compute rather than through person-hours of cleverness.
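To make that concrete, the kind of cleverness being priced out looks something like the sketch below: prune a wide table down to the columns most associated with the target before anything heavier has to hold it all in memory. The file name, target column, and choice of \(k\) are placeholders, and this is only one of many ways to do it.

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical wide table with a 'target' column and many numeric predictors;
# keep only the k columns most associated with the target so that downstream
# steps work on a smaller in-memory frame.
df = pd.read_csv("wide.csv")
X, y = df.drop(columns="target"), df["target"]

selector = SelectKBest(score_func=f_classif, k=50).fit(X, y)
X_small = X.loc[:, selector.get_support()]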

Test driving Amazon Machine Learning

Since I’m already playing in the Amazon stack and am generally interested in logistic regression, I decided to work through the service’s tutorial and make some first-hand observations. The tutorial works through the following problem:

Our sample exercise in the tutorial shows how to identify potential customers for targeted marketing campaigns, but you can apply the same principles to create and use a variety of machine learning models. […] This dataset contains information about customers as well as descriptions of their behavior in response to previous marketing contacts. You use this data to identify which customers are most likely to subscribe to your new product.

The implications of choosing this example, as opposed to something classical like using the iris dataset for multiclass classification of flowers as Iris setosa, Iris versicolor,[2] or Iris virginica given various sepal and petal measurements, are clear.

My impressions are a bit scattered and are roughly divided into those based on reading documentation and those based on working through the example. I don’t aim for this to stand completely on its own, as there’s no point in duplicating anything that is said (better) in the official blog post.

Background impressions

The tutorial skips through most of the data-munging stage of analysis. Even the details of what the end product of the munging should look like are covered only in the Creating and Using Datasources section of the Developer Guide.

A datasource can be understood to be similar to an R data frame but a bit richer. The easiest way to make one is from input data in the form of a CSV[3] file, where each line is an observation with values for each of a fixed set of attributes. Alternatively, datasources can be created directly from RDS MySQL or Amazon Redshift in big data situations. Datasources are potentially leaky abstractions, since they don’t store a copy of the data.
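For reference, creating a datasource programmatically rather than through the console goes roughly like this with boto3’s machinelearning client; the IDs, bucket, and schema location below are placeholders.

import boto3

ml = boto3.client("machinelearning")

# Point the datasource at a CSV in S3 plus a schema describing its attributes.
# ComputeStatistics=True asks the service to compute the descriptive statistics
# that later drive choices like quantile binning.
ml.create_data_source_from_s3(
    DataSourceId="ds-banking-example",                               # placeholder
    DataSourceName="banking.csv datasource",
    DataSpec={
        "DataLocationS3": "s3://my-bucket/banking.csv",              # placeholder
        "DataSchemaLocationS3": "s3://my-bucket/banking.csv.schema",
    },
    ComputeStatistics=True,
)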

The observations are coupled with a schema that organizes information about what sort of data the attributes represent. Of primary importance is the target attribute, which is what the ML algorithm is trying to learn.
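The schema itself is a small JSON file. For the banking data it would look something like the sketch below; only a few attributes are shown, and I haven’t copied the console’s generated file verbatim, so treat the details as approximate.

{
  "version" : "1.0",
  "targetAttributeName" : "y",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [
    { "attributeName" : "age", "attributeType" : "NUMERIC" },
    { "attributeName" : "job", "attributeType" : "CATEGORICAL" },
    { "attributeName" : "pdays", "attributeType" : "NUMERIC" },
    { "attributeName" : "y", "attributeType" : "BINARY" }
  ]
}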

The observation size is bounded by 10MB, so if one were foolish enough to use this on raw image data, it would work but it would be a tight fit. This poses no problem for the bag-of-words representations used for NLP tasks.

Not much is done with missing values except that they trigger ignoring the entire observation in which they are found. There is also some subtlety in how they are only permitted for numeric attributes; for categorical attributes they can be represented by adding an additional category like “unknown”. The documentation suggests that missing values “should be corrected, if possible”, but doing this statistically seems like a core task in commoditized ML, one that could be accomplished using something like multiple imputation.
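Here is a hedged sketch of what that client-side correction might look like: an explicit “unknown” level for missing categoricals and, as a crude stand-in for multiple imputation, a single mean imputation for missing numerics. The file and column names are simply borrowed from the banking example.

import pandas as pd

df = pd.read_csv("banking.csv")

# Categorical attribute: represent missingness as its own level, which the
# service then treats as just another category.
df["job"] = df["job"].fillna("unknown")

# Numeric attribute: single mean imputation; proper multiple imputation would
# generate several completed datasets and propagate the extra uncertainty.
df["age"] = df["age"].fillna(df["age"].mean())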

Practical impressions

To get an idea of the scale of the example, the input files are banking.csv (4.7 MB, 41,188 observations) and banking-batch.csv (469 KB, 4,119 observations), which have been adapted from here. The observations have 20 predictor attributes and a binary target attribute. Data munging included translating the target attribute from no/yes to 0/1, which would have to be done programmatically or in a spreadsheet application. Uploading these files to S3 took maybe 10 seconds and is not the bottleneck in this workflow. Processing of the banking.csv file into a datasource took an additional few minutes.
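Done programmatically, that no/yes to 0/1 translation is a few lines of pandas (assuming the semicolon-delimited UCI original as the starting point):

import pandas as pd

# The original UCI bank marketing file is semicolon-delimited.
df = pd.read_csv("bank-additional-full.csv", sep=";")

# Recode the binary target from no/yes to 0/1 before uploading to S3.
df["y"] = df["y"].map({"no": 0, "yes": 1})

df.to_csv("banking.csv", index=False)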

The automatic recognition of attribute types worked well, although the sample values for the numeric pdays attribute of 999, 999, and 6 suggest something weird is going on.[4]

The default training and evaluation setting is documented as:

Select this option to use the recommended ML model parameters and to have an evaluation automatically created for you. Amazon ML will split your input data and use 70 percent for training (training datasource) and 30 percent for evaluating this ML model (evaluation datasource).

This seems like a very reasonable choice. Looking at the advanced settings before training the model reveals the use of \(L_2\) regularization with a “Mild” amount of \(\lambda = \text{1e-6}\). If I had not accepted the defaults, I would have had more choice over the recipe:

{
  "groups" : {
    "NUMERIC_VARS_QB_50" : "group('cons_conf_idx','emp_var_rate','cons_price_idx')",
    "NUMERIC_VARS_QB_20" : "group('pdays')",
    "NUMERIC_VARS_QB_500" : "group('age','campaign')",
    "NUMERIC_VARS_QB_10" : "group('duration','euribor3m','previous','nr_employed')"
  },
  "assignments" : { },
  "outputs" : [ "ALL_BINARY", "ALL_CATEGORICAL", "quantile_bin(NUMERIC_VARS_QB_50,50)", "quantile_bin(NUMERIC_VARS_QB_20,20)", "quantile_bin(NUMERIC_VARS_QB_500,500)", "quantile_bin(NUMERIC_VARS_QB_10,10)" ]
}

This explained what had earlier been a mystery to me: why, upon creation of datasources, descriptive statistics on the attributes are computed and stored. It seems that these statistics are used to determine the binning of some attributes. In particular, the numerical pdays attribute with its special value of 999 is automagically binned into a categorical attribute.
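As a loose client-side analogue of what the defaults appear to do (a 70/30 split, quantile binning of the numeric attributes, and logistic regression with mild \(L_2\) regularization), something in the spirit of the following scikit-learn sketch would apply; the column subsets and hyperparameters are purely illustrative and not what Amazon ML actually runs.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

df = pd.read_csv("banking.csv")
X, y = df.drop(columns="y"), df["y"]

numeric = ["pdays", "age", "campaign"]           # illustrative subset
categorical = ["job", "marital", "education"]    # illustrative subset

# Quantile-bin the numeric attributes (the recipe above uses per-group bin
# counts such as 20 for pdays) and one-hot encode the categoricals.
prep = ColumnTransformer([
    ("bins", KBinsDiscretizer(n_bins=20, encode="onehot", strategy="quantile"), numeric),
    ("cats", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Mild L2 penalty; C is an inverse regularization strength, so it maps only
# loosely onto Amazon ML's lambda = 1e-6.
model = make_pipeline(prep, LogisticRegression(penalty="l2", C=1e4, max_iter=1000))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)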

Starting inference then required waiting the promised “a few minutes”[5] for results, which would make this a bit annoying for exploratory data analysis.

Amazon ML has chosen AUC (“an industry-standard quality metric”) as its default performance metric for binary classification, the task explored in the tutorial. The AUC ranges from 0.5[6] through 1 on models with increasingly good predictive performance, and it gives the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one on new data (as long as it was generated from the same process as the training data!). This metric integrates over all the possible cut-offs that turn predicted probabilities of the positive outcome into 0/1 predictions. The service has nice, intuitive tools for choosing the cut-off, which is equivalent to putting a relative weighting on the costs of false positives and false negatives.
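The same metric, and a cost-weighted choice of cut-off, can be reproduced from held-out labels and predicted probabilities; the toy arrays and the 5:1 cost ratio below are just illustrative assumptions.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy held-out labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1])
p_hat = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

print("AUC:", roc_auc_score(y_true, p_hat))

# Pick the cut-off that minimizes expected cost when a false negative is
# deemed five times as costly as a false positive.
fpr, tpr, thresholds = roc_curve(y_true, p_hat)
fn_cost, fp_cost = 5.0, 1.0
n_pos, n_neg = y_true.sum(), (1 - y_true).sum()
expected_cost = fn_cost * (1 - tpr) * n_pos + fp_cost * fpr * n_neg
print("cut-off:", thresholds[np.argmin(expected_cost)])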

A deeper understanding of how inference works can be gained by looking at the log file (or by actually consulting the relevant documentation!). Without going into an in-depth analysis, model fitting appears to involve multiple stages, with “consolidate” actions starting at lines 533 and 708 of the log for this particular run and this particular data.

Once a model has been fit and a prediction cut-off has been chosen, new batch predictions[7] also take a few minutes. The model can also be left running to enable real-time predictions, at a slightly greater cost.
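Both prediction modes are also exposed through the API; a rough boto3 sketch follows, with placeholder IDs, bucket, and endpoint URL, and with the caveat that the model and endpoint have to finish building before any of this will respond.

import boto3

ml = boto3.client("machinelearning")

# Batch: score a whole datasource; results land in S3 as a gzipped CSV.
ml.create_batch_prediction(
    BatchPredictionId="bp-banking-example",          # placeholder IDs and URIs
    MLModelId="ml-banking-example",
    BatchPredictionDataSourceId="ds-banking-batch",
    OutputUri="s3://my-bucket/predictions/",
)

# Real-time: requires an endpoint left running, at extra cost.
ml.create_realtime_endpoint(MLModelId="ml-banking-example")
response = ml.predict(
    MLModelId="ml-banking-example",
    Record={"age": "44", "job": "technician"},       # attribute values as strings
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(response["Prediction"])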

Overall, prediction seems much less integrated than the other steps of the workflow. For batch predictions, the output is dumped as a gzipped CSV file into a specified S3 bucket. For real-time predictions, the output is returned through a web service. In either case, there is little ability to check that the predictor or target distributions for new observations, separately or jointly, are tracking those of the training data, which is required for the learned model to perform well.
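One simple way to watch for this on the client side, at least for numeric attributes, is a two-sample Kolmogorov-Smirnov test of each predictor in the new batch against the training data; the threshold below is arbitrary and this is only a sketch of the idea.

import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("banking.csv")
new = pd.read_csv("banking-batch.csv")

# Flag numeric predictors whose distribution in the new batch has drifted
# away from the training distribution (a small p-value suggests drift).
for col in train.select_dtypes("number").columns:
    stat, p = ks_2samp(train[col], new[col])
    if p < 0.01:
        print(f"possible drift in {col}: KS statistic {stat:.3f}, p-value {p:.3g}")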

Conclusions

I found the service very slick, easy to use, and reassuringly free of poor statistical choices that would raise my hackles. I recall some Twitter hot takes chastising it for daring to give a service that only offered logistic regression the label “machine learning”. Such criticism seems misguided. With a strong infrastructure in place, it seems like it would be a trivial matter to add new models or learning algorithms. And with built-in model assessment and comparison capabilities, the choice to use a new model can be made completely on the basis of performance or fitting time. If anything, more sophisticated models may appeal more to less sophisticated modelers.


  1. According to the API Reference, “[t]he current version of the Amazon ML API is 2014-12-12”. I’d imagine a lot of dogfooding had taken place before that to bring it to its present state, and between then and its public introduction.

  2. Sir Ronald Fisher would probably have favored spelling this as “Iris versicolour”.

  3. The CSV format, short for “comma-separated values”, was formally specified only relatively recently by RFC 4180 in 2005. Many existing datasets were laid down in its absence and (retroactively) in defiance of it. See, for example, the explanation of how the Super CSV package treats input that is de facto but not de jure CSV for insight into the implications of this.

  4. Brief detective work reveals that this field gives the “number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)”. The best way to handle this is probably recoding into a categorical variable.

  5. Jeff Barr, who wrote the blog post announcing the service, narrated this as “I took a quick break to water my vegetable garden and returned to find that my model was ready to go”, which sounds about right.

  6. Formally, AUCs as low as 0 up through 0.5 can be achieved, but this requires finding predictors that do worse than flipping a (properly) biased coin. A predictor that does worse than chance can be turned into one that does better than chance by flipping its predictions!

  7. Jeff Barr’s narrative continues here: “After another quick trip to the garden, my predictions were ready!”