Good Data Analysis  |  ML Universal Guides  |  Google Developers - Deepstash

New Frontiers In Data Analysis

Not only do we typically work with very large data sets, but those data sets are extremely rich. That is, each row of data typically has many, many attributes. When you combine this with the temporal sequences of events for a given user, there are an enormous number of ways of looking at the data.

Contrast this with a typical academic psychology experiment where it's trivial for the researcher to look at every single data point. The problems posed by our large, high-dimensional data sets are very different from those encountered throughout most of the history of scientific work.

Data Distributions

Most practitioners use summary metrics (for example, mean, median, standard deviation, and so on) to communicate about distributions.

However, you should usually examine much richer distribution representations by generating histograms, cumulative distribution functions (CDFs), Quantile-Quantile (Q-Q) plots, and so on. These richer representations allow you to detect important features of the data, such as multimodal behavior or a significant class of outliers.
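
As a rough illustration of why summary metrics alone can mislead, the sketch below builds a coarse histogram and an empirical CDF with the Python standard library. The latency values and the helper names (`text_histogram`, `empirical_cdf`) are invented for this example and are not part of the original guide.

```python
import statistics
from collections import Counter

# Hypothetical page-load times in ms: a fast cluster and a slow cluster.
latencies = [100, 105, 110, 115, 120, 125, 130, 900, 910, 920, 930, 940]

def text_histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def empirical_cdf(values, x):
    """Fraction of values that are <= x."""
    return sum(v <= x for v in values) / len(values)

mean = statistics.mean(latencies)
median = statistics.median(latencies)
hist = text_histogram(latencies, bin_width=100)

print(f"mean={mean:.1f} median={median}")
for lower_edge, count in hist.items():
    print(f"{lower_edge:>4}+ {'#' * count}")
print(f"share of loads <= 500 ms: {empirical_cdf(latencies, 500):.2f}")
```

In this made-up data set the mean lands in a range where no observation exists at all, which the histogram shows immediately and the summary numbers do not.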

Consider The Outliers

Examine outliers carefully because they can be canaries in the coal mine that indicate more fundamental problems with your analysis.

It's fine to exclude outliers from your data or to lump them together into an "unusual" category, but you should make sure that you know why data ended up in that category.
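
A minimal sketch of setting outliers aside without losing track of them, assuming the common interquartile-range (IQR) fence rule as the definition of "unusual". The guide does not prescribe a rule; the function name `split_outliers`, the factor `k=1.5`, and the session data are assumptions made for illustration.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (typical, unusual) using the IQR fence rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    typical = [v for v in values if low <= v <= high]
    unusual = [v for v in values if v < low or v > high]
    return typical, unusual

# Hypothetical session lengths in minutes.
session_minutes = [12, 14, 15, 15, 16, 17, 18, 19, 20, 400]
typical, unusual = split_outliers(session_minutes)
print(f"kept {len(typical)} values, set aside {len(unusual)}: {unusual}")
```

Returning the unusual values instead of dropping them keeps them available for the follow-up question of why they ended up in that category.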

Look At The Noise

Randomness exists and will fool us. Some people think, “Google has so much data; the noise goes away.” This simply isn’t true. Every number or summary of data that you produce should have an accompanying notion of your confidence in this estimate (through measures such as confidence intervals and p-values).
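
To make the point concrete, here is a sketch of attaching a confidence interval to a rate. It assumes the simple normal-approximation (Wald) interval for a proportion, which is only one of several possible choices; the function name `proportion_ci` and the 5% click-through rate are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, trials, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / trials
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p - half_width, p + half_width

for trials in (1_000, 1_000_000):
    clicks = trials // 20  # hypothetical 5% click-through rate
    low, high = proportion_ci(clicks, trials)
    print(f"n={trials:>9,}: CTR 5.00% in [{low:.3%}, {high:.3%}]")
```

The interval narrows as the sample grows but never reaches zero width, and any slice of a large data set may be closer to the small-sample case than to the large one.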

Look At Examples

Anytime you are producing new analysis code, you need to look at examples from the underlying data and how your code is interpreting those examples. Your analysis is abstracting away many details from the underlying data to produce useful summaries.

How you sample these examples is important:

  • If you are classifying the underlying data, look at examples belonging to each class.
  • If it's a bigger class, look at more samples.
  • If you are computing a number (for example, page load time), make sure that you look at extreme examples.
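
One possible way to pull examples along these lines is sketched below. The row fields (`label`, `load_ms`), the 10% sampling fraction, and the minimum of two examples per class are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def sample_per_class(rows, label_key, fraction=0.1, minimum=2, seed=0):
    """Pick examples from every class; bigger classes yield more examples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    picked = {}
    for label, members in by_class.items():
        k = min(len(members), max(minimum, round(len(members) * fraction)))
        picked[label] = rng.sample(members, k)
    return picked

def extremes(rows, value_key, n=2):
    """Return the n smallest and the n largest rows by value_key."""
    ordered = sorted(rows, key=lambda r: r[value_key])
    return ordered[:n], ordered[-n:]

# Hypothetical classified log rows: 10 "spam" rows and 90 "ok" rows.
rows = [{"id": i, "label": "spam" if i % 10 == 0 else "ok", "load_ms": 50 + 7 * i}
        for i in range(100)]
samples = sample_per_class(rows, "label")
fastest, slowest = extremes(rows, "load_ms")
print({label: len(picked) for label, picked in samples.items()})
```

The minimum per class guards against a rare class contributing no examples at all, which is often the class most worth reading.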

Slice Your Data

Slicing means separating your data into subgroups and looking at metric values for each subgroup separately. We commonly slice along dimensions like browser, locale, domain, device type, and so on. If the underlying phenomenon is likely to work differently across subgroups, you must slice the data to confirm whether that is indeed the case.

Even if you do not expect slicing to produce different results, looking at a few slices for internal consistency gives you greater confidence that you are measuring the right thing.
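
A small sketch of slicing a ratio metric along one dimension at a time. The rows, the field names, and the helper `metric_by_slice` are hypothetical.

```python
from collections import defaultdict

def metric_by_slice(rows, dimension, numerator, denominator):
    """Compute sum(numerator) / sum(denominator) separately for each slice."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row[dimension]][0] += row[numerator]
        totals[row[dimension]][1] += row[denominator]
    return {key: num / den for key, (num, den) in totals.items() if den}

# Hypothetical aggregated log rows.
rows = [
    {"browser": "chrome", "device": "desktop", "clicks": 30, "impressions": 100},
    {"browser": "chrome", "device": "mobile", "clicks": 10, "impressions": 100},
    {"browser": "firefox", "device": "desktop", "clicks": 28, "impressions": 100},
    {"browser": "firefox", "device": "mobile", "clicks": 0, "impressions": 100},
]
by_browser = metric_by_slice(rows, "browser", "clicks", "impressions")
by_device = metric_by_slice(rows, "device", "clicks", "impressions")
print("by browser:", by_browser)
print("by device: ", by_device)
```

In this invented data the overall click-through rate of 17% hides a large desktop versus mobile gap, and the zero for Firefox on mobile is the kind of value that may point to a logging problem, not a real user behavior.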

Consider Practical Significance

With a large volume of data, it can be tempting to focus solely on statistical significance or to home in on the details of every bit of data. But you need to ask yourself, "Even if it is true that value X is 0.1% more than value Y, does it matter?"

This can be especially important if you are unable to understand or categorize part of your data. If you are unable to make sense of some user-agent strings in your logs, whether they represent 0.1% or 10% of the data makes a big difference in how much you should investigate those cases.
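
The sketch below shows how a 0.1% relative difference can be highly statistically significant once the sample is large enough. It assumes a standard two-sided, pooled two-proportion z-test; the counts and the function name `two_proportion_z_test` are invented for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: one billion events per arm, rates 10.00% vs 10.01%.
n = 1_000_000_000
z, p_value = two_proportion_z_test(100_000_000, n, 100_100_000, n)
relative_lift = (100_100_000 - 100_000_000) / 100_000_000
print(f"relative lift={relative_lift:.2%} z={z:.2f} p={p_value:.1e}")
```

The test answers whether the difference is distinguishable from noise. Whether a 0.1% lift is worth acting on is a separate judgment that the p-value cannot make for you.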

Check For Consistency Over Time

You should almost always try slicing data by units of time because many disturbances to underlying data happen as our systems evolve over time. (We often use days, but other units of time may also be useful.)

During the initial launch of a feature or new data collection, practitioners often carefully check that everything is working as expected. However, many breakages or unexpected behaviors can arise over time.

Check For Variation

Looking at day-over-day data also gives you a sense of the variation in the data that would eventually lead to confidence intervals or claims of statistical significance. This should not generally replace rigorous confidence-interval calculation, but often with large changes you can see they will be statistically significant just from the day-over-day graphs.
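
As a quick, informal version of this check, you can express a change in units of the usual day-to-day standard deviation. The daily values below are hypothetical, and this is an eyeball aid under that assumption, not a substitute for a proper interval.

```python
import statistics

# Hypothetical daily click-through rates (%) for two weeks before a launch.
baseline = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9]
# Hypothetical daily values for the three days after the launch.
after_launch = [5.9, 6.1, 6.0]

base_mean = statistics.mean(baseline)
base_sd = statistics.stdev(baseline)  # sample standard deviation across days
shift_in_sds = (statistics.mean(after_launch) - base_mean) / base_sd

print(f"baseline {base_mean:.2f} +/- {base_sd:.2f} per day")
print(f"post-launch shift is {shift_in_sds:.1f} daily standard deviations")
```

A shift of several daily standard deviations is the kind of large change you can often see directly on the graph; a shift of one or less calls for the rigorous calculation.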

Acknowledge And Count Your Filtering

Almost every large data analysis starts by filtering data in various stages. Maybe you want to consider only US users, or web searches, or searches with ads. Whatever the case, you must:

  • Acknowledge and clearly specify what filtering you are doing.
  • Count the amount of data being filtered at each step.

Often the best way to do the latter is to compute all your metrics, even for the population you are excluding. You can then look at that data to answer questions like, "What fraction of queries did spam filtering remove?"
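
A minimal sketch of counting what each filtering stage removes. The helper `apply_filters`, the row fields, and the three filters are hypothetical stand-ins for whatever your pipeline actually applies.

```python
def apply_filters(rows, filters):
    """Apply named filters in order, recording row counts before and after each."""
    report = []
    remaining = list(rows)
    for name, keep in filters:
        kept = [r for r in remaining if keep(r)]
        report.append((name, len(remaining), len(kept)))
        remaining = kept
    return remaining, report

# Hypothetical search log rows.
rows = [
    {"country": "US", "type": "web", "spam": False},
    {"country": "US", "type": "web", "spam": True},
    {"country": "US", "type": "image", "spam": False},
    {"country": "DE", "type": "web", "spam": False},
    {"country": "US", "type": "web", "spam": False},
]
filters = [
    ("US users only", lambda r: r["country"] == "US"),
    ("web searches only", lambda r: r["type"] == "web"),
    ("spam removed", lambda r: not r["spam"]),
]
final_rows, report = apply_filters(rows, filters)
for name, before, after in report:
    print(f"{name}: {before} -> {after} ({(before - after) / before:.0%} removed)")
```

Naming each filter makes the filtering explicit in the output, and the before and after counts answer the "what fraction did this step remove?" question directly.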

Ratios Should Have Clear Numerators And Denominators

The most interesting metrics are ratios of underlying measures. Oftentimes, interesting filtering or other data choices are hidden in the precise definitions of the numerator and denominator. For example, which of the following does “Queries / User” actually mean?

  • Queries / Users with a Query
  • Queries / Users who visited Google today
  • Queries / Users with an active account (yes, I would have to define active)

Being really clear here can avoid confusion for yourself and others.
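
To see how much the choice of denominator can matter, the sketch below computes "Queries / User" three ways on the same made-up data. The user records and the `active_account` flag are assumptions; how "active" is defined is exactly the kind of choice that needs to be stated.

```python
# Hypothetical per-user activity for one day.
users = [
    {"id": "a", "visited": True, "active_account": True, "queries": 6},
    {"id": "b", "visited": True, "active_account": False, "queries": 2},
    {"id": "c", "visited": True, "active_account": True, "queries": 0},
    {"id": "d", "visited": False, "active_account": True, "queries": 0},
    {"id": "e", "visited": False, "active_account": True, "queries": 0},
]
total_queries = sum(u["queries"] for u in users)

# Three candidate denominators for the same numerator.
denominators = {
    "users with a query": sum(u["queries"] > 0 for u in users),
    "users who visited today": sum(u["visited"] for u in users),
    "users with an active account": sum(u["active_account"] for u in users),
}
queries_per_user = {name: total_queries / count for name, count in denominators.items()}

for name, value in queries_per_user.items():
    print(f"queries / {name}: {value:.2f}")
```

The same eight queries yield 4.00, 2.67, or 2.00 queries per user depending on which population is counted.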

Three Stages Of Data Analysis: Validation, Description, and Evaluation

Validation: Do I believe the data is self-consistent, that it was collected correctly, and that it represents what I think it does?

Description: What's the objective interpretation of this data? For example, "Users make fewer queries classified as X," "In the experiment group, the time between X and Y is 1% larger," and "Fewer users go to the next page of results."

Evaluation: Given the description, does the data tell us that something good is happening for the user, for Google, or for the world?

Confirm Experiment And Data Collection Setup

Before looking at any data, make sure you understand the context in which the data was collected. If the data comes from an experiment, look at the configuration of the experiment. If it's from new client instrumentation, make sure you have at least a rough understanding of how the data is collected.

You may spot unusual/bad configurations or population restrictions (such as valid data only for Chrome). Anything notable here may help you build and verify theories later.

Check For What Shouldn't Change

As part of the "Validation" stage, before actually answering the question you are interested in (for example, "Did adding a picture of a face increase or decrease clicks?"), rule out any other variability in the data that might affect the experiment. For example:

  • Did the number of users change?
  • Did the right number of affected queries show up in all my subgroups?
  • Did error rates change?

These questions are sensible both for experiment/control comparisons and when examining trends over time.
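
A rough sketch of such a check, comparing metrics that should be unaffected between control and experiment. The metric names, the values, and the fixed 1% tolerance are assumptions; in practice the tolerance would more sensibly come from the observed noise in each metric.

```python
def changed_invariants(control, experiment, tolerance=0.01):
    """Return invariant metrics whose relative difference exceeds tolerance."""
    flagged = {}
    for name, base in control.items():
        rel_diff = (experiment[name] - base) / base
        if abs(rel_diff) > tolerance:
            flagged[name] = rel_diff
    return flagged

# Hypothetical totals for metrics the change is not expected to move.
control = {"users": 100_000, "affected_queries": 20_000, "error_rate": 0.0020}
experiment = {"users": 100_300, "affected_queries": 20_100, "error_rate": 0.0031}

flagged = changed_invariants(control, experiment)
for name, rel_diff in flagged.items():
    print(f"{name} moved by {rel_diff:+.0%}: investigate before reading results")
```

Here user and query counts agree within tolerance, but the error rate does not, which would call the click comparison into question until explained.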

Standard First, Custom Second

When looking at new features and new data, it's particularly tempting to jump right into the metrics that are new or special for this new feature. However, you should always look at standard metrics first, even if you expect them to change.

For example, when adding a new universal block to the page, make sure you understand the impact on standard metrics like “clicks on web results” before diving into the custom metrics about this new result.

Measure Twice, Or More

Especially if you are trying to capture a new phenomenon, try to measure the same underlying thing in multiple ways. Then, determine whether these multiple measurements are consistent.

By using multiple measurements, you can identify bugs in measurement or logging code, unexpected features of the underlying data, or filtering steps that are important. It’s even better if you can use different data sources for the measurements.

Check For Reproducibility

Both slicing and consistency over time are particular examples of checking for reproducibility. If a phenomenon is important and meaningful, you should see it across different user populations and time periods. But verifying reproducibility means more than performing these two checks. If you are building models of the data, you want those models to be stable across small perturbations in the underlying data.

If a model is not reproducible, you are probably not capturing something fundamental about the underlying process that produced the data.
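
One way to probe stability under small perturbations is to refit the model on resampled data and see how far the fitted parameters move. The sketch below assumes bootstrap resampling and a one-variable least-squares slope as the "model"; the guide names neither, and the data and function names are hypothetical.

```python
import random
import statistics

def slope(points):
    """Least-squares slope of y on x."""
    mean_x = statistics.mean(x for x, _ in points)
    mean_y = statistics.mean(y for _, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

def bootstrap_slopes(points, rounds=200, seed=0):
    """Refit the slope on resampled data to see how much it moves."""
    rng = random.Random(seed)
    return [slope(rng.choices(points, k=len(points))) for _ in range(rounds)]

# Hypothetical data: y is roughly 2 * x with a small alternating wobble.
points = [(x, 2 * x + (0.5 if x % 2 else -0.5)) for x in range(20)]
slopes = bootstrap_slopes(points)
spread = max(slopes) - min(slopes)

print(f"full-data slope: {slope(points):.3f}")
print(f"bootstrap range: {min(slopes):.3f} to {max(slopes):.3f} (spread {spread:.3f})")
```

A narrow range across resamples suggests the fit reflects the data as a whole; a wide one suggests it depends on a handful of points.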

Check For Consistency With Past Measurements

Often you will be calculating a metric that is similar to things that have been counted in the past. You should compare your metrics to metrics reported in the past, even if these measurements are on different user populations.

You do not need to get an exact agreement, but you should be in the same ballpark. If you are not, assume that you are wrong until you can fully convince yourself. Most surprising data will turn out to be an error, not a fabulous new insight.

New Metrics Should Be Applied To Old Data/Features First

If you create new metrics (possibly by gathering a novel data source) and try to learn something new, you won’t know if your new metric is right. With new metrics, you should first apply them to a known feature or data.

If you have a new metric for where users direct their attention on the page, make sure it matches what we know from eye-tracking or rater studies about how images affect page attention. This validates the metric before you use it to learn something new.

Looking For Evidence

Typically, data analysis for a complex problem is iterative. You will discover anomalies, trends, or other features of the data. Naturally, you will develop theories to explain this data. Don’t just develop a theory and proclaim it to be true. Look for evidence (inside or outside the data) to confirm/deny this theory.

A Story To Tell

Good data analysis will have a story to tell. To make sure it’s the right story, you need to tell the story to yourself, then look for evidence that it’s wrong. One way of doing this is to ask yourself, “What experiments would I run that would validate/invalidate the story I am telling?” Even if you don’t/can’t do these experiments, it may give you ideas on how to validate with the data that you do have.

Exploratory Analysis Benefits From End-To-End Iteration

When doing exploratory analysis, perform as many iterations of the whole analysis as possible. Typically you will have multiple steps of signal gathering, processing, modeling, etc. If you spend too long getting the very first stage of your initial signals perfect, you are missing out on opportunities to do more iterations in the same amount of time.

Further, when you finally look at your data at the end, you may make discoveries that change your direction. Therefore, your initial focus should not be on perfection but on getting something reasonable all the way through.

Watch Out For Feedback

We typically define various metrics around user success.

If you feed one of these metrics back into your system, you cannot use that same metric as a basis for evaluating your change. If you show more ads that get more clicks, you cannot use “more clicks” as a basis for deciding that users are happier, even though “more clicks” often means “happier.” Further, you should not even slice on the variables that you fed back and manipulated, as that will result in mix shifts that will be difficult or impossible to understand.

Skeptic And Champion At One Go

As you work with data, you must become both the champion of the insights you are gaining and a skeptic of them. You will hopefully find some interesting phenomena in the data you look at. When you detect an interesting phenomenon, ask yourself the following questions:

  • What other data could I gather to show how awesome this is?
  • What could I find that would invalidate this?

Data Analysis Starts With Questions, Not Data Or A Technique

There’s always a motivation to analyze data. Formulating your needs as questions or hypotheses helps ensure that you are gathering the data you should be gathering and that you are thinking about the possible gaps in the data. Of course, the questions you ask should evolve as you look at the data. However, analysis without a question will end up aimless.

Avoid the trap of finding some favorite technique and then only finding the parts of problems that this technique works on. Again, creating clear questions will help you avoid this trap.

Correlation and Causation

When making theories about data, we often want to assert that "X causes Y" (for example, "the page getting slower caused users to click less"). You cannot establish causation from correlation alone. By considering how you would validate a theory of causation, you can usually develop a good sense of how credible a causal theory is.

Sometimes, people try to hold on to a correlation as meaningful by asserting that even if there is no causal relationship between A and B, there must be something underlying the coincidence so that one signal can be a good indicator or proxy for the other. Be wary of this reasoning: given enough experiments and enough dimensions, some signals will line up by chance.

Share With Peers First, External Consumers Second

The previous points suggested some ways to get yourself to do the right kinds of soundness checking and validation. But sharing with a peer is one of the best ways to force yourself to do all these things. A skilled peer can provide qualitatively different feedback than the consumers of your data can. Peers are useful at multiple points through the analysis.

Early on you can find out about gotchas your peer knows about, suggestions for things to measure, and past research in this area. Near the end, peers are very good at pointing out oddities, inconsistencies, or other confusions.

CURATED BY

anty
