The Essentials of Machine Learning Data Curation - Deepstash

Curated from: artiba.org


Start with Datasets

Data is the new oil, and just as oil must be refined before it is useful, data must be curated. The power of your machine learning models depends heavily on the quality of your data.

Mistake - Not prioritizing data curation

As AI adoption across industries gathers pace, ML engineers face a sad reality: once stakeholders identify a use case with proven ROI, they are eager to jump aboard the AI ship, and data curation is not given its due importance. In a survey of 150 machine learning engineers at large companies, 41 percent said their data is too siloed.

Types of Datasets for Machine Learning

ML engineers depend on data during each step of their AI journey – from model selection, training, and tuning to testing. These datasets usually fall under three categories:

  1. Training sets
  2. Testing sets
  3. Validation sets

Training Data

The training data set is used to train the algorithm: the model learns patterns from it and then applies them to produce results. Around 60 percent of the data is typically used for training.

Testing Data

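Testing data is used to evaluate the trained model on examples it has not seen before. Training data is not reused for testing because the model has already learned from it and would simply reproduce the expected output, which makes the result look better than it is. The testing data set comprises 20 percent of the total data.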

Validation Data

Validation data is used to select and tune the ML model, for example when comparing candidate models or adjusting hyperparameters.
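The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `split_dataset` is hypothetical, and the 20 percent validation share is an assumption (the text gives 60 percent for training and 20 percent for testing, so validation is taken as the remainder).

```python
import random

def split_dataset(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and split them into training, testing, and validation sets.

    Whatever is left after the training and testing shares becomes the
    validation set (an assumption of this sketch).
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

train, test, validation = split_dataset(range(100))
print(len(train), len(test), len(validation))
```

Shuffling before slicing matters when the source data is ordered (for example by date or by class), because otherwise one of the three sets could end up unrepresentative.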

How to start curating

The process of curating datasets for machine learning starts well before you obtain any data. Here’s what we suggest:

  • Identify the goal of your AI project
  • Identify what dataset you will need to solve the problem
  • Make a record of your assumptions while selecting the data
  • Aim for collecting diverse and meaningful data from both external and internal resources
  • Build a dataset that is hard for your competitors to copy

Small Dataset = use pre-trained model

If you have a small dataset, using a model pre-trained on large datasets can be a good idea. You can use your small dataset to fine-tune it.
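The idea can be shown in miniature with a toy linear model. This is only a sketch of the principle (start from existing parameters and continue training on the small dataset with a small learning rate); real fine-tuning would normally use a deep learning framework and an actual pre-trained model. The "pre-trained" parameters and the small dataset below are made up for illustration, and `fine_tune` is a hypothetical helper.

```python
def predict(weights, bias, x):
    """Linear model: weighted sum of the features plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def mean_squared_error(weights, bias, data):
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def fine_tune(weights, bias, data, learning_rate=0.01, epochs=200):
    """Continue training from the given parameters with plain gradient descent."""
    weights = list(weights)  # do not modify the caller's parameters
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in data:
            error = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2 * error * xi / len(data)
            grad_b += 2 * error / len(data)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

# Hypothetical parameters standing in for a model pre-trained on a large dataset.
pretrained_weights, pretrained_bias = [2.0, -1.0], 0.5

# A small task-specific dataset of (features, target) pairs, invented for this sketch.
small_dataset = [([1.0, 0.0], 2.8), ([0.0, 1.0], -0.4),
                 ([1.0, 1.0], 1.9), ([2.0, 1.0], 4.1)]

before = mean_squared_error(pretrained_weights, pretrained_bias, small_dataset)
tuned_weights, tuned_bias = fine_tune(pretrained_weights, pretrained_bias, small_dataset)
after = mean_squared_error(tuned_weights, tuned_bias, small_dataset)
print(f"MSE before fine-tuning: {before:.4f}, after: {after:.4f}")
```

Because training starts from parameters that are already close to a good solution, a few small updates are enough to adapt the model, which is why this approach suits small datasets.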

Steps

  1. Formatting: The data often arrives in different formats, and formatting brings it together into one consistent layout. For example, customer data can come with different currencies, languages, etc. These need to be converted to a single format.
  2. Labeling: Labeling is done to ensure the data set works for your model. For example, a self-driving car will need data labeled as pictures of cars, pedestrians, street signs, footpaths, etc.
  3. Data Cleaning: Unwanted characters are removed and missing values are dealt with.
  4. Feature extraction: A number of features are analyzed and optimized. Features that are important for prediction are selected for quicker computation and less memory consumption.
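Three of the steps above (formatting, cleaning, and a very simple form of feature selection) can be sketched on a tiny customer table. Everything here is an illustrative assumption: the records, the exchange rates, the helper names, the choice of the median to fill missing amounts, and the rule of dropping columns that never vary. Labeling is left out because it usually requires human annotation.

```python
import re
import statistics

# Hypothetical exchange rates to USD, fixed here only for illustration.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "GBP": 1.5}

# Made-up raw records with mixed currencies, stray characters, and a missing value.
raw_records = [
    {"name": " Alice ", "amount": "100", "currency": "usd", "country": "US"},
    {"name": "Bob##", "amount": "200", "currency": "EUR", "country": "US"},
    {"name": "Carol", "amount": None, "currency": "GBP", "country": "US"},
]

def format_record(record):
    """Formatting: express every amount in one currency (USD)."""
    currency = record["currency"].upper()
    amount = record["amount"]
    amount_usd = None if amount is None else float(amount) * RATES_TO_USD[currency]
    return {"name": record["name"], "amount_usd": amount_usd, "country": record["country"]}

def clean_records(records):
    """Cleaning: strip unwanted characters and fill missing amounts with the median."""
    known = [r["amount_usd"] for r in records if r["amount_usd"] is not None]
    fill_value = statistics.median(known)
    cleaned = []
    for r in records:
        name = re.sub(r"[^A-Za-z ]", "", r["name"]).strip()
        amount = fill_value if r["amount_usd"] is None else r["amount_usd"]
        cleaned.append({**r, "name": name, "amount_usd": amount})
    return cleaned

def drop_constant_features(records):
    """Feature selection: drop columns whose value is the same in every record."""
    keep = [key for key in records[0] if len({r[key] for r in records}) > 1]
    return [{key: r[key] for key in keep} for r in records]

formatted = [format_record(r) for r in raw_records]
cleaned = clean_records(formatted)
selected = drop_constant_features(cleaned)
print(selected)
```

Dropping a constant column is the simplest case of keeping only features that can help prediction; a column with the same value everywhere carries no information and only costs memory and computation.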

IDEAS CURATED BY

sabin
