PIRATES

 1.0 Data Exploration and Feature Engineering


 1.1 Data Gathering

The moment we get our hands on the dataset, we explore it. This includes noting the number and types of features, computing some descriptive statistics, and so on.
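A minimal sketch of this first exploration pass, using only the standard library; the records and field names ("skills", "distance_travelled") are illustrative stand-ins for the real dataset.

```python
import statistics

# Toy stand-in for the volunteer dataset.
volunteers = [
    {"skills": ["Knows swimming"], "distance_travelled": 10},
    {"skills": ["Certified in CPR"], "distance_travelled": 200},
    {"skills": [], "distance_travelled": 45},
]

# How many records, and which features do we have?
print(len(volunteers), sorted(volunteers[0].keys()))

# Descriptive statistics on a numerical feature.
distances = [v["distance_travelled"] for v in volunteers]
print(min(distances), max(distances),
      statistics.mean(distances), statistics.median(distances))
```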

 

1.2 Feature Engineering

Feature Engineering 1.jpg

This section elaborates on the “Given a set of volunteers and a current disaster” part of the problem statement.

Feature Engineering is a blanket term for a set of activities performed on the features of a dataset to make them more suitable for modeling. Efficient execution of any ML project depends heavily on how key aspects of the problem statement are modeled and represented as arrays of numbers (i.e., vectors).

 

1.2.1 Feature Construction

Our first task is to find a numerical representation of volunteers and disasters that is representative without being redundant. Volunteers and disasters will be treated as vectors, and each feature we pick will form one dimension of these vectors.

Feature Construction.jpg
 

Sample volunteer features

 

| Feature | Type | Example values |
| --- | --- | --- |
| Skills | Categorical | An array of: “Knows swimming”, “Certified in CPR”, etc. |
| Distance travelled for previous assignments | Numerical | 10 miles, 200 miles, etc. |
| Number of disasters served in the past | Numerical | 1, 10, 30, etc. |
| Disaster types served in the past | Categorical | An array of: “National”, “Regional”, “Fire related”, “Water related”, “Disease related”, etc. |

 

Sample disaster features

 

| Feature | Type | Example values |
| --- | --- | --- |
| Type | Categorical | “National”, “Regional”, “Fire related”, “Water related”, “Disease related”, etc. |
| Number of volunteers assigned | Numerical | 10, 25, 100, etc. |
| Date of occurrence | Numerical (Unix timestamp) | 1556419662, 1556418774, etc. |
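A sketch of how a single volunteer record could become a vector, with each feature from the sample list forming one (or more) dimensions. Multi-valued categorical features are multi-hot encoded; the category lists and record field names here are assumptions for illustration.

```python
SKILLS = ["Knows swimming", "Certified in CPR"]
DISASTER_TYPES = ["National", "Regional", "Fire related",
                  "Water related", "Disease related"]

def volunteer_vector(v):
    # Multi-hot encode the categorical array features: one dimension per
    # possible value, then append the numerical features as-is.
    vec = [1.0 if s in v["skills"] else 0.0 for s in SKILLS]
    vec += [1.0 if t in v["types_served"] else 0.0 for t in DISASTER_TYPES]
    vec.append(float(v["distance_travelled"]))
    vec.append(float(v["num_disasters"]))
    return vec

v = {"skills": ["Knows swimming"],
     "types_served": ["Regional", "Fire related"],
     "distance_travelled": 200, "num_disasters": 10}
print(volunteer_vector(v))  # a 9-dimensional vector
```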

 

 

1.2.2 Feature Reduction

Feature Reduction 2.png

There may be many features which are representative but of no consequence to our problem statement. For example, a volunteer’s primary language might be highly representative of that volunteer, yet of no consequence to our problem statement. Eliminating such features leaves downstream algorithms with fewer weights and biases to learn, and hence makes them a lot more efficient.

 

Can two or more previously selected features be combined so that a feature can be eliminated? This, again, reduces the load on downstream algorithms. Many times we will choose to combine two (or more) features into one, or break one feature up into several. But care will be taken never to introduce redundancy among the features.
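As a sketch of combining features, two of the sample volunteer features could be folded into a single "average distance per disaster" feature, with the originals dropped so no redundancy remains. The feature names are illustrative, not a committed design.

```python
def combine_features(v):
    out = dict(v)
    # Fold total distance and disaster count into one derived feature.
    out["avg_distance_per_disaster"] = (
        v["distance_travelled"] / max(v["num_disasters"], 1))
    # Drop the originals so the combined feature is not redundant with them.
    del out["distance_travelled"]
    del out["num_disasters"]
    return out

print(combine_features({"distance_travelled": 200, "num_disasters": 10}))
```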

 

1.2.3 Feature Scaling

Does a feature have well-defined classes (in the case of categorical features) or well-defined minimum and maximum values (in the case of numerical features)?

All numerical features need to be scaled so that ML algorithms get a consistent view of them. For example, a feature like “number of previous deployments” can range from 1 to 50, while a feature like “total distance travelled for an assignment” might range from 10 to 1000 miles. All such features need to be scaled to a value between 0 and 1 before being fed to any ML algorithm.
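This scaling step can be sketched as simple min-max normalization, mapping each numerical feature onto [0, 1] using its observed minimum and maximum:

```python
def min_max_scale(values):
    # Map each value to (x - min) / (max - min), i.e. into [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale([10, 200, 1000]))  # distances in miles
```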

Most categorical features will be involved in probability distributions, and to calculate these probabilities properly we will need to know all the possible values a feature can take. For example, if disaster type is a feature, we will need to know exactly how many types there are and what their values are.
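One way to enforce this is to encode each categorical feature against its full, known set of values and reject anything outside it; the disaster-type list here is taken from the sample features above.

```python
DISASTER_TYPES = ["National", "Regional", "Fire related",
                  "Water related", "Disease related"]

def one_hot(value, categories=DISASTER_TYPES):
    # Encoding fails loudly on unknown values, so an incomplete category
    # list is caught early rather than silently skewing probabilities.
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]

print(one_hot("Fire related"))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```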

 

1.3 Additional Data Gathering

For the features we have picked, we will have to think about the following:

  1. Which features do we already have data for?

  2. Which features can we gather data for?

  3. Which features can we not gather data for, and hence have to drop?

This is again a collaborative process between Pirates and ARC.

Data scaling.jpeg

Pirates will be helping out in tasks such as:

  1. Creating surveys to gather data from volunteers

  2. Scripts to anonymize collected data and save it in backend data stores

  3. Putting up simple web pages for people to enter data collected offline

  4. Going through legacy databases to extract information relevant to the current project
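The anonymization task in the list above could take the following shape: replace personally identifying fields with a salted hash before the record is saved. The field names and salt value are assumptions for illustration; in practice the salt would come from secure configuration, not source code.

```python
import hashlib

SALT = "project-specific-secret"  # placeholder; load from secure config

def anonymize(record):
    out = dict(record)
    # Derive a stable pseudonymous id from the identifying field, then
    # drop the identifying field itself before storage.
    digest = hashlib.sha256((SALT + record["email"]).encode()).hexdigest()
    out["volunteer_id"] = digest[:12]
    del out["email"]
    return out

rec = {"email": "a@example.org", "skills": ["Certified in CPR"]}
print(anonymize(rec))
```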