PIRATES

 1.0 Data Exploration and Feature Engineering


 1.1 Data Gathering

The moment we get our hands on the dataset, we explore it. This includes noting the number and types of features, computing some descriptive statistics, and so on.
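A minimal sketch of this first exploration pass, using only the standard library; the records and field names ("skills", "distance_travelled") are illustrative stand-ins for the real dataset.

```python
import statistics

# Toy stand-in for the volunteer dataset.
volunteers = [
    {"skills": ["Knows swimming"], "distance_travelled": 10},
    {"skills": ["Certified in CPR"], "distance_travelled": 200},
    {"skills": [], "distance_travelled": 45},
]

# How many records, and which features do we have?
print(len(volunteers), sorted(volunteers[0].keys()))

# Descriptive statistics on a numerical feature.
distances = [v["distance_travelled"] for v in volunteers]
print(min(distances), max(distances),
      statistics.mean(distances), statistics.median(distances))
```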

 

1.2 Feature Engineering

Feature Engineering 1.jpg

This section elaborates on the “Given a set of volunteers and a current disaster” part of the problem statement.

Feature Engineering is a blanket term for a set of activities performed on the features of a dataset to make them more suitable for modeling. Efficient execution of any ML project depends heavily on how key aspects of the problem statement are modeled and represented as arrays of numbers (i.e., vectors).

 

1.2.1 Feature Construction

Our first task is to find a numerical representation of volunteers and disasters that is representative without being redundant. Volunteers and disasters will be treated as vectors, and each feature we pick will form one dimension of these vectors.

Feature Construction.jpg
 

Sample volunteer features

 

| Feature | Type | Example values |
| --- | --- | --- |
| Skills | Categorical | An array of: “Knows swimming”, “Certified in CPR”, etc. |
| Distance travelled for previous assignments | Numerical | 10 miles, 200 miles, etc. |
| Number of disasters served in the past | Numerical | 1, 10, 30, etc. |
| Disaster types served in the past | Categorical | An array of: “National”, “Regional”, “Fire related”, “Water related”, “Disease related”, etc. |

 

Sample disaster features

 

| Feature | Type | Example values |
| --- | --- | --- |
| Type | Categorical | “National”, “Regional”, “Fire related”, “Water related”, “Disease related”, etc. |
| Number of volunteers assigned | Numerical | 10, 25, 100, etc. |
| Date of occurrence | Numerical (Unix timestamp) | 1556419662, 1556418774, etc. |
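A sketch of how a single volunteer record could become a vector, with each feature from the sample list forming one (or more) dimensions. Multi-valued categorical features are multi-hot encoded; the category lists and record field names here are assumptions for illustration.

```python
SKILLS = ["Knows swimming", "Certified in CPR"]
DISASTER_TYPES = ["National", "Regional", "Fire related",
                  "Water related", "Disease related"]

def volunteer_vector(v):
    # Multi-hot encode the categorical array features: one dimension per
    # possible value, then append the numerical features as-is.
    vec = [1.0 if s in v["skills"] else 0.0 for s in SKILLS]
    vec += [1.0 if t in v["types_served"] else 0.0 for t in DISASTER_TYPES]
    vec.append(float(v["distance_travelled"]))
    vec.append(float(v["num_disasters"]))
    return vec

v = {"skills": ["Knows swimming"],
     "types_served": ["Regional", "Fire related"],
     "distance_travelled": 200, "num_disasters": 10}
print(volunteer_vector(v))  # a 9-dimensional vector
```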

 

 

1.2.2 Feature Reduction

Feature Reduction 2.png

There may be many features which are representative but of no consequence to our problem statement. For example, a volunteer’s primary language might be highly representative of that volunteer, yet of no consequence to our problem statement. Eliminating such features leaves downstream algorithms with fewer weights and biases to learn, and hence makes them a lot more efficient.

 

Can two or more previously selected features be combined so that a feature can be eliminated? This, again, reduces the load on downstream algorithms. Many times we will choose to combine two (or more) features into one, or break one feature up into several. But care will be taken never to introduce redundancy among the features.
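As a sketch of combining features, two of the sample volunteer features could be folded into a single "average distance per disaster" feature, with the originals dropped so no redundancy remains. The feature names are illustrative, not a committed design.

```python
def combine_features(v):
    out = dict(v)
    # Fold total distance and disaster count into one derived feature.
    out["avg_distance_per_disaster"] = (
        v["distance_travelled"] / max(v["num_disasters"], 1))
    # Drop the originals so the combined feature is not redundant with them.
    del out["distance_travelled"]
    del out["num_disasters"]
    return out

print(combine_features({"distance_travelled": 200, "num_disasters": 10}))
```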

 

1.2.3 Feature Scaling

Does a feature have well-defined classes (in the case of categorical features) or well-defined minimum and maximum values (in the case of numerical features)?

All numerical features need to be scaled so that ML algorithms get a consistent view of them. For example, a feature like “number of previous deployments” can range from 1 to 50, while a feature like “total distance travelled for an assignment” might range from 10 to 1000 miles. All such features need to be scaled to a value between 0 and 1 before being fed to any ML algorithm.
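This scaling step can be sketched as simple min-max normalization, mapping each numerical feature onto [0, 1] using its observed minimum and maximum:

```python
def min_max_scale(values):
    # Map each value to (x - min) / (max - min), i.e. into [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale([10, 200, 1000]))  # distances in miles
```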

Most categorical features will be involved in probability distributions, and to calculate these probabilities properly we will need to know all the possible values a feature can take. For example, if disaster type is a feature, we will need to know exactly how many types there are and what their values are.
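One way to enforce this is to encode each categorical feature against its full, known set of values and reject anything outside it; the disaster-type list here is taken from the sample features above.

```python
DISASTER_TYPES = ["National", "Regional", "Fire related",
                  "Water related", "Disease related"]

def one_hot(value, categories=DISASTER_TYPES):
    # Encoding fails loudly on unknown values, so an incomplete category
    # list is caught early rather than silently skewing probabilities.
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]

print(one_hot("Fire related"))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```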

 

1.3 Additional Data Gathering

For the features we have picked, we will have to think about the following:

  1. Which features do we already have data for?

  2. Which features can we gather data for?

  3. Which features can we not gather data for, and hence have to drop?

This is again a collaborative process between Pirates and ARC.

Data scaling.jpeg

Pirates will be helping out in tasks such as:

  1. Creating surveys to gather data from volunteers

  2. Scripts to anonymize collected data and save it in backend data stores

  3. Putting up simple web pages for people to enter data collected offline

  4. Going through legacy databases to extract information relevant to the current project
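The anonymization task in the list above could take the following shape: replace personally identifying fields with a salted hash before the record is saved. The field names and salt value are assumptions for illustration; in practice the salt would come from secure configuration, not source code.

```python
import hashlib

SALT = "project-specific-secret"  # placeholder; load from secure config

def anonymize(record):
    out = dict(record)
    # Derive a stable pseudonymous id from the identifying field, then
    # drop the identifying field itself before storage.
    digest = hashlib.sha256((SALT + record["email"]).encode()).hexdigest()
    out["volunteer_id"] = digest[:12]
    del out["email"]
    return out

rec = {"email": "a@example.org", "skills": ["Certified in CPR"]}
print(anonymize(rec))
```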