Benchmark models
Detailed descriptions of the datasets for benchmark models.
Titanic
Dataset Information:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
Attribute information:
1. survival - Survival (0 = No, 1 = Yes)
2. pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
3. sex - Sex
4. age - Age in years
5. sibsp - # of siblings / spouses aboard the Titanic
6. parch - # of parents / children aboard the Titanic
7. ticket - Ticket number
8. fare - Passenger fare
9. cabin - Cabin number
10. embarked - Port of Embarkation
Variable Notes:
1. pclass: A proxy for socio-economic status (SES)
    1st = Upper
    2nd = Middle
    3rd = Lower
2. age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
3. sibsp: The dataset defines family relations in this way:
    Sibling = brother, sister, stepbrother, stepsister
    Spouse = husband, wife (mistresses and fiancés were ignored)
4. parch: The dataset defines family relations in this way:
    Parent = mother, father
    Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
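The age convention in the variable notes can be sketched in code. This is a minimal illustration only: the function name `describe_age` and its category labels are hypothetical, and treating `None` as a missing age is an assumption that the notes above do not state.

```python
from typing import Optional


def describe_age(age: Optional[float]) -> str:
    """Classify an age value using the conventions in the variable notes.

    The labels returned here are illustrative, not part of the dataset.
    """
    if age is None:
        # Assumption: a missing age is represented as None.
        return "missing"
    if age < 1:
        # Ages under one year are stored as fractions of a year.
        return "infant"
    if age % 1 == 0.5:
        # Estimated ages are stored in the form xx.5.
        return "estimated"
    return "recorded"


print(describe_age(0.42))  # infant
print(describe_age(28.5))  # estimated
print(describe_age(30.0))  # recorded
```

A flag of this kind could be kept as an extra feature, since an estimated age is less reliable than a recorded one.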
Task description:
Predict what sorts of people were likely to survive (Survived column).
It is a binary classification problem.

Source: https://www.kaggle.com/c/titanic/data
School
Dataset Information:

Data on active (operational) private schools in Dubai, including curriculum, school ratings, total student enrollment, and capacity per academic year.

Attribute Information of the source dataset:
1. name_eng The name of the Private School in English
2. educationcenterid The Education Center ID. In this dataset it is the Private school ID
3. academic_year The academic year for school inspection
4. curriculumen The school curriculum in English
5. grades_years The available School Grades
6. students Number of Students in Academic year
7. school_rating School Inspection Rating per academic year
8. current_capacity School Capacity in latest academic year
9. location The school location in English (AREA)
10. school_type The school type (Profit, non-profit, charity,…).

Task description:
Predict school rating ('Target' column after feature engineering) based on the other attributes.

Source: https://www.dubaipulse.gov.ae/data/khda-schools/khda_private_schools_erc-open?organisation=khda&service=khda-schools
HR
Dataset Information:

The dataset contains employee attrition data. It was created to uncover the factors that lead to employee attrition and to explore questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional dataset created by IBM data scientists.

Categorical attributes:

1. Education (1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor')
2. EnvironmentSatisfaction (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
3. JobInvolvement (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
4. JobSatisfaction (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
5. PerformanceRating (1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding')
6. RelationshipSatisfaction (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
7. WorkLifeBalance (1 'Bad' 2 'Good' 3 'Better' 4 'Best')
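The integer codes listed above can be turned back into readable labels with plain lookup tables. The sketch below is one possible way to do that. The helper `decode_record` and the sample record are hypothetical, and the code assumes each row is available as a dictionary keyed by column name.

```python
# Code-to-label tables copied from the categorical attribute list.
SATISFACTION = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}

LABELS = {
    "Education": {
        1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor",
    },
    "EnvironmentSatisfaction": SATISFACTION,
    "JobInvolvement": SATISFACTION,
    "JobSatisfaction": SATISFACTION,
    "PerformanceRating": {1: "Low", 2: "Good", 3: "Excellent", 4: "Outstanding"},
    "RelationshipSatisfaction": SATISFACTION,
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}


def decode_record(record: dict) -> dict:
    """Replace known integer codes with labels; leave everything else as-is."""
    return {
        column: LABELS[column].get(value, value) if column in LABELS else value
        for column, value in record.items()
    }


# Hypothetical row, invented for illustration only.
sample = {"Education": 3, "WorkLifeBalance": 1, "MonthlyIncome": 5000}
print(decode_record(sample))
# {'Education': 'Bachelor', 'WorkLifeBalance': 'Bad', 'MonthlyIncome': 5000}
```

Because the codes are ordered, keeping them as integers for modelling and decoding them only for reports is a reasonable default.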

Task description:
Predict employee attrition (Attrition column).

Source: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
Boston Housing Data
Dataset Information:

Title: Boston Housing Data

Sources:
  (a) Origin: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
  (b) Creator: Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
  (c) Date: July 7, 1993

Past Usage:
 -   Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261.
 -   Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Relevant Information:
Concerns housing values in suburbs of Boston.

Number of Instances: 506

Number of Attributes: 13 continuous attributes (including "class" attribute "MEDV"), 1 binary-valued attribute.

Attribute Information:
   1. CRIM per capita crime rate by town
   2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
   3. INDUS proportion of non-retail business acres per town
   4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
   5. NOX nitric oxides concentration (parts per 10 million)
   6. RM average number of rooms per dwelling
   7. AGE proportion of owner-occupied units built prior to 1940
   8. DIS weighted distances to five Boston employment centres
   9. RAD index of accessibility to radial highways
   10. TAX full-value property-tax rate per $10,000
   11. PTRATIO pupil-teacher ratio by town
   12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
   13. LSTAT % lower status of the population
   14. MEDV Median value of owner-occupied homes in $1000's
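Attribute 12 is a derived quantity, B = 1000(Bk - 0.63)^2, and a short sketch shows how it behaves. The function name `b_feature` and the range check are illustrative assumptions. Only the formula itself comes from the attribute list.

```python
def b_feature(bk: float) -> float:
    """Compute attribute B = 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        # Assumption: Bk is a proportion, so values outside [0, 1] are rejected.
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


# The transform is zero at Bk = 0.63 and grows as Bk moves away from it.
print(round(b_feature(0.63), 6))  # 0.0
print(round(b_feature(0.0), 6))   # 396.9

# Two different proportions can give the same B, so Bk cannot be
# recovered uniquely from B.
print(round(b_feature(0.53), 6) == round(b_feature(0.73), 6))  # True
```

Because the parabola is symmetric around 0.63, the value of B alone does not say on which side of 0.63 the original proportion lies.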

Missing Attribute Values: None.

Task description:
Predict the median value of owner-occupied homes in $1000's (MEDV column) based on the other attributes.

Source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing
AIR
Dataset Information:

The dataset contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with reference gas concentrations from a certified analyzer. The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of field-deployed air quality chemical sensor device responses. Ground-truth hourly averaged concentrations for CO, non-methane hydrocarbons, benzene, total nitrogen oxides (NOx), and nitrogen dioxide (NO2) were provided by a co-located certified reference analyzer. Evidence of cross-sensitivities as well as both concept and sensor drift is present, as described in De Vito et al., Sens. And Act. B, Vol. 129,2,2008 (citation required), possibly affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
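Because missing readings are tagged with -200 rather than left blank, they need to be masked before any statistics are computed. The following is a minimal sketch using only the standard library. The helper names and the sample readings are invented for illustration.

```python
MISSING_TAG = -200  # sentinel used by the dataset for missing values


def mask_missing(values):
    """Return a copy of the readings with the -200 sentinel replaced by None."""
    return [None if v == MISSING_TAG else v for v in values]


def mean_of_present(values):
    """Average the readings that are not missing; None if nothing is left."""
    present = [v for v in mask_missing(values) if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical hourly readings, not taken from the dataset.
readings = [2.0, -200, 4.0]
print(mask_missing(readings))     # [2.0, None, 4.0]
print(mean_of_present(readings))  # 3.0
```

Skipping this step would pull averages sharply downward, since -200 lies far below any physically plausible reading.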

Attribute Information of the source dataset:

1. Date (DD/MM/YYYY)
2. Time (HH.MM.SS)
3. True hourly averaged concentration CO in mg/m^3 (reference analyzer)
4. PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
5. True hourly averaged overall non-methane hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
6. True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
7. PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
8. True hourly averaged NOx concentration in ppb (reference analyzer)
9. PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
10. True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
11. PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
12. PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
13. Temperature in °C
14. Relative Humidity (%)
15. AH Absolute Humidity
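The Date and Time attributes use the formats DD/MM/YYYY and HH.MM.SS, with dots rather than colons in the time. A small sketch of combining them into one timestamp follows. The function name `parse_timestamp` and the sample values are assumptions made for illustration.

```python
from datetime import datetime


def parse_timestamp(date_str: str, time_str: str) -> datetime:
    """Combine a DD/MM/YYYY date and an HH.MM.SS time into one datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%d/%m/%Y %H.%M.%S")


# Hypothetical values in the documented formats.
stamp = parse_timestamp("10/03/2004", "18.00.00")
print(stamp.isoformat())  # 2004-03-10T18:00:00
```

Parsing with an explicit format string avoids the day and month being swapped, which can happen when a parser guesses the format.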

Task description:
Predict benzene concentration (C6H6(GT) column) based on the concentrations of the other recorded compounds.

Source: https://archive.ics.uci.edu/ml/datasets/Air+quality