Benchmark models
Detailed descriptions of the datasets for benchmark models.
Dataset Information:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
Attribute information (attributes may include empty or missing values):
1. PassengerId – Record ID
2. Survived – Passenger status (1 – passenger survived the disaster, 0 – passenger did not survive) (TARGET VARIABLE)
3. Pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
4. Name - Passenger's name
5. Sex – Passenger's sex (male or female)
6. Age - Passenger's age in years (fractional if less than 1; an estimated age is given in the form xx.5)
7. Sibsp – Number of siblings / spouses aboard the Titanic
The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
8. Parch - Number of parents / children aboard the Titanic
The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
9. Ticket - Passenger's ticket number
10. Fare – Fare the passenger paid for the ticket
11. Cabin – Passenger's cabin number
12. Embarked – Port where the passenger embarked
Task description:
Predict what sorts of people were likely to survive (Survived column). It is a binary classification problem. Target metric – Accuracy.
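The target metric can be illustrated with a minimal sketch. The rows below are made up for illustration (they are not taken from the Kaggle file), and the rule "predict survival for female passengers" is only a hypothetical baseline suggested by the observation that women were more likely to survive; it is not the benchmark model itself.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def sex_baseline(rows):
    """Hypothetical baseline: predict 1 (survived) for female passengers, else 0."""
    return [1 if row["Sex"] == "female" else 0 for row in rows]


# Made-up rows that reuse the column names from the attribute list.
rows = [
    {"PassengerId": 1, "Sex": "male", "Survived": 0},
    {"PassengerId": 2, "Sex": "female", "Survived": 1},
    {"PassengerId": 3, "Sex": "female", "Survived": 0},
    {"PassengerId": 4, "Sex": "male", "Survived": 0},
]

y_true = [row["Survived"] for row in rows]
y_pred = sex_baseline(rows)
print(accuracy(y_true, y_pred))  # 3 of 4 predictions match -> 0.75
```

Accuracy treats both classes equally, so a simple baseline like this one is a useful reference point when judging a trained model.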

Source:
https://www.kaggle.com/c/titanic/data
Dataset Information:
The dataset contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with reference gas concentrations from a certified analyzer. The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in a significantly polluted area, at road level. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of responses from field-deployed air quality chemical sensor devices. Ground-truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx), and Nitrogen Dioxide (NO2) were provided by a co-located certified reference analyzer. Evidence of cross-sensitivities, as well as of both concept and sensor drift, is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), and eventually affects the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
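Because -200 is a sentinel rather than a real measurement, it should be masked before any statistics are computed. The following is a minimal sketch of one way to do that; the helper names and the sample readings are assumptions made for illustration.

```python
MISSING_SENTINEL = -200  # value the source dataset uses to tag missing readings


def clean_readings(values, sentinel=MISSING_SENTINEL):
    """Replace sentinel-tagged readings with None so they are not treated as data."""
    return [None if v == sentinel else v for v in values]


def mean_ignoring_missing(values):
    """Average of the readings that are present; None if every reading is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Made-up CO(GT) readings in mg/m^3; -200 marks hours without a reference value.
co_gt = [2.0, -200, 3.0, 1.0, -200]
cleaned = clean_readings(co_gt)
print(cleaned)                         # [2.0, None, 3.0, 1.0, None]
print(mean_ignoring_missing(cleaned))  # 2.0
```

Averaging the raw list instead would give -78.8, which shows how strongly an unmasked sentinel can distort results.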
Attribute Information of the source dataset:
1. CO(GT) - True hourly averaged concentration CO in mg/m^3 (reference analyzer)
2. PT08.S1(CO) - Tin oxide. Hourly averaged sensor response (nominally CO targeted)
3. NMHC(GT) - True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)
4. C6H6(GT) - True hourly averaged Benzene concentration in microg/m^3 (reference analyzer) (TARGET VARIABLE)
5. PT08.S2(NMHC) - Titania. Hourly averaged sensor response (nominally NMHC targeted)
6. NOx(GT) - True hourly averaged NOx concentration in ppb (reference analyzer)
7. PT08.S3(NOx) - Tungsten oxide. Hourly averaged sensor response (nominally NOx targeted)
8. NO2(GT) - True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
9. PT08.S4(NO2) - Tungsten oxide. Hourly averaged sensor response (nominally NO2 targeted)
10. PT08.S5(O3) - Indium oxide. Hourly averaged sensor response (nominally O3 targeted)
11. T - Temperature in °C
12. RH - Relative Humidity (%)
13. AH - Absolute Humidity
Task description:
Predict Benzene concentration (C6H6(GT) column) based on the concentration of other recorded compounds. It is a regression problem. Target metric – MAE (Mean Absolute Error).
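MAE is the average of the absolute differences between predicted and true values, expressed in the same unit as the target. A minimal sketch, using made-up benzene concentrations rather than values from the dataset:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute MAE of an empty sample")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up C6H6(GT) values in microg/m^3 and hypothetical model predictions.
actual = [10.0, 8.0, 12.0, 6.0]
predicted = [9.0, 8.5, 14.0, 6.5]
print(mean_absolute_error(actual, predicted))  # (1 + 0.5 + 2 + 0.5) / 4 = 1.0
```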

Source:
https://archive.ics.uci.edu/ml/datasets/Air+quality
Housing Values in Suburbs of Boston. The Boston Housing Dataset contains house prices for various areas of Boston. Alongside price, the dataset also provides information such as the per capita crime rate (CRIM), the proportion of non-retail business acres in the town (INDUS), the proportion of owner-occupied units built prior to 1940 (AGE), and other attributes.
Sources:
(a) Origin: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
(b) Creator: Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
(c) Date: July 7, 1993
Past Usage:
Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261.
Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Relevant Information:
Concerns housing values in suburbs of Boston.
Number of Instances:
506
Number of Attributes:
13 continuous attributes (including "class" attribute "MEDV"), 1 binary-valued attribute.
Attribute Information:
1. id – Record ID
2. crim - Per capita crime rate by town
3. zn - Proportion of residential land zoned for lots over 25,000 sq.ft.
4. indus - Proportion of non-retail business acres per town
5. chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
6. nox - Nitric oxides concentration (parts per 10 million)
7. rm - Average number of rooms per dwelling
8. age - Proportion of owner-occupied units built prior to 1940
9. dis - Weighted distances to five Boston employment centres
10. rad - Index of accessibility to radial highways
11. tax - Full-value property-tax rate per $10,000
12. ptratio - Pupil-teacher ratio by town
13. black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
14. lstat - % lower status of the population
15. medv - Median value of owner-occupied homes in $1000's (TARGET VARIABLE)
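The transform given for attribute 13 can be checked with a short worked example. This is only a sketch of the arithmetic in the formula 1000(Bk - 0.63)^2; the function name and the sample proportions are assumptions made for illustration.

```python
def black_feature(bk):
    """Apply the transform from the attribute list: 1000 * (Bk - 0.63)^2."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("Bk is a proportion and must lie in [0, 1]")
    return 1000 * (bk - 0.63) ** 2


# Made-up proportions chosen to show the shape of the transform.
for bk in (0.63, 0.53, 0.73):
    print(bk, round(black_feature(bk), 6))
# 0.63 maps to 0.0, while 0.53 and 0.73 both map to 10.0 after rounding
```

Because the formula squares the distance from 0.63, two different proportions can produce the same value, so the original Bk cannot be recovered uniquely from this column.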
Missing Attribute Values:
None.
Task description:
Predict the median value of owner-occupied homes in $1000's (medv column) based on the other attributes. It is a regression problem. Target metric – RMSE (Root Mean Squared Error).
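RMSE is the square root of the mean squared difference between predicted and true values, so large errors weigh more heavily than under a plain average of absolute errors. A minimal sketch with made-up medv values (in $1000's) and hypothetical predictions:

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute RMSE of an empty sample")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up medv values in $1000's; every prediction is off by exactly 2.
actual = [24.0, 21.0, 30.0, 18.0]
predicted = [22.0, 23.0, 28.0, 20.0]
print(root_mean_squared_error(actual, predicted))  # sqrt(4.0) = 2.0
```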
Dataset Information:
The dataset covers active (operational) private schools in Dubai and includes curriculum, school ratings, total student enrollment, and capacity per academic year.
Attribute Information of the source dataset:
1. School Name – Name of the school
2. Location – School location
3. Type of School - Type of school (Iranian, UK, MOE, US, Indian, IB, German, Pakistani, French, Japanese, Canadian, Russian, Other, SABIS, Phillippine, Phillipine)
4. 2010/11 Enrolments - Number of students in the 2010-2011 academic year
5. 2011/12 Enrolments - Number of students in the 2011-2012 academic year
6. 2012/13 Enrolments - Number of students in the 2012-2013 academic year
7. 2013/14 Enrolments - Number of students in the 2013-2014 academic year
8. 2014/15 Enrolments - Number of students in the 2014-2015 academic year
9. 2015/16 Enrolments - Number of students in the 2015-2016 academic year
10. 2008/09 DSIB Rating - Dubai School Inspection Bureau (DSIB) rating in the 2008-2009 academic year (values: 0, 2, 3, 5)
11. 2009/10 DSIB Rating - Dubai School Inspection Bureau (DSIB) rating in the 2009-2010 academic year (values: 0, 2, 3, 5)
12. 2010/11 DSIB Rating - Dubai School Inspection Bureau (DSIB) rating in the 2010-2011 academic year (values: 0, 2, 3, 5)
13. 2011/12 DSIB Rating - Dubai School Inspection Bureau (DSIB) rating in the 2011-2012 academic year (values: 0, 2, 3, 5)
14. 2012/13 DSIB Rating - Dubai School Inspection Bureau (DSIB) rating in the 2012-2013 academic year (values: 0, 2, 3, 5)
15. 2013/14 DSIB Rating - Dubai School Inspection Bureau (DSIB) rating in the 2013-2014 academic year (values: 0, 2, 3, 5)
16. 2014/15 DSIB Rating - Dubai School Inspection Bureau (DSIB) rating in the 2014-2015 academic year (values: 0, 2, 3, 5)
17. Target – School rating in the 2015-2016 academic year (values: 1, 2, 3, 4, 5) (TARGET VARIABLE)
Task description:
Predict school rating ('Target' column after feature engineering) based on the other attributes. It is a multiclass classification problem. Target metric – Accuracy.
Dataset Information:
The dataset contains employee attrition data. It was created to uncover the factors that lead to employee attrition and to explore questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional dataset created by IBM data scientists.
Attribute information (attributes 1-7 are categorical, with the encodings shown):
1. Education - (1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor')
2. EnvironmentSatisfaction - (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
3. JobInvolvement - (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
4. JobSatisfaction - (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
5. PerformanceRating - (1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding')
6. RelationshipSatisfaction - (1 'Low' 2 'Medium' 3 'High' 4 'Very High')
7. WorkLifeBalance – (1 'Bad' 2 'Good' 3 'Better' 4 'Best')
8. Age - Employee's age in years (18-60)
9. Attrition - Fact of attrition (Yes or No) (TARGET VARIABLE)
10. BusinessTravel - How often the employee travels (Non, Rarely, Frequently)
11. DailyRate - Daily salary
12. Department - Department where employee works
13. DistanceFromHome - Distance from home to office
14. EducationField - Field of education
15. Gender - Employee's gender
16. HourlyRate - Hourly salary
17. JobLevel - Position level
18. JobRole - Position role
19. MaritalStatus - Marital status (Married, Single, Divorced)
20. MonthlyIncome - Monthly Salary
21. MonthlyRate - Monthly Rate
22. NumCompaniesWorked - Number of companies where the employee previously worked
23. OverTime - "YES" if the employee works overtime, "NO" if not
24. PercentSalaryHike - Percentage increase in salary between years
25. StockOptionLevel - How many shares the employee owns in the company where they currently work
26. TotalWorkingYears - Total number of years the employee has worked
27. TrainingTimesLastYear - Total training time in the last year
28. YearsAtCompany - Number of years the employee has worked at the current company
29. YearsInCurrentRole - Number of years the employee has worked in the current role
30. YearsSinceLastPromotion - Number of years since the last promotion
31. YearsWithCurrManager - Years spent with current manager
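The integer codes listed for the first seven attributes can be translated back to their labels with simple lookup tables. The sketch below covers two of them and also maps the Attrition target to 0/1; the function names and the sample record are assumptions made for illustration, while the code-to-label pairs come from the list above.

```python
# Code-to-label pairs copied from the attribute list (two attributes shown).
ORDINAL_LABELS = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"},
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}


def decode_record(record, labels=ORDINAL_LABELS):
    """Return a copy of the record with known integer codes replaced by their labels."""
    decoded = dict(record)
    for column, mapping in labels.items():
        if column in decoded:
            decoded[column] = mapping[decoded[column]]
    return decoded


def encode_attrition(value):
    """Map the Attrition target to 1 (Yes) or 0 (No)."""
    normalized = value.strip().lower()
    if normalized not in ("yes", "no"):
        raise ValueError(f"unexpected Attrition value: {value!r}")
    return 1 if normalized == "yes" else 0


# Made-up employee record that reuses the column names from the list.
employee = {"Age": 34, "Education": 3, "WorkLifeBalance": 2, "Attrition": "Yes"}
decoded = decode_record(employee)
print(decoded["Education"], decoded["WorkLifeBalance"])  # Bachelor Good
print(encode_attrition(employee["Attrition"]))           # 1
```

Keeping the codes as integers preserves their order for modeling, while the decoded labels are easier to read in reports.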
Task description:
Predict employee attrition (Attrition column). It is a binary classification problem. Target metric – Accuracy.