Dataset Requirements

The requirements below are provided for training and test (holdout) datasets. Meeting these requirements will guarantee successful model training and prediction on the new data. We recommend using a text editor like Notepad++ to check the datasets.

Requirements for training datasets

A dataset must be a CSV file using UTF-8 or ISO-8859-1 encoding.

The file name must not contain the following characters: !/[+!@#$%^&*,. ?":{}\\/|<>()[]]

All feature values in the dataset must be numeric. For the classification task type, values of the target variable should start with 0.</span >

A dataset must not have any empty values or values which represent empty values like “NA”, “NAN”, etc.

A comma, semicolon, pipe, caret, or tab must be used as a separator. CRLF or LF should be used as the end-of-line character. The separator and end-of-line character should be consistent inside the dataset.

All column names (values in the CSV file header) must be unique and must contain only letters (a-z, A-Z), numbers (0-9), hyphens (-), or underscores (_).

For the classification task type, a training dataset must have a minimum of 2 classes of target variable with at least 20 samples provided for each class.

Currently, Neuton supports only the EN-US locale for numbers, so:

You must use a dot as a decimal separator, and delete spaces and commas typically used to separate every third digit in your numeric fields. For example – "20,000.00" should be replaced with "20000.00"

If any numeric column is represented as a combination of a number and its corresponding unit, then only the number should be placed in the column. For example – "$20,000.00" should be replaced with "20000.00"

Date/time columns must be in epoch time representation or relative date format. For example, “10/18/2017” should be represented as “1508284800”.

In test datasets, the same timestamp format should be used in a manner consistent with the training dataset.

End-of-line symbols must be excluded from the field values.

In the case of sensor data from gyroscopes, accelerometers, magnetometers, electromyography (EMG), and other similar devices for creating models using Digital Signal Preprocessing, every row of a dataset should be device readings per unit of time with a label as a target. You should not shuffle signal labels or encode your signal for model creation.
For example, if the window is 8, then the dataset can be organized as follows:

Training Dataset

The number of lines must be excluded from the training dataset.

Data can be represented in the following types: INT8, INT16, FLOAT 32

List of requirements for test (holdout) datasets (or new data)

When transferring data for inference on the device, the values must always be in the same order as in the training dataset.

Datatype should match the train data type.

A dataset must be a CSV file using UTF-8 or ISO-8859-1 encoding.

The first row in the dataset must contain the column names, and a comma, semicolon, pipe, caret, or tab must be used as a separator. CRLF or LF should be used as the end of a line character.

The test (holdout) dataset must have the same file structure with the same requirements for the feature values as the training dataset. The order of fields must be the same as in the training dataset.

End-of-line symbols must be excluded from the field values.

Test (holdout) Dataset

How to identify and change file encoding

The Neuton Platform supports UTF-8 and ISO-8859-1 encoding for CSV files. Please check your file encoding and convert it to one of the supported options.

To check the current encoding of a file, open the file in the text editor of your choice (for example Notepad++). You will find the file encoding specified in the bottom right corner of the window.

UTF-8

If your file encoding differs from the options supported, you’ll need to convert it to UTF-8 and save the file. To change the encoding, select the Encoding menu and click “Convert to UTF-8”. When the conversion is complete, you will see “UTF-8” specified in the bottom right corner. Save the file to use on the Neuton platform.

Encoding