Dataset Requirements
Requirements for Audio Files
To perform machine learning on the platform, data must be provided in CSV format. For that purpose, Neuton developed an application for converting audio files. Below are the requirements for how audio files should be prepared for this application:
Raw audio files should be in wav or wave formats.
Audio files should all be placed in a folder.
The folder must contain subfolders, each holding the samples of a single class (label).
The subfolder name will be used for class labeling (see the folder example and sketch below).
Folder Example
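As an illustration only, below is a minimal Python sketch (the root folder and class names are hypothetical) that checks such a layout and shows how the subfolder names map to class labels:

# Expected layout (names are hypothetical):
#   audio_data/
#       cough/      cough_001.wav, cough_002.wav, ...
#       no_cough/   background_001.wav, ...
# Each subfolder name becomes the class label.
import os

root = "audio_data"
for label in sorted(os.listdir(root)):
    class_dir = os.path.join(root, label)
    if not os.path.isdir(class_dir):
        continue  # only subfolders are treated as classes
    wav_files = [f for f in os.listdir(class_dir)
                 if f.lower().endswith((".wav", ".wave"))]
    print(f"class '{label}': {len(wav_files)} audio files")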
To validate the model on separate data, the audio files must also be converted to CSV format. The data requirements are similar to those mentioned above.
Dataset Requirements for Sensor Data
The requirements below are provided for training and test datasets. Meeting these requirements will guarantee successful model training and prediction on the new data. We recommend using a text editor like Notepad++ to check the datasets.
Requirements for training datasets
A dataset must be a CSV file using UTF-8 or ISO-8859-1 encoding.
A dataset must have a minimum of 2 columns, 50 rows, and a header (51 rows in total). The first row in the dataset must contain the column names.
The file name must not contain the following characters: !/[+!@#$%^&*,. ?":{}\\/|<>()[]]
All feature values in the dataset must be numeric. For the classification task type, the target column may also contain string values; for the regression task type, the target variable must contain only numeric values.
A dataset must not have any empty values or values which represent empty values like “NA”, “NAN”, etc.
A comma, semicolon, pipe, caret, or tab must be used as a separator. CRLF or LF should be used as the end-of-line character. The separator and end-of-line character should be consistent inside the dataset.
All column names (values in the CSV file header) must be unique and must contain only letters (a-z, A-Z), numbers (0-9), hyphens (-), or underscores (_).
For the classification task type, a training dataset must have a minimum of 2 classes of target variable with at least 10 samples provided for each class.
Currently, Neuton supports only the EN-US locale for numbers, so:
You must use a dot as a decimal separator and remove the spaces or commas typically used to separate every third digit in your numeric fields. For example – "20,000.00" should be replaced with "20000.00" (a cleaning sketch is shown after this list).
If any numeric column is represented as a combination of a number and its corresponding unit, then only the number should be placed in the column. For example – "$20,000.00" should be replaced with "20000.00"
End-of-line symbols must be excluded from the field values.
For sensor data from gyroscopes, accelerometers, magnetometers, electromyography (EMG), and other similar devices used to create models with Digital Signal Preprocessing, every row of the dataset should contain the device readings for one unit of time, with the label as the target. Do not shuffle the rows of a signal or encode the signal before model creation.
For example, if the window is 8, then the dataset can be organized as follows:
Training Dataset
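For illustration only, assuming a 3-axis accelerometer with hypothetical column names ax, ay, az and a window of 8, the first window could look like this (one reading per row, the same label repeated for all 8 rows):

ax,ay,az,target
0.02,0.98,0.11,walking
0.03,0.97,0.12,walking
0.01,0.99,0.10,walking
0.04,0.96,0.13,walking
0.02,0.98,0.11,walking
0.03,0.97,0.12,walking
0.02,0.99,0.10,walking
0.03,0.98,0.11,walking

The next 8 rows would then form the following window and would again share a single label (for example, running).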
Row numbers (line indices) must be excluded from the training dataset.
Data can be represented in the following types: INT8, INT16, FLOAT32.
It is acceptable to upload a dataset with a date column, but it will be automatically removed during training.
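The EN-US number formatting requirement above can be met with a small cleaning step before upload. A minimal sketch using Python's standard csv module (the file names are hypothetical, and the cleaning is applied to every field here for brevity; in practice restrict it to your numeric columns so that string targets are not altered):

import csv

def clean_number(value):
    # "$20,000.00" -> "20000.00": drop currency symbols, thousands separators, spaces
    return value.replace("$", "").replace(",", "").replace(" ", "")

with open("raw.csv", newline="") as src, open("clean.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # copy the header unchanged
    for row in reader:
        writer.writerow([clean_number(v) for v in row])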
List of requirements for test datasets (or new data)
When transferring data for inference on the device, the values must always be in the same order as in the training dataset (see the sketch after this list).
The data type should match the training data type.
A dataset must be a CSV file using UTF-8 or ISO-8859-1 encoding.
The first row in the dataset must contain the column names, and a comma, semicolon, pipe, caret, or tab must be used as a separator. CRLF or LF should be used as the end-of-line character.
The test dataset must have the same file structure with the same requirements for the feature values as the training dataset. The order of fields must be the same as in the training dataset.
End-of-line symbols must be excluded from the field values.
Test Dataset
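To illustrate the ordering requirement above, here is a minimal sketch (the feature names and values are hypothetical) that builds the inference input in exactly the training column order:

# Column order used at training time (target excluded)
feature_order = ["ax", "ay", "az"]

# A new reading collected on the device, keyed by feature name
reading = {"az": 0.11, "ax": 0.02, "ay": 0.98}

# Build the input strictly in the training order before passing it to the model
input_values = [reading[name] for name in feature_order]
print(input_values)  # [0.02, 0.98, 0.11]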
How to identify and change file encoding
The Neuton Platform supports UTF-8 and ISO-8859-1 encoding for CSV files. Please check your file encoding and convert it to one of the supported options.
To check the current encoding of a file, open the file in the text editor of your choice (for example Notepad++). You will find the file encoding specified in the bottom right corner of the window.
UTF-8
If your file encoding differs from the options supported, you’ll need to convert it to UTF-8 and save the file. To change the encoding, select the Encoding menu and click “Convert to UTF-8”. When the conversion is complete, you will see “UTF-8” specified in the bottom right corner. Save the file to use on the Neuton platform.
Encoding
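If you prefer to convert the file from a script instead of Notepad++, here is a minimal Python sketch (the file names are hypothetical) that re-reads an ISO-8859-1 file and writes it back as UTF-8:

# Read the dataset as ISO-8859-1 and rewrite it as UTF-8
with open("dataset_latin1.csv", encoding="iso-8859-1") as src:
    content = src.read()
with open("dataset_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(content)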
Dataset Requirements for Tabular Data
The requirements below are provided for training and test datasets. Meeting these requirements will guarantee successful model training and prediction on the new data. We recommend using a text editor like Notepad++ to check the datasets.
Requirements for training datasets
A dataset must be a CSV file using UTF-8 or ISO-8859-1 encoding.
A dataset must have a minimum of 2 columns, 50 rows, and a header (51 rows in total).
The file name must not contain the following characters: !/[+!@#$%^&*,. ?":{}\\/|<>()[]]
All feature values in the dataset must be numeric. For the classification task type, the target column may also contain string values; for the regression task type, the target variable must contain only numeric values.
A dataset must not have any empty values or values which represent empty values like “NA”, “NAN”, etc.
The first row in the dataset must contain the column names. A comma, semicolon, pipe, caret, or tab must be used as a separator. CRLF or LF should be used as the end-of-line character. The separator and end-of-line character should be consistent inside the dataset.
All column names (values in the CSV file header) must be unique and must contain only letters (a-z, A-Z), numbers (0-9), hyphens (-), or underscores (_).
For the classification task type, a training dataset must have a minimum of 2 classes of target variable with at least 10 samples provided for each class.
Currently, Neuton supports only the EN-US locale for numbers, so:
You must use a dot as a decimal separator and remove the spaces or commas typically used to separate every third digit in your numeric fields. For example – "20,000.00" should be replaced with "20000.00"
If any numeric column is represented as a combination of a number and its corresponding unit, then only the number should be placed in the column. For example – "$20,000.00" should be replaced with "20000.00"
You can use any of the following date/timestamp formats:
"DD.MM.YYYY hh:mm:ss";
"MM.DD.YYYY hh:mm:ss";
"YYYY.MM.DD hh:mm:ss";
"YYYY.DD.MM hh:mm:ss";
"DD.MM hh:mm:ss";
"MM.DD hh:mm:ss";
"MM.YYYY hh:mm:ss";
"YYYY.MM hh:mm:ss";
You can use "/", ":", or "." as the date component separator. A space should be used to separate the date and time, and ":" should be used as the time component separator.
The timestamp format should be consistent inside any given column.
Test datasets should use the same timestamp format as the training dataset (a parsing sketch follows this list).
End-of-line symbols must be excluded from the field values.
Training Dataset
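To illustrate the timestamp rules above, here is a minimal sketch (the format and sample value are hypothetical; use the single format that your column actually follows) that checks a value against the "DD.MM.YYYY hh:mm:ss" pattern:

from datetime import datetime

# "DD.MM.YYYY hh:mm:ss" with "." as the date separator and ":" for time
value = "21.03.2023 14:05:30"
parsed = datetime.strptime(value, "%d.%m.%Y %H:%M:%S")
print(parsed)  # 2023-03-21 14:05:30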
List of requirements for test datasets (or new data)
A dataset must be a CSV file using UTF-8 or ISO-8859-1 encoding.
For prediction on the platform (using the web interface), the first row in the dataset must contain the column names. A comma, semicolon, pipe, caret, or tab must be used as a separator, and CRLF or LF should be used as the end-of-line character. When transferring data for inference on the device, the values must always be in the same order as in the training dataset (without the target).
The test dataset must have the same file structure with the same requirements for the feature values as the training dataset.
The order of fields must be the same as in the training dataset. For example, if the column order in the training dataset is ‘B, C, A’, then the input data for prediction must follow the same order: the value for feature B, then the value for feature C, and then the value for feature A (see the sketch below).
End-of-line symbols must be excluded from the field values.
Test Dataset
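As an illustration of the ordering rule above, here is a minimal sketch that rewrites new data into the hypothetical ‘B, C, A’ training order (the file names are hypothetical):

import csv

# Column order used at training time (from the 'B, C, A' example above)
training_order = ["B", "C", "A"]

with open("new_data.csv", newline="") as src, open("ordered.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)      # input columns may appear in any order
    writer = csv.writer(dst)
    writer.writerow(training_order)   # header written in the training order
    for row in reader:
        writer.writerow([row[name] for name in training_order])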
How to identify and change file encoding
The Neuton Platform supports UTF-8 and ISO-8859-1 encoding for CSV files. Please check your file encoding and convert it to one of the supported options.
To check the current encoding of a file, open the file in the text editor of your choice (for example Notepad++). You will find the file encoding specified in the bottom right corner of the window.
UTF-8
If your file encoding differs from the options supported, you’ll need to convert it to UTF-8 and save the file. To change the encoding, select the Encoding menu and click “Convert to UTF-8”. When the conversion is complete, you will see “UTF-8” specified in the bottom right corner. Save the file to use on the Neuton platform.
Encoding

