[Predictive Intelligence] Training ML Solutions [Classification/Similarity/Regression] can take more than 12 hours and appear to be stuck when using the default Paragraph Vector [PV] setting


Details

In Predictive Intelligence, training ML Solutions [Classification/Similarity/Regression] with the default Paragraph Vector [PV] setting can take considerable time when the solution definition has multiple Input fields and a large training dataset [300k+ records]. Users have reported that the training looks "stuck" because it has not progressed for 12+ hours.

You can also train the same ML Solutions [Classification/Similarity/Regression] using the Term Frequency–Inverse Document Frequency [TF-IDF] setting, which has been shown to reduce training time considerably for ML Solutions that take a long time to train with the default setting. We have seen ML Solutions that take 15 hours to train using the default Paragraph Vector [PV] setting reduced to 1.5 hours of training time with the Term Frequency–Inverse Document Frequency [TF-IDF] setting.

Training a solution with the Term Frequency–Inverse Document Frequency [TF-IDF] setting generates a much larger trained solution model, which can be 10x larger than a model trained with the default Paragraph Vector [PV] setting. However, the larger model size does not impact the performance of the predictions made with either trained solution.
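
As a rough illustration of why a TF-IDF model can be larger, the sketch below computes textbook TF-IDF weights in plain Python. It is not the Predictive Intelligence implementation; the function name tfidf_vectors and the sample records are hypothetical. It only shows that each TF-IDF vector has one entry per distinct term in the training text, so the vector length grows with the vocabulary. That this is the reason for the larger trained solution model is an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using textbook TF-IDF weighting.

    Hypothetical helper for illustration only; not a ServiceNow API.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One weight per vocabulary term: term frequency * inverse document frequency.
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Made-up sample records standing in for Input field text.
docs = [
    "email not working",
    "vpn not connecting",
    "email password reset",
]
vocabulary, vectors = tfidf_vectors(docs)
print(len(vocabulary), "distinct terms; each vector has", len(vectors[0]), "entries")
```

With a training dataset of 300k+ records, the number of distinct terms is far larger than in this toy example, and each vector is correspondingly longer.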

Additional Information

For further information on how to configure the Term Frequency–Inverse Document Frequency [TF-IDF] setting in your ML Solutions, please review our documentation on Configure TF-IDF for classification, similarity, and regression solutions [Paris].