How to start collecting training data fast, for free

When we speak to our past and present customers and partners, we hear that enterprise data is the lifeblood of any conversational service.

Whilst conversational platforms are adept at marketing the theoretical power of their Natural Language Understanding capability, and the fact that it is continually growing, the practical truth is that they are not frictionless in performance. Nor is the customer data needed for a service served up in perfect form on a plate exactly when you need it.

The reality is that although data may reside in an Enterprise, University, College or Partnership, it is often not perfect. It may be restricted by legislation (GDPR, HIPAA), or obtaining it may be held up by poor or slow inter-departmental data request and delivery processes. Collected data may be corrupt, unclean or duplicated, and even AI trainers can make less than perfect selections when copying, cleaning and preparing data.

Conversational Designers are quite clear that the most important first step is getting clean, clear intents and customer data (perhaps scraped from Twitter support data for speed) and spending time making sure that they have several super clean, distinct phrases. From there, supplementing these with similar phrases and some level of built-in variance (without losing the meaning) makes for a great foundational conversational data set.

Combined human and machine chatbot training data starts with 4-5 super clean training phrases, which are then expanded (or amplified) using machine generated content into a good quality set of clean, similar and varied phrases. Amplifying a quality data set with a machine generated service makes sense, at least on a time and cost basis, and especially if an intent has 3-5 clean seed phrases which are innately varied. This is the optimal process for one person to create many intents, each with up to 5 varied utterances which are amplified to around 15 phrases in total per intent.
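The seed-and-amplify idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how LevelFish's generator works: the seed phrases, the `PREFIXES` and `SUFFIXES` lists and the `amplify` function are all hypothetical, and simple prefix and suffix substitution stands in for whatever machine generation a real service uses.

```python
from itertools import product

# Hypothetical seed phrases for one intent (illustrative only).
SEEDS = [
    "I want to reset my password",
    "How do I change my password",
    "I forgot my password",
    "Help me recover my login",
]

# Hypothetical affixes that add surface variance without changing the meaning.
PREFIXES = ["", "Hi, ", "Hello, "]
SUFFIXES = ["", " please", " today"]


def _join(prefix, seed, suffix):
    """Attach affixes, lowercasing the seed's first letter after a prefix
    unless the seed starts with the pronoun "I"."""
    if prefix and not seed.startswith("I "):
        seed = seed[0].lower() + seed[1:]
    return f"{prefix}{seed}{suffix}"


def amplify(seeds, target=15):
    """Expand seed phrases to at most `target` unique phrases.

    The hand-written seeds always come first, so they are never dropped.
    """
    seen = set()
    phrases = []

    def add(phrase):
        key = phrase.lower()
        if key not in seen and len(phrases) < target:
            seen.add(key)
            phrases.append(phrase)

    for seed in seeds:
        add(seed)
    # Loop over affix pairs first so every seed receives variants in turn.
    for prefix, suffix in product(PREFIXES, SUFFIXES):
        if not prefix and not suffix:
            continue  # the plain seed is already in the list
        for seed in seeds:
            add(_join(prefix, seed, suffix))
    return phrases


for phrase in amplify(SEEDS):
    print(phrase)
```

Putting the seeds first and cycling through them means the amplified set stays anchored on the clean, human-written phrases, and no single seed dominates the 15 results.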

If you need further evidence of this, crowd sourcing companies state that each data creator (crowd worker) may only be able to create 3 phrases for each intent before becoming too repetitive. Adding other human resources makes sense, but it takes time and costs money to instruct, quality check and pay them.

Undertaking seed utterance amplification using a platform (like LevelFish’s) keeps costs low. Just one designer can create a complete, clean data set for most sizes of chatbot (say 100-500 intents), and there is less likelihood of bias as the training data will be of similar volume for each intent. If you can save time and money and ensure quality and accuracy, then this approach not only makes a lot of sense economically but will also help reduce the 25% cost of deployment for most chatbots.
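At around 15 phrases per intent, a 100-500 intent chatbot works out to roughly 1,500-7,500 phrases, so it helps to check that every intent really does carry a similar volume. The sketch below is one hedged way to do that. The function names, the toy data set and the target of 15 are assumptions made for illustration, not part of any particular platform.

```python
def phrase_counts(dataset):
    """Count unique phrases per intent, ignoring case and outer spaces."""
    return {
        intent: len({p.strip().lower() for p in phrases if p.strip()})
        for intent, phrases in dataset.items()
    }


def underfilled(dataset, target=15):
    """Return the intents that have fewer than `target` unique phrases."""
    return sorted(i for i, n in phrase_counts(dataset).items() if n < target)


# Hypothetical toy data set: two intents with uneven volume.
toy = {
    "reset_password": [f"reset phrase {n}" for n in range(15)],
    "open_account": ["Open an account", "open an account ", "I want an account"],
}

print(phrase_counts(toy))
print(underfilled(toy))
```

Counting unique phrases rather than rows means duplicated entries do not hide a thin intent: here `open_account` has three rows but only two distinct phrases, so it is flagged for more amplification.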

Why do I need help with training data?

Just as a poor golf club grip stands in the way of every great shot, poor training data stands squarely as the primary obstacle between a good service and a bad one or, to be blunt, between a service that gets funded to roll out as your pride and joy and one that gets canned.

Utterance generation can reduce training bias and build costs, consistently improve training data, and empower subject matter experts to retain responsibility for training. In other words, like low code chatbots, synthetic data generation empowers chatbot builders to build better bots with less knowledge, skill and money. This de-risks the project for you the owner, for the organisation and for the user: win, win, win.

If it is so important then who else is supporting training data recommendations?

Interestingly, recently trained an inference engine on Microsoft Turing data sets, and they have moved on from merely identifying intent and phrase issues to also repairing them and reducing training data bias in the process. This makes perfect sense: if you diagnose issues, why not fix them at the point of discovery too? In August 2022, stated that utterance suggestions were now their no. 1 most-used feature.

Botium is doing the same with a few utterance suggestions, and you can use simple open source recommendations, like those Rasa has, to infill issues in training data.

Both Botium and Qbox provide powerful QA testing platforms. What they are providing are great tools to fix issues pre-deployment.

What LevelFish is advocating is creating better data from the moment of ideation to help ensure that the process the builder and users traverse helps reduce overall costs, and most importantly helps ensure budget is secured to build and deploy by way of a better user experience, faster time to launch and lower costs.

LevelFish aims to make better roads at the outset rather than fill potholes just before you launch. LevelFish also does not charge $10,000s for great platforms to test chatbots; that’s not our aim. We provide empowering tools to make you more successful, faster and cheaper, and to ensure you get a budget to birth your business case.

In that respect, if your idea doesn’t get funded and fly, it will never need testing anyway, as by then it is too late.


So whether you are a student, hobbyist, subject matter expert or Conversational Designer, amplifying your data at the outset makes sense on so many levels, as does ensuring 5 quality phrases per intent.

Level up and run LevelFish’s training data generator today to make a real difference before you wish you had. This link gets you to our free to use version: try it out and use it. You can generate over 150 utterance variations in seconds. Build a solid foundation, don’t infill downstream. It is available for both Excel and Google Sheets. Enjoy, and please give us feedback!
