This script generates a dataset for testing strict string matching. It creates composite strings containing random numeric IDs and specific IDs (23213324 and 213213321), separated by random delimiters (-, /, ;). Each sample includes a query string and a label (1 for an exact match, 0 for no match). The dataset enforces a rule where even a single-character difference results in no match. The generated dataset is saved as a CSV file for further use.
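For reference, the generation step described above could look roughly like the sketch below. It is only an illustration, not the original script: the function names, the column names, the ID length range, the number of IDs per composite string, and the 50% chance of planting a specific ID are all assumptions.

```python
import csv
import random

SPECIFIC_IDS = ["23213324", "213213321"]  # the two IDs named in the description
DELIMITERS = ["-", "/", ";"]


def random_id(rng, min_len=6, max_len=10):
    """Return a random numeric ID; the length range is an assumption."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice("0123456789") for _ in range(length))


def make_sample(rng, n_ids=5):
    """Build one (composite_string, query_string, label) row."""
    delimiter = rng.choice(DELIMITERS)
    ids = [random_id(rng) for _ in range(n_ids)]
    # Assumption: about half of the samples get a specific ID planted in them.
    if rng.random() < 0.5:
        ids[rng.randrange(n_ids)] = rng.choice(SPECIFIC_IDS)
    composite = delimiter.join(ids)
    query = rng.choice(SPECIFIC_IDS)
    # Strict rule: the query must equal a whole ID, character for character.
    label = 1 if query in ids else 0
    return composite, query, label


def generate_dataset(path, n_samples=1000, seed=0):
    """Write the samples to a CSV file and also return them as a list."""
    rng = random.Random(seed)
    rows = [make_sample(rng) for _ in range(n_samples)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["composite_string", "query_string", "label"])
        writer.writerows(rows)
    return rows
```

Calling `generate_dataset("dataset.csv")` writes a header row plus one row per sample. The label is computed from the split IDs rather than from a substring test, so a query that only appears inside a longer ID is labeled 0.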
The next step trains a BERT-based model to decide whether a specific ID (query_string) appears in a list of IDs (composite_string) separated by -, /, or ;. The dataset contains labeled examples (1 for a match, 0 for no match) and tests strict matching: even a single-character difference counts as a non-match.
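The strict matching rule itself can be written as a small deterministic function, which may be useful as ground truth when checking the model's predictions on new inputs. This is a sketch of the rule as described, not part of the original code; the name `strict_match` is hypothetical.

```python
import re


def strict_match(query_string, composite_string):
    """Return 1 if query_string equals one whole ID in composite_string, else 0."""
    # Split on any of the three separators used in the dataset.
    ids = re.split(r"[-/;]", composite_string)
    return 1 if query_string in ids else 0


print(strict_match("23213324", "111-23213324-999"))   # exact match
print(strict_match("2321332", "111-23213324-999"))    # one character short
print(strict_match("23213324", "111/123213324/999"))  # only a substring of a longer ID
```

Comparing the model's output against this function on unseen composite strings shows exactly which inputs it gets wrong.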
The model leverages BERT to encode the relationships between the query and composite strings. After training on a large dataset, it achieved near-perfect validation accuracy of 99.95%.
However, the model does not produce good predictions for new inputs.
