The Kinyarwanda tokenizer is a powerful tool for processing Kinyarwanda text. It is designed to provide fast and accurate tokenization, with a focus on quality and performance.

Alta-tokenizer is a Python library for tokenizing Kinyarwanda text. It is based on the Byte Pair Encoding (BPE) algorithm and can both encode and decode text in Kinyarwanda. It can also tokenize other languages such as English or French, but with a lower compression rate, because the tokenizer was trained on Kinyarwanda text only. To support a different language, you can train your own custom tokenizer on a dataset in that language using the provided training function; this is covered in the section on training your own tokenizer.

The tokenizer is evaluated on two criteria: its compression rate and its ability to encode text and decode it back to the original. The compression rate is the ratio of the number of characters in the original text to the number of tokens in the encoded text. A higher rate means the same text is represented with fewer tokens.

For example, take the sentence "Nagiye gusura abanyeshuri.", which has 26 characters. Suppose it is tokenized into the following tokens:
[78, 1760, 32, 5256, 32, 1845, 46]. The total number of tokens is 7, so the compression rate is 26 / 7 ≈ 3.714x (the value is rounded to three decimal places). In other words, each token covers about 3.7 characters on average.
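The calculation above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name compression_rate is made up for this example and is not part of the alta-tokenizer API, and the token ids are copied from the example rather than produced by the library.

```python
def compression_rate(text, token_ids):
    """Number of characters in the text divided by the number of tokens.

    Hypothetical helper for illustration; not part of alta-tokenizer.
    """
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text) / len(token_ids)


sentence = "Nagiye gusura abanyeshuri."
# Token ids copied from the example above, not generated here.
token_ids = [78, 1760, 32, 5256, 32, 1845, 46]

rate = compression_rate(sentence, token_ids)
print(f"{len(sentence)} characters / {len(token_ids)} tokens = {rate:.3f}")
# prints: 26 characters / 7 tokens = 3.714
```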

Custom Training

Alta-tokenizer offers an easy-to-use interface for training your own tokenizer. By supplying a custom dataset to the provided training function, users can retrain the tokenizer to improve it for a specific use case.

Language Optimization

Built on the Byte-Pair Encoding algorithm, it processes text efficiently while preserving its meaning. Because it is designed specifically for Kinyarwanda, it leverages the language's linguistic patterns to outperform generic tokenizers on Kinyarwanda text.
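To make the Byte-Pair Encoding idea concrete, here is a minimal, generic byte-level BPE sketch. It is not alta-tokenizer's implementation: the function names (train_bpe, encode, decode), the choice to start from raw UTF-8 bytes, and the tiny corpus are all assumptions made for illustration. The sketch repeatedly merges the most frequent adjacent pair of tokens into a new token, which is why frequent Kinyarwanda character sequences end up as single tokens when the training text is Kinyarwanda.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from UTF-8 bytes (ids 0-255)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in learning order
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Turn text into token ids by applying the merges in learning order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Rebuild the original text by expanding each token back to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Tiny illustrative corpus; a real tokenizer would train on far more text.
corpus = "abanyeshuri abanyeshuri abanyeshuri"
merges = train_bpe(corpus, num_merges=10)
ids = encode("abanyeshuri", merges)
print(len("abanyeshuri"), "characters ->", len(ids), "tokens")
print(decode(ids, merges))
```

Text that was not seen during training still round-trips correctly in this sketch, because any byte without a learned merge simply stays as its own token; it just compresses less, which matches the lower compression rate described for non-Kinyarwanda text.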

Easy Integration

With its Python package and clear documentation, alta-tokenizer can be integrated into existing NLP pipelines, making it a versatile tool for developers and researchers.

Language Flexibility

Although optimized for Kinyarwanda, alta-tokenizer can be applied to other languages, which makes it suitable for multilingual projects or for cases where you want a custom tokenizer trained on your own language data.