NLPre-PL Dataset


The official NLPre-PL dataset is a version of the NKJP1M corpus, the 1-million-token balanced subcorpus of the National Corpus of Polish (Narodowy Korpus Języka Polskiego), uniformly divided at the paragraph level.

The NLPre-PL dataset aims to divide the paragraphs fairly, both length-wise and topic-wise, into train, development, and test sets. This ensures a similar distribution of the number of segments per paragraph across the splits and avoids the situation in which paragraphs with a small (or large) number of segments appear in only one split, e.g. only in the test set.
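As an illustration of the idea only, and not the procedure actually used to build NLPre-PL, a length- and topic-aware split can be sketched as a stratified split. The field names `topic` and `n_segments`, the bucket edges, and the 80/10/10 ratios are all assumptions made for this example.

```python
import random
from collections import defaultdict


def length_bucket(n_segments, edges=(5, 20, 50)):
    """Map a segment count to a coarse length bucket (edges are illustrative)."""
    for i, edge in enumerate(edges):
        if n_segments <= edge:
            return i
    return len(edges)


def stratified_split(paragraphs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split paragraphs into train/dev/test, stratified by (topic, length bucket).

    Each paragraph is a dict with hypothetical keys "topic" and "n_segments".
    """
    rng = random.Random(seed)

    # Group paragraphs so that each stratum shares a topic and a length bucket.
    strata = defaultdict(list)
    for p in paragraphs:
        strata[(p["topic"], length_bucket(p["n_segments"]))].append(p)

    splits = {"train": [], "dev": [], "test": []}
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_dev = round(len(group) * ratios[1])
        n_test = round(len(group) * ratios[2])
        # Every stratum contributes to each split in roughly the same proportion;
        # whatever is left after dev and test goes to train.
        splits["dev"].extend(group[:n_dev])
        splits["test"].extend(group[n_dev:n_dev + n_test])
        splits["train"].extend(group[n_dev + n_test:])
    return splits
```

Because every (topic, length bucket) stratum is divided in the same proportions, short and long paragraphs from each topic end up in all three sets, provided the stratum is large enough for rounding not to leave a split empty.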

🤗 NLPre-PL Dataset 🤗 PDB-UD Dataset


NLPre-PL Trained models


Listed here are all available models, trained for the purpose of creating the NLPre-PL Benchmark.