12.2 - STEP 1

STEP 1 - Parallel Corpus Preparation

Use one of the CzEng parallel corpus files provided here. Open the file in MS Excel and perform the following steps:

 

1. Choose Soubor/Otevřít (File/Open) and set the file type to "všechny soubory" (all files). If a warning window pops up, press the "Yes" button.

2. The Import wizard opens: make sure that the "Oddělovač" (Delimited) radio button is selected, then press "Next" twice and "Finish".

3. The result should have 3 columns. Insert a blank row at the very top and enter the headings: "ID", "English", "Czech".

4. Activate the filter by moving to the Data tab and pressing the Filter icon.

5. Press the filter drop-down arrow in the ID heading. Choose "Filtry textu/Má na začátku/..." (Text Filters/Begins With). Type the first couple of characters of the domain of your choice (EU documents or technical documentation). Use only one of these two suggested domains!

6. While the filter is active, select all 3 columns, press "Ctrl+C", move to a new sheet and press "Ctrl+V". That way you get rid of the unnecessary rows.

7. You may now discard the original file and save only the filtered result (there should be around 15,000 rows).
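
For readers who prefer a script to the Excel filter, the filtering steps above can be sketched in Python. This is only a minimal sketch under assumptions: it assumes the corpus file is tab-separated with exactly three columns (ID, English, Czech), and the IDs and the prefix "eu" in the sample are made up for illustration. Check the real IDs in your file before choosing a prefix.

```python
import csv
import io


def filter_by_domain(lines, prefix):
    """Keep 3-column rows whose ID (first column) starts with `prefix`."""
    # QUOTE_NONE: treat quotation marks in sentences as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if len(row) == 3 and row[0].startswith(prefix)]


# Tiny in-memory sample standing in for the corpus file (hypothetical IDs).
sample = io.StringIO(
    "eu-001\tThe Council shall act.\tRada rozhoduje.\n"
    "sub-002\tHello there.\tAhoj.\n"
    "eu-003\tArticle 2 applies.\tPoužije se článek 2.\n"
)
rows = filter_by_domain(sample, "eu")
print(len(rows), "rows kept")
```

With a real file, an open file handle (opened with encoding="utf-8" and newline="") could be passed in place of the in-memory sample.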
From this point on, your task is to reduce the number of rows to exactly 4,000 (excluding the heading). Because the quality of the chosen pairs will be evaluated, it is recommended to select only rows of a certain length and quality.
If you have trouble accomplishing the task, watch the following two videos:
Source Editing Part 1 and Part 2.
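
The reduction to 4,000 rows can also be approximated with a script. The sketch below is one possible approach, not the method shown in the videos: the thresholds (minimum and maximum word counts, maximum length ratio between the two languages) and the function name `select_pairs` are assumptions chosen for illustration, and a manual quality check of the selected pairs is still advisable.

```python
def select_pairs(rows, n=4000, min_words=5, max_words=40, max_ratio=2.0):
    """Return up to `n` (id, english, czech) rows passing simple length checks."""
    kept = []
    seen = set()
    for row_id, en, cs in rows:
        en_len, cs_len = len(en.split()), len(cs.split())
        # Drop pairs that are too short or too long on either side.
        if not (min_words <= en_len <= max_words and min_words <= cs_len <= max_words):
            continue
        # Drop pairs whose lengths differ too much (possible misalignment).
        if max(en_len, cs_len) > max_ratio * min(en_len, cs_len):
            continue
        # Drop exact duplicate sentence pairs.
        if (en, cs) in seen:
            continue
        seen.add((en, cs))
        kept.append((row_id, en, cs))
        if len(kept) == n:
            break
    return kept


# Hypothetical demo rows; real input would be the filtered corpus rows.
demo = [
    ("x-1", "This is a short test sentence.", "Toto je krátká testovací věta."),
    ("x-2", "Yes.", "Ano."),
    ("x-3", "This is a short test sentence.", "Toto je krátká testovací věta."),
    ("x-4", "Another pair of reasonable length here.", "Další dvojice rozumné délky zde."),
]
selected = select_pairs(demo, n=2, min_words=3)
print([row[0] for row in selected])
```

If fewer than `n` rows pass the checks, the function returns fewer rows, so the thresholds would need to be loosened until exactly 4,000 remain.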