author | sl |
Tue, 10 Jun 2014 14:32:02 +0200 | |
changeset 1 | 260cb5ec6c19 |
permissions | -rw-r--r-- |
1 # This file is derived from
2 #
3 # http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
4 #
5 # Which was created by Markus Kuhn <mkuhn@acm.org> - 2000-09-02
6 #
7 # lines begining with # and blank lines are ignored
8 #
9 # Beyond that, this file consists of a series of test cases. Each test case consists of
10 # 2 or 3 lines:
11 #
12 # 1. A UTF-8 string
13 # 2. A status
14 # VALID : The string is a valid UTF-8 representation of valid Unicode
15 # INCOMPLETE : The string has a partial character at the end
16 # NOTUNICODE : The string is valid UTF-8, but the characters represented
17 # are not valid unicode (
18 # OVERLONG : The string includes overlong sequences
19 # MALFORMED : The string is not valid UTF-8
20 # 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string,
21 # as a series of hex numbers.
23 # 1 Some correct UTF-8 text
24 κόσμε
25 VALID
26 03ba 1f79 03c3 03bc 03b5
28 # 2.1 First possible sequence of a certain length
29 #
30 # FIXME - handle NULLS?
31 #
32 # [ NULL BYTE ]
33 #VALID
34 #0000
36
37 VALID
38 0080
41 NOTUNICODE
42 00200000
45 NOTUNICODE
46 04000000
48
49 VALID
50 0000007f
52 ߿
53 VALID
54 000007ff
57 NOTUNICODE
58 0000ffff
61 NOTUNICODE
62 001fffff
65 NOTUNICODE
66 03ffffff
69 NOTUNICODE
70 7fffffff
72 # 2.3 Other boundary conditions
74
75 VALID
76 d7ff
78 �
79 VALID
80 fffd
82
83 NOTUNICODE
84 0010ffff
87 NOTUNICODE
88 00110000
90 # 3.1 Unexpected continuation bytes
93 MALFORMED
95 MALFORMED
97 MALFORMED
99 MALFORMED
101 MALFORMED
103 MALFORMED
105 MALFORMED
107 MALFORMED
109 MALFORMED
111 # 3.2 Lonely start characters
114 MALFORMED
116 MALFORMED
118 MALFORMED
120 MALFORMED
122 MALFORMED
124 # 3.3 Sequences with last continuation byte missing
127 INCOMPLETE
129 INCOMPLETE
131 INCOMPLETE
133 INCOMPLETE
135 INCOMPLETE
137 INCOMPLETE
139 INCOMPLETE
141 INCOMPLETE
143 INCOMPLETE
145 INCOMPLETE
147 # 3.4 Concatenation of incomplete sequences
150 MALFORMED
152 # 3.5 Impossible bytes
155 MALFORMED
157 MALFORMED
159 MALFORMED
161 # Examples of an overlong ASCII character
164 OVERLONG
166 OVERLONG
168 OVERLONG
170 OVERLONG
172 OVERLONG
174 # Maximum overlong sequences
177 OVERLONG
179 OVERLONG
181 OVERLONG
183 OVERLONG
185 OVERLONG
187 # Overlong representation of the NUL character
190 OVERLONG
192 OVERLONG
194 OVERLONG
196 OVERLONG
198 OVERLONG
200 # Illegal code positions
202 # Single UTF-16 surrogates
205 NOTUNICODE
206 d800
209 NOTUNICODE
210 db7f
213 NOTUNICODE
214 db80
217 NOTUNICODE
218 dbff
221 NOTUNICODE
222 dc00
225 NOTUNICODE
226 df80
229 NOTUNICODE
230 dfff
232 # Paired UTF-16 surrogates
235 NOTUNICODE
236 d800 dc00
239 NOTUNICODE
240 d800 dfff
243 NOTUNICODE
244 db7f dc00
247 NOTUNICODE
248 db7f dfff
251 NOTUNICODE
252 db80 dc00
255 NOTUNICODE
256 db80 dfff
259 NOTUNICODE
260 dbff dc00
263 NOTUNICODE
264 dbff dfff
266 # Other illegal code positions
269 NOTUNICODE
270 fffe
273 NOTUNICODE
274 ffff
276 ################
277 #
278 # Some more tests, not from Markus Kuhn's file
279 #
281 # Mixed plane 0 and higher planes