# This file is derived from
#
# http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
#
# Which was created by Markus Kuhn - 2000-09-02
#
# lines beginning with # and blank lines are ignored
#
# Beyond that, this file consists of a series of test cases. Each test case consists of
# 2 or 3 lines:
#
# 1. A UTF-8 string
# 2. A status
#    VALID      : The string is a valid UTF-8 representation of valid Unicode
#    INCOMPLETE : The string has a partial character at the end
#    NOTUNICODE : The string is valid UTF-8, but the characters represented
#                 are not valid Unicode
#    OVERLONG   : The string includes overlong sequences
#    MALFORMED  : The string is not valid UTF-8
# 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string,
#    as a series of hex numbers.

# 1 Some correct UTF-8 text
κόσμε
VALID
03ba 1f79 03c3 03bc 03b5

# 2.1 First possible sequence of a certain length
#
# FIXME - handle NULLS?
sl@0: # sl@0: # [ NULL BYTE ] sl@0: #VALID sl@0: #0000 sl@0: sl@0: € sl@0: VALID sl@0: 0080 sl@0: sl@0: sl@0: NOTUNICODE sl@0: 00200000 sl@0: sl@0: sl@0: NOTUNICODE sl@0: 04000000 sl@0: sl@0:  sl@0: VALID sl@0: 0000007f sl@0: sl@0: ߿ sl@0: VALID sl@0: 000007ff sl@0: sl@0: ￿ sl@0: NOTUNICODE sl@0: 0000ffff sl@0: sl@0: sl@0: NOTUNICODE sl@0: 001fffff sl@0: sl@0: sl@0: NOTUNICODE sl@0: 03ffffff sl@0: sl@0: sl@0: NOTUNICODE sl@0: 7fffffff sl@0: sl@0: # 2.3 Other boundary conditions sl@0: sl@0: ퟿ sl@0: VALID sl@0: d7ff sl@0: sl@0: � sl@0: VALID sl@0: fffd sl@0: sl@0: 􏿿 sl@0: NOTUNICODE sl@0: 0010ffff sl@0: sl@0: sl@0: NOTUNICODE sl@0: 00110000 sl@0: sl@0: # 3.1 Unexpected continuation bytes sl@0: sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: # 3.2 Lonely start characters sl@0: sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: # 3.3 Sequences with last continuation byte missing sl@0: sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: INCOMPLETE sl@0: sl@0: # 3.4 Concatenation of incomplete sequences sl@0: sl@0: sl@0: MALFORMED sl@0: sl@0: # 3.5 Impossible bytes sl@0: sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: MALFORMED sl@0: sl@0: # Examples of an overlong ASCII character sl@0: sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: # Maximum overlong sequences sl@0: sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: # Overlong representation of the NUL character sl@0: sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: 
sl@0: OVERLONG sl@0: sl@0: OVERLONG sl@0: sl@0: # Illegal code positions sl@0: sl@0: # Single UTF-16 surrogates sl@0: sl@0: sl@0: NOTUNICODE sl@0: d800 sl@0: sl@0: sl@0: NOTUNICODE sl@0: db7f sl@0: sl@0: sl@0: NOTUNICODE sl@0: db80 sl@0: sl@0: sl@0: NOTUNICODE sl@0: dbff sl@0: sl@0: sl@0: NOTUNICODE sl@0: dc00 sl@0: sl@0: sl@0: NOTUNICODE sl@0: df80 sl@0: sl@0: sl@0: NOTUNICODE sl@0: dfff sl@0: sl@0: # Paired UTF-16 surrogates sl@0: sl@0: sl@0: NOTUNICODE sl@0: d800 dc00 sl@0: sl@0: sl@0: NOTUNICODE sl@0: d800 dfff sl@0: sl@0: sl@0: NOTUNICODE sl@0: db7f dc00 sl@0: sl@0: sl@0: NOTUNICODE sl@0: db7f dfff sl@0: sl@0: sl@0: NOTUNICODE sl@0: db80 dc00 sl@0: sl@0: sl@0: NOTUNICODE sl@0: db80 dfff sl@0: sl@0: sl@0: NOTUNICODE sl@0: dbff dc00 sl@0: sl@0: sl@0: NOTUNICODE sl@0: dbff dfff sl@0: sl@0: # Other illegal code positions sl@0: sl@0: ￾ sl@0: NOTUNICODE sl@0: fffe sl@0: sl@0: ￿ sl@0: NOTUNICODE sl@0: ffff sl@0: sl@0: ################ sl@0: # sl@0: # Some more tests, not from Markus Kuhn's file sl@0: # sl@0: sl@0: # Mixed plane 0 and higher planes sl@0: