================
Sourced Strings
================

"Sourced strings" are strings that are annotated with information
about the location in a document where they were originally found.
Sourced strings are subclassed from Python strings.  As a result, they
can usually be used anywhere a normal Python string can be used.

Creating Sourced Strings
========================

A sourced string for a document can be constructed by calling the
`SourcedString` constructor with two arguments: a Python string
(containing the contents of the document), and a document identifier
(such as a file name):

    >>> from nltk.sourcedstring import *
    >>> newt_contents = """\
    ... She turned me into a newt!
    ... I got better."""
    >>> newt_doc = SourcedString(newt_contents, 'newt.txt')
    >>> print repr(newt_doc)
    'She turned me into a newt!\nI got better.'@[0:40]
    >>> newt = newt_doc.split()[5] # Find the sixth word.
    >>> print repr(newt)
    'newt!'@[21:26]

The suffix ``@[0:40]`` at the end of ``newt_doc``'s string
representation indicates that it is a sourced string beginning at
offset 0, and ending at offset 40.  Similarly, the suffix ``@[21:26]``
at the end of ``newt``'s string representation indicates that it spans
from offset 21 to offset 26.

.. note:: The `SourcedString` constructor automatically delegates to
   either `SimpleSourcedByteString` or `SimpleSourcedUnicodeString`,
   depending on whether its first argument has type ``str`` or
   ``unicode``.  The subclasses of `SourcedString` are discussed in
   more detail in `Unicode and Sourced Strings`_.

Sourced strings can also be created using the `SourcedStringStream`
class, which wraps an existing stream object, and causes its read
methods to return sourced strings:

    >>> from StringIO import StringIO
    >>> stream = SourcedStringStream(StringIO(newt_contents))
    >>> for line in stream:
    ...     print repr(line)
    'She turned me into a newt!\n'@[0:27]
    'I got better.'@[27:40]

Finally, some of NLTK's corpus readers can be instructed to return
sourced strings instead of Python strings:

    >>> from nltk.corpus import gutenberg
    >>> emma_words = gutenberg.words('austen-emma.txt', sourced=True)
    >>> remembrance = emma_words[114]
    >>> print repr(remembrance)
    'remembrance'@[552:563]
    >>> emma_sents = gutenberg.sents('austen-emma.txt', sourced=True)
    >>> emma_sents[28] # doctest: +NORMALIZE_WHITESPACE
    ['The'@[4980:4983], 'Woodhouses'@[4984:4994], 'were'@[4995:4999],
     'first'@[5000:5005], 'in'@[5006:5008], 'consequence'@[5009:5020],
     'there'@[5021:5026], '.'@[5026]]

String Sources
==============

The location where a sourced string was found is recorded using the
``source`` attribute:

    >>> newt.source
    StringSource('newt.txt', begin=21, end=26)
    >>> remembrance.source
    StringSource('austen-emma.txt', begin=552, end=563)

Sources are encoded using `StringSource` objects, which consist of a
document identifier along with information about the offsets of the
characters that make up the string.  These offsets are typically
either byte offsets or character offsets.  (As we'll see below, byte
offsets and character offsets are not equivalent when used to describe
unicode strings.)

String Sources define four attributes that describe the location where
a string was found: ``docid``, ``begin``, ``end``, and ``offsets``.
The ``docid`` attribute contains an identifier (such as a filename)
that names the document where the string was found:

    >>> newt.source.docid
    'newt.txt'
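Because each sourced string records its ``docid``, it is straightforward
to work with several documents at once and still know where every
substring came from.  The following sketch (not part of the doctest
above; the ``*.txt`` file names are hypothetical) builds one sourced
string per file, so that any token later extracted from them carries the
right file name in its ``docid``::

    import glob
    from nltk.sourcedstring import SourcedString

    docs = []
    for filename in glob.glob('*.txt'):        # hypothetical input files
        text = open(filename).read()
        # Use the file name itself as the document identifier.
        docs.append(SourcedString(text, filename))

    # Substrings taken from these documents keep their docid, so we can
    # always tell which file a given word was found in.
    all_words = [word for doc in docs for word in doc.split()]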
The ``begin`` and ``end`` attributes should be interpreted in the same
way as Python slice indices.  In particular, the ``begin`` index
specifies the offset of the first character in the string; and the
``end`` index specifies the offset just past the last character in the
string:

    >>> newt.source.begin
    21
    >>> newt.source.end
    26
    >>> newt_contents[newt.source.begin:newt.source.end]
    'newt!'

The ``offsets`` attribute returns a tuple of offsets specifying the
location of each character in the document:

    >>> newt.source.offsets
    (21, 22, 23, 24, 25, 26)

In particular, for a `SourcedString` ``s``, character ``s[i]`` begins
at offset ``s.source.offsets[i]`` and ends at offset
``s.source.offsets[i+1]``.  Note that the ``offsets`` list contains
one more offset than there are characters in the string:

    >>> len(newt), len(newt.source.offsets)
    (5, 6)

That's because the `StringSource` specifies both the begin offset and
the end offset for each character.  The ``begin`` and ``end``
attributes are always equal to the first and last elements of the
``offsets`` attribute, respectively:

    >>> assert newt.source.begin == newt.source.offsets[0]
    >>> assert newt.source.end == newt.source.offsets[-1]

The `pprint()` method (which stands for "pretty-print") is helpful for
showing the relationship between offsets and characters.  In the
following example, compare the pretty-printed document with the list
of offsets in newt's source:

    >>> print newt_doc.pprint(wrap='\n')
    [=======================newt.txt=======================]
                        1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2  2
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6  7
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+--+
    |S|h|e| |t|u|r|n|e|d| |m|e| |i|n|t|o| |a| |n|e|w|t|!|\n|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+--+
    [=========newt.txt========]
    2 2 2 3 3 3 3 3 3 3 3 3 3 4
    7 8 9 0 1 2 3 4 5 6 7 8 9 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+
    |I| |g|o|t| |b|e|t|t|e|r|.|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+
    >>> newt.source.offsets
    (21, 22, 23, 24, 25, 26)

At first, it may seem redundant to keep track of the offsets for every
character in a string -- for many strings, the offset of ``s[i]`` is
simply ``s.begin+i``.  However, when byte offsets are used to describe
unicode characters, we can no longer assume that the characters in a
string have consecutive offsets.
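To see why the two kinds of offsets can diverge, it helps to look at
plain Python first (this sketch does not use ``nltk`` at all).  In a
UTF-8 encoded document, an accented character occupies more than one
byte, so the byte offset at which each character starts is not simply
its character index::

    text = u'ma\xeetre'                    # 6 characters
    encoded = text.encode('utf-8')         # 7 bytes: u'\xee' takes 2 bytes

    # Compute the byte offset at which each character begins, plus a
    # final end offset -- the same convention used by ``offsets`` above.
    byte_offsets = []
    position = 0
    for char in text:
        byte_offsets.append(position)
        position += len(char.encode('utf-8'))
    byte_offsets.append(position)
    # byte_offsets == [0, 1, 2, 4, 5, 6, 7]: offset 3 never appears,
    # because the two-byte character u'\xee' spans byte offsets 2:4.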
In the following example, we construct a `SourcedString` from a utf-8
encoded byte string (thus ensuring that we are using byte offsets);
and then decode that string to unicode.  When we print the
`SourcedString`, we can see that several of its characters span two
bytes:

    >>> students_and_time = SourcedString("""\
    ... Le temps est un grand ma\xc3\xaetre, dit-on, le malheur est \
    ... qu'il tue ses \xc3\xa9l\xc3\xa8ves""", 'Berlioz').decode('utf-8')
    >>> print students_and_time.pprint()
    [==============================Berlioz===============================]
                        1 1 1 1 1 1 1 1 1 1 2 2 2 2 2      2 2 2 2 3 3 3 3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4      6 7 8 9 0 1 2 3
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+------+-+-+-+-+-+-+-+
    |L|e| |t|e|m|p|s| |e|s|t| |u|n| |g|r|a|n|d| |m|a|\u00ee|t|r|e|,| |d|i|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+------+-+-+-+-+-+-+-+
    [===============================Berlioz===============================]
    3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6
    3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |t|-|o|n|,| |l|e| |m|a|l|h|e|u|r| |e|s|t| |q|u|'|i|l| |t|u|e| |s|e|s| |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    [=======Berlioz=======]
    6      7 7      7 7 7 7
    8      0 1      3 4 5 6
    +------+-+------+-+-+-+
    |\u00e9|l|\u00e8|v|e|s|
    +------+-+------+-+-+-+

StringSource Subclasses
-----------------------

In order to efficiently encode the sources of strings whose characters
have consecutive offsets, while still accommodating strings whose
characters do not, the `StringSource` class defines two subclasses:

- `ConsecutiveCharStringSource` is used to describe the source of
  strings whose characters have consecutive offsets.  In particular,
  it is used for byte strings with byte offsets; and unicode strings
  with character offsets.  It is encoded using a document identifier,
  a begin offset, and an end offset:

      >>> ConsecutiveCharStringSource('newt.txt', begin=12, end=18)
      StringSource('newt.txt', begin=12, end=18)

- `ContiguousCharStringSource` is used to describe the source of
  strings whose characters are contiguous, but do not necessarily have
  consecutive offsets.  In particular, it is used for unicode strings
  with byte offsets.  It is encoded using a document identifier and a
  tuple of offsets (one per character, plus a final end offset).

      >>> ContiguousCharStringSource('newt.txt', offsets=[12, 15, 16, 18])
      StringSource('newt.txt', offsets=(12, 15, 16, 18))

The `StringSource` class itself is an abstract base class; but its
constructor automatically delegates to the appropriate subclass,
depending on how it was called:

    >>> StringSource('newt.txt', begin=12, end=18)
    StringSource('newt.txt', begin=12, end=18)
    >>> type(StringSource('newt.txt', begin=12, end=18))
    <class 'nltk.sourcedstring.ConsecutiveCharStringSource'>

    >>> StringSource('newt.txt', offsets=[12, 15, 16, 18])
    StringSource('newt.txt', offsets=(12, 15, 16, 18))
    >>> type(StringSource('newt.txt', offsets=[12, 15, 16, 18]))
    <class 'nltk.sourcedstring.ContiguousCharStringSource'>

SourcedString Source Attributes
-------------------------------

For convenience, the ``SourcedString`` class defines the attributes
``begin``, ``end``, and ``docid``.  Their value is identical to the
corresponding attribute of the string's source:

    >>> assert newt.begin == newt.source.begin
    >>> assert newt.end == newt.source.end
    >>> assert newt.docid == newt.source.docid

As we'll see below (in `Compound Sourced Strings`_), these three
attributes are only defined for "simple sourced strings" -- i.e.,
strings that correspond to a single substring of a document.  They are
not defined for "compound sourced strings," which are constructed by
concatenating strings from multiple sources.
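As an illustrative sketch (assuming you keep the original document
texts in a dictionary keyed by ``docid``; the ``documents`` mapping and
the helper below are not part of the NLTK API), these attributes make
it easy to check a simple sourced string against the document it was
taken from::

    documents = {'newt.txt': newt_contents}    # docid -> original text

    def original_text(sstring):
        # Only valid for *simple* sourced strings, which define
        # begin/end.  For a byte string, byte offsets can be used
        # directly as slice indices into the document text.
        document = documents[sstring.docid]
        return document[sstring.begin:sstring.end]

    assert original_text(newt) == 'newt!'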
Substrings of Sourced Strings
=============================

Operations that return substrings of a `SourcedString` (such as
slicing, indexing, and `split()`) return sourced strings:

    >>> newt_doc[4:10]
    'turned'@[4:10]
    >>> newt_doc[5]
    'u'@[5]
    >>> newt_doc.split() # doctest: +NORMALIZE_WHITESPACE
    ['She'@[0:3], 'turned'@[4:10], 'me'@[11:13], 'into'@[14:18],
     'a'@[19], 'newt!'@[21:26], 'I'@[27], 'got'@[29:32],
     'better.'@[33:40]]
    >>> newt_doc[:4].strip()
    'She'@[0:3]

Most regular expression operations will also return a `SourcedString`
when given a `SourcedString` as input:

    >>> import re
    >>> re.findall(r'\w*e\w*', newt_doc) # doctest: +NORMALIZE_WHITESPACE
    ['She'@[0:3], 'turned'@[4:10], 'me'@[11:13], 'newt'@[21:25],
     'better'@[33:39]]
    >>> re.search(r'\w+ed', newt_doc).group()
    'turned'@[4:10]

The exception to this rule is the regular expression substitution
operations, ``re.sub`` and ``re.subn``.  See `Limitations`_ for more
information.

Compound Sourced Strings
========================

When sourced strings are concatenated with other strings, the result
is a compound sourced string:

    >>> better_newt = 'My orange ' + newt_doc[21:25] + ' is ' + newt_doc[33:39]
    >>> print better_newt.pprint()
               [newt.tx]    [==newt.txt=]
               2 2 2 2 2    3 3 3 3 3 3 3
               1 2 3 4 5    3 4 5 6 7 8 9
    +----------+-+-+-+-+----+-+-+-+-+-+-+
    |My orange |n|e|w|t| is |b|e|t|t|e|r|
    +----------+-+-+-+-+----+-+-+-+-+-+-+
    >>> print repr(better_newt)
    'My orange newt is better'@[...,21:25,...,33:39]

Compound sourced strings keep track of the sources of all the
substrings they were composed from.  The pieces that make up a
compound sourced string can be retrieved using the ``substrings``
attribute:

    >>> better_newt.substrings
    ('My orange ', 'newt'@[21:25], ' is ', 'better'@[33:39])

The substrings of a compound sourced string are always either simple
sourced strings or Python strings, never compound sourced strings.

Slicing Compound Sourced Strings
--------------------------------

The type of object that is returned by slicing a compound sourced
string will depend on what portion of the compound sourced string is
covered by the slice.  If the slice falls within a single Python
substring, then it will be returned as a Python string:

    >>> better_newt[3:9] # Returns a Python string
    'orange'

If the slice falls within a single simple sourced string, then it will
be returned as a simple sourced string:

    >>> better_newt[10:13] # Returns a simple sourced string
    'new'@[21:24]

Otherwise, it will be returned as a compound sourced string:

    >>> better_newt[3:14] # Returns a compound sourced string
    'orange newt'@[...,21:25]

Note that a single-character sourced string may never be compound;
therefore, indexing a sourced string will always return either a
Python character or a simple sourced string:

    >>> better_newt[8]
    'e'
    >>> better_newt[10]
    'n'@[21]

If you are not sure what type of string will result from an operation,
you can use ``isinstance()`` to check whether it's a Python string, a
simple sourced string, or a compound sourced string:

    >>> def check_type(s):
    ...     if isinstance(s, SimpleSourcedString):
    ...         print 'simple sourced string'
    ...     elif isinstance(s, CompoundSourcedString):
    ...         print 'compound sourced string'
    ...     else:
    ...         print 'python string'
    >>> check_type(better_newt[1:2])
    python string
    >>> check_type(better_newt[10:13])
    simple sourced string
    >>> check_type(better_newt[3:14])
    compound sourced string

Alternatively, you can use ``hasattr()`` to check whether a substring
has a source:

    >>> hasattr(better_newt[1:2], 'source') # Python string
    False
    >>> hasattr(better_newt[10:13], 'source') # Simple sourced string
    True
    >>> hasattr(better_newt[3:14], 'source') # Compound sourced string
    False

Concatenating Compound Sourced Strings
--------------------------------------

When two compound sourced strings ``c1`` and ``c2`` are concatenated
together, the resulting compound sourced string ``c3`` does *not*
contain ``c1`` and ``c2`` themselves as substrings.  Instead, ``c3``
contains ``c1``'s substrings and ``c2``'s substrings.  This
"flattening" ensures that the substrings of a compound sourced string
will always be either Python strings or simple sourced strings, and
never compound sourced strings.

    >>> c1 = better_newt
    >>> c2 = ' than your ' + better_newt[3:14]
    >>> c3 = c1+c2
    >>> for substring in c3.substrings:
    ...     print '%25r %s' % (substring, type(substring).__name__)
                 'My orange ' str
               'newt'@[21:25] SimpleSourcedByteString
                       ' is ' str
             'better'@[33:39] SimpleSourcedByteString
         ' than your orange ' str
               'newt'@[21:25] SimpleSourcedByteString

Multi-Document Sourced Strings
------------------------------

It is possible to concatenate sourced strings that come from different
documents:

    >>> doc2 = SourcedString("Hello World", 'hello.txt')
    >>> cello = ("Does "+newt[:-1].capitalize()+"on like my " +
    ...          doc2[:5].replace('H','C')+"?")
    >>> print cello.pprint()
          [newt.tx]            [hello.t]
          2 2 2 2 2
          1 2 3 4 5            1 2 3 4 5
    +-----+-+-+-+-+------------+-+-+-+-+-+
    |Does |N|e|w|t|on like my C|e|l|l|o|?|
    +-----+-+-+-+-+------------+-+-+-+-+-+

Transforming Sourced Strings
============================

The `SourcedString` methods that return a modified string will
preserve source information whenever possible.

Case Modification
-----------------

Case modification methods return a sourced string with the same source
as the original string:

    >>> sent = newt_doc.split('\n')[0]
    >>> sent.lower()
    'she turned me into a newt!'@[0:26]
    >>> sent.title()
    'She Turned Me Into A Newt!'@[0:26]
    >>> better_newt.title()
    'My Orange Newt Is Better'@[...,21:25,...,33:39]

In particular, the characters that are modified keep their original
source information.  This is in contrast with the `replace()` method
(discussed below), where the replacement string has its own source
information.

Justification Methods
---------------------

The string justification methods preserve the source information of
the original string.
The padding substring will usually be sourceless (unless you supply a
sourced string as the fill character):

    >>> print newt.rjust(15).pprint()
               [=newt.txt]
               2 2 2 2 2 2
               1 2 3 4 5 6
    +----------+-+-+-+-+-+
    |          |n|e|w|t|!|
    +----------+-+-+-+-+-+
    >>> print newt.center(15, '.').pprint()
          [=newt.txt]
          2 2 2 2 2 2
          1 2 3 4 5 6
    +-----+-+-+-+-+-+-----+
    |.....|n|e|w|t|!|.....|
    +-----+-+-+-+-+-+-----+

Replacement Method
------------------

The ``replace`` method preserves source information for both the
original string and the replacement string:

    >>> print sent.replace('newt', doc2[6:]).pprint()
    [=================newt.txt================][hello.txt][n]
                        1 1 1 1 1 1 1 1 1 1 2 2        1 12 2
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 16 7 8 9 0 15 6
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-++-+
    |S|h|e| |t|u|r|n|e|d| |m|e| |i|n|t|o| |a| ||W|o|r|l|d||!|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-++-+

If the replacement string is a Python string, then the corresponding
substring will be sourceless:

    >>> print newt.replace('!', 'on').pprint()
    [newt.tx]
    2 2 2 2 2
    1 2 3 4 5
    +-+-+-+-+--+
    |n|e|w|t|on|
    +-+-+-+-+--+

Other Modifications
-------------------

Unfortunately, several other modification methods (such as
``str.join()`` and ``re.sub()``) do not always preserve source
information.  See `Limitations`_ for more details.

Unicode and Sourced Strings
===========================

The `SourcedString` class is an abstract base class.  It defines two
abstract subclasses, each of which defines two concrete subclasses::

    SourcedString (abstract)
      |
      +-- SimpleSourcedString (abstract)
      |     |
      |     +-- SimpleSourcedByteString
      |     |
      |     +-- SimpleSourcedUnicodeString
      |
      +-- CompoundSourcedString (abstract)
            |
            +-- CompoundSourcedByteString
            |
            +-- CompoundSourcedUnicodeString

The two ``-ByteString`` classes are subclassed from ``str``; and the
two ``-UnicodeString`` classes are subclassed from ``unicode``.  When
the `SourcedString` constructor is called directly, it will delegate
to the appropriate subclass, based on the type of the content string:

    >>> byte_str = 'He was a tall lumberjack.'
    >>> SourcedString(byte_str, 'lumberjack.txt')
    'He was a tall lumberjack.'@[0:25]
    >>> type(SourcedString(byte_str, 'lumberjack.txt'))
    <class 'nltk.sourcedstring.SimpleSourcedByteString'>

    >>> unicode_str = u'He was a tall lumberjack.'
    >>> SourcedString(unicode_str, 'lumberjack.txt')
    u'He was a tall lumberjack.'@[0:25]
    >>> type(SourcedString(unicode_str, 'lumberjack.txt'))
    <class 'nltk.sourcedstring.SimpleSourcedUnicodeString'>

The ``CompoundSourced*String`` classes are not usually instantiated
directly; instead, they are created by concatenating sourced strings
with other strings.  See `Compound Sourced Strings`_ for details.

Equality
========

Two sourced strings are considered equal if their contents are equal,
even if their sources differ:

    >>> newt_doc[3], newt_doc[10]
    (' '@[3], ' '@[10])
    >>> newt_doc[3] == newt_doc[10]
    True

Sourced strings may also be compared for equality with non-sourced
strings:

    >>> newt == 'newt!'
    True
    >>> cello == "Does Newton like my Cello?"
    True

The fact that string equality ignores sources is important in ensuring
that sourced strings act like normal strings.  In particular, it
allows sourced strings to be used with code that was originally
intended to process plain Python strings.  E.g., this fact allows
sourced strings to be parsed by standard parsing algorithms (which
have no knowledge of sourced strings).
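For instance (a sketch, not part of the doctest suite above), a helper
written with plain Python strings in mind, such as a simple stop-word
filter, works unchanged on sourced strings, and the words it returns
still remember where they came from::

    STOPWORDS = set(['a', 'i', 'me'])

    def remove_stopwords(words):
        # Equality and set membership compare string *contents* only,
        # so this code neither knows nor cares that its inputs are
        # sourced strings.
        return [w for w in words if w.lower() not in STOPWORDS]

    content_words = remove_stopwords(newt_doc.split())
    # content_words still contains sourced strings such as
    # 'turned'@[4:10] and 'newt!'@[21:26].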
If you wish to determine whether two simple sourced strings correspond
to the same location in a document, simply compare their ``source``
attributes:

    >>> x = newt_doc[4:10]
    >>> y = newt_doc.split()[1]
    >>> z = x.upper()
    >>> (x, y, z)
    ('turned'@[4:10], 'turned'@[4:10], 'TURNED'@[4:10])
    >>> x==y, x.source==y.source
    (True, True)
    >>> x==z, x.source==z.source
    (False, True)

If you may be dealing with compound sourced strings, then you should
use the ``sources`` attribute instead.  This attribute is defined for
both simple and compound sourced strings, and contains a sorted tuple
of ``(index, source)`` pairs.  Each such pair specifies that the
source of the substring starting at ``index``, and extending
``len(source)`` characters, is ``source``:

    >>> newt.sources
    ((0, StringSource('newt.txt', begin=21, end=26)),)
    >>> cello.sources # doctest: +NORMALIZE_WHITESPACE
    ((5, StringSource('newt.txt', begin=21, end=25)),
     (20, StringSource('hello.txt', begin=0, end=0)),
     (21, StringSource('hello.txt', begin=1, end=5)))

If you wish to compare two strings, and they might be simple sourced
strings, compound sourced strings, or plain Python strings, then you
can use ``getattr(s, 'sources', ())``, which will return ``s.sources``
for sourced strings, and ``()`` for plain Python strings:

    >>> print getattr(cello[:4], 'sources', ())
    ()
    >>> print getattr(cello[5:9], 'sources', ())
    ((0, StringSource('newt.txt', begin=21, end=25)),)
    >>> print getattr(cello[17:], 'sources', ()) # doctest: +NORMALIZE_WHITESPACE
    ((3, StringSource('hello.txt', begin=0, end=0)),
     (4, StringSource('hello.txt', begin=1, end=5)))

Sourced Strings as Dictionary Keys and Set Values
==================================================

When sourced strings are used as dictionary keys, or placed in sets,
we would sometimes like to ensure that strings with different sources
are treated as different values.  However, the fact that sourced
string equality ignores sources makes this impossible.  To get around
this problem, you can use the sourced string's ``source`` (or
``sources`` for compound strings), or a tuple containing the sourced
string and its ``source``, as a dictionary key or set value:

    >>> animals_contents = 'the dog and the cat'
    >>> animals = SourcedString(animals_contents, source='animals.txt')

    >>> # Create a list of words, including some case-normalized duplicates
    >>> words = animals.split() + re.findall('DOG|CAT', animals.upper())
    >>> sorted(words) # doctest: +NORMALIZE_WHITESPACE
    ['CAT'@[16:19], 'DOG'@[4:7], 'and'@[8:11], 'cat'@[16:19],
     'dog'@[4:7], 'the'@[0:3], 'the'@[12:15]]

    >>> # Show the set of unique words (using string equality).  Note
    >>> # that the second occurrence of 'the' was discarded.
    >>> sorted(set(words)) # doctest: +NORMALIZE_WHITESPACE
    ['CAT'@[16:19], 'DOG'@[4:7], 'and'@[8:11], 'cat'@[16:19],
     'dog'@[4:7], 'the'@[0:3]]

    >>> # Show the set of locations where words occur.  Note that
    >>> # the locations of 'cat' and 'dog' each appear only once.
    >>> sorted(set(word.source for word in words)) # doctest: +NORMALIZE_WHITESPACE
    [StringSource('animals.txt', begin=0, end=3),
     StringSource('animals.txt', begin=4, end=7),
     StringSource('animals.txt', begin=8, end=11),
     StringSource('animals.txt', begin=12, end=15),
     StringSource('animals.txt', begin=16, end=19)]

    >>> # Show the set of unique (string, location) pairs.  Note
    >>> # that both occurrences of 'the' appear; and that both copies
    >>> # of 'dog' and 'cat' appear.
    >>> sorted(set((word.source, word) for word in words)) # doctest: +NORMALIZE_WHITESPACE
    [(StringSource('animals.txt', begin=0, end=3), 'the'@[0:3]),
     (StringSource('animals.txt', begin=4, end=7), 'DOG'@[4:7]),
     (StringSource('animals.txt', begin=4, end=7), 'dog'@[4:7]),
     (StringSource('animals.txt', begin=8, end=11), 'and'@[8:11]),
     (StringSource('animals.txt', begin=12, end=15), 'the'@[12:15]),
     (StringSource('animals.txt', begin=16, end=19), 'CAT'@[16:19]),
     (StringSource('animals.txt', begin=16, end=19), 'cat'@[16:19])]

Limitations
===========

Some types of string manipulation can cause source information to be
lost.  In particular, functions and methods that access a sourced
string using the low-level "buffer" interface will often bypass the
sourced string's ability to preserve source information.  Operations
that are known to result in a loss of source information are listed
below:

- ``str.join()``, where the joining string is not sourced:

      >>> '+'.join(sent.split())
      'She+turned+me+into+a+newt!'

- ``str.replace()``, where the original string is not sourced:

      >>> turned = newt_doc.split()[1]
      >>> 'I twisted around'.replace('twisted', turned)
      'I turned around'

- String formatting, where the format string is not sourced:

      >>> 'My %s is %s' % (newt_doc[21:25], newt_doc[33:39])
      'My newt is better'

- Regular expression substitution, where the regular expression
  pattern string is not sourced:

      >>> re.sub('orange', 'green', better_newt)
      'My green newt is better'
      >>> re.subn('orange', 'green', better_newt, 1)
      ('My green newt is better', 1)

- String justification methods, where the string being justified is
  unsourced but the fill character is sourced:

      >>> 'coconut'.center(25, newt[-1])
      '!!!!!!!!!coconut!!!!!!!!!'
      >>> 'coconut'.ljust(25, newt[-1])
      'coconut!!!!!!!!!!!!!!!!!!'
      >>> 'coconut'.rjust(25, newt[-1])
      '!!!!!!!!!!!!!!!!!!coconut'

.. ======================= Regression Tests ===========================

Regression Tests
================

String Sources
--------------

ConsecutiveCharStringSource
~~~~~~~~~~~~~~~~~~~~~~~~~~~

String representations:

    >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 18)
    >>> repr(source)
    "StringSource('coconut.txt', begin=5, end=18)"
    >>> str(source)
    '@coconut.txt[5:18]'

Attributes:

    >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 18)
    >>> source.begin, source.end, source.docid
    (5, 18, 'coconut.txt')
    >>> source.docid
    'coconut.txt'
    >>> source.offsets
    (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)

Begin and end must be integers (or longs):

    >>> ConsecutiveCharStringSource('coconut.txt', 5, 10)
    StringSource('coconut.txt', begin=5, end=10)
    >>> ConsecutiveCharStringSource('coconut.txt', 5L, 10L)
    StringSource('coconut.txt', begin=5L, end=10L)
    >>> ConsecutiveCharStringSource('coconut.txt', 5.3, 10)
    Traceback (most recent call last):
      . . .
    TypeError: begin attribute expected an integer
    >>> ConsecutiveCharStringSource('coconut.txt', 5, 10.3)
    Traceback (most recent call last):
      . . .
    TypeError: end attribute expected an integer
    >>> ConsecutiveCharStringSource('coconut.txt', 'five', 10)
    Traceback (most recent call last):
      . . .
    TypeError: begin attribute expected an integer
    >>> ConsecutiveCharStringSource('coconut.txt', 5, 'ten')
    Traceback (most recent call last):
      . . .
TypeError: end attribute expected an integer The end index must be greater than or equal to the begin offset: >>> ConsecutiveCharStringSource('coconut.txt', 5, 6) StringSource('coconut.txt', begin=5, end=6) >>> ConsecutiveCharStringSource('coconut.txt', 5, 5) StringSource('coconut.txt', begin=5, end=5) >>> ConsecutiveCharStringSource('coconut.txt', 5, 4) Traceback (most recent call last): . . . ValueError: begin must be less than or equal to end The begin and end offsets may be negative: >>> ConsecutiveCharStringSource('coconut.txt', -5, 5) StringSource('coconut.txt', begin=-5, end=5) >>> ConsecutiveCharStringSource('coconut.txt', -5, -2) StringSource('coconut.txt', begin=-5, end=-2) Length-1 source: >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 6) >>> repr(source) "StringSource('coconut.txt', begin=5, end=6)" >>> str(source) '@coconut.txt[5]' >>> len(source) 1 >>> source.begin, source.end, source.docid (5, 6, 'coconut.txt') >>> source.offsets (5, 6) zero-length source: >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 5) >>> repr(source) "StringSource('coconut.txt', begin=5, end=5)" >>> str(source) '@coconut.txt[5:5]' >>> len(source) 0 >>> source.begin, source.end, source.docid (5, 5, 'coconut.txt') >>> source.offsets (5,) Indexing: >>> source = ConsecutiveCharStringSource('coconut.txt', 15, 21) >>> for i in range(-len(source), len(source)): ... print ' source[%2d] = %r' % (i, source[i]) source[-6] = StringSource('coconut.txt', begin=15, end=16) source[-5] = StringSource('coconut.txt', begin=16, end=17) source[-4] = StringSource('coconut.txt', begin=17, end=18) source[-3] = StringSource('coconut.txt', begin=18, end=19) source[-2] = StringSource('coconut.txt', begin=19, end=20) source[-1] = StringSource('coconut.txt', begin=20, end=21) source[ 0] = StringSource('coconut.txt', begin=15, end=16) source[ 1] = StringSource('coconut.txt', begin=16, end=17) source[ 2] = StringSource('coconut.txt', begin=17, end=18) source[ 3] = StringSource('coconut.txt', begin=18, end=19) source[ 4] = StringSource('coconut.txt', begin=19, end=20) source[ 5] = StringSource('coconut.txt', begin=20, end=21) >>> source[len(source)] Traceback (most recent call last): . . . IndexError: StringSource index out of range >>> source[-len(source)-1] Traceback (most recent call last): . . . IndexError: StringSource index out of range Slicing: >>> def slice_test(source, *indices): ... """Print a table showing the result of slicing the given ... source, using each of the given indices as a start or end ... index for the slice.""" ... print ' |'+' '.join(str(j).center(5) for j in indices) ... print '-----+'+'------'*len(indices) ... for i in indices: ... print '%4s |' % i, ... for j in indices: ... if i is None and j is None: sliced_source = source[:] ... elif i is None: sliced_source = source[:j] ... elif j is None: sliced_source = source[i:] ... else: sliced_source = source[i:j] ... print '%2s:%-2s' % (sliced_source.begin, sliced_source.end), ... assert sliced_source.docid == 'coconut.txt' ... print ... >>> source = ConsecutiveCharStringSource('coconut.txt', 15, 28) >>> slice_test(source, None, 0, 1, len(source)-1, len(source), 100, ... 
-1, -len(source)+1, -len(source), -100) | None 0 1 12 13 100 -1 -12 -13 -100 -----+------------------------------------------------------------ None | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 0 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 1 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16 12 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27 13 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 100 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 -1 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27 -12 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16 -13 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 -100 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 >>> source = ConsecutiveCharStringSource('coconut.txt', 50, 53) >>> slice_test(source, -4, -3, 0, -2, 1, -1, 2, 3, 4, 5) | -4 -3 0 -2 1 -1 2 3 4 5 -----+------------------------------------------------------------ -4 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53 -3 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53 0 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53 -2 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53 1 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53 -1 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53 2 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53 3 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 4 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 5 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 ContiguousCharStringSource ~~~~~~~~~~~~~~~~~~~~~~~~~~ String representations: >>> offsets = [5, 7, 8, 9, 13, 18] >>> source = ContiguousCharStringSource('coconut.txt', offsets) >>> repr(source) "StringSource('coconut.txt', offsets=(5, 7, 8, 9, 13, 18))" >>> str(source) '@coconut.txt[5:18]' Attributes: >>> source = ContiguousCharStringSource('coconut.txt', offsets) >>> source.begin, source.end, source.docid (5, 18, 'coconut.txt') >>> source.docid 'coconut.txt' >>> source.offsets (5, 7, 8, 9, 13, 18) Offsets must be integers (or longs): >>> ContiguousCharStringSource.CONSTRUCTOR_CHECKS_OFFSETS = True >>> ContiguousCharStringSource('coconut.txt', [5, 6L, 7]) StringSource('coconut.txt', offsets=(5, 6L, 7)) >>> ContiguousCharStringSource('coconut.txt', [6.2]) Traceback (most recent call last): . . . TypeError: offsets must be integers >>> ContiguousCharStringSource('coconut.txt', ['five']) Traceback (most recent call last): . . . TypeError: offsets must be integers Offsets must be monotonic increasing: >>> ContiguousCharStringSource('coconut.txt', [5, 6, 7]) StringSource('coconut.txt', offsets=(5, 6, 7)) >>> ContiguousCharStringSource('coconut.txt', [5, 5]) StringSource('coconut.txt', offsets=(5, 5)) >>> ContiguousCharStringSource('coconut.txt', [7, 6, 5]) Traceback (most recent call last): . . . 
TypeError: offsets must be monotonic increasing Offsets may be negative: >>> ContiguousCharStringSource('coconut.txt', [-5, 5]) StringSource('coconut.txt', offsets=(-5, 5)) >>> ContiguousCharStringSource('coconut.txt', [-5, -2]) StringSource('coconut.txt', offsets=(-5, -2)) Length-1 source: >>> source = ContiguousCharStringSource('coconut.txt', [5,6]) >>> repr(source) "StringSource('coconut.txt', offsets=(5, 6))" >>> str(source) '@coconut.txt[5]' >>> len(source) 1 >>> source.begin, source.end, source.docid (5, 6, 'coconut.txt') >>> source.offsets (5, 6) zero-length source: >>> source = ContiguousCharStringSource('coconut.txt', [5]) >>> repr(source) "StringSource('coconut.txt', offsets=(5,))" >>> str(source) '@coconut.txt[5:5]' >>> len(source) 0 >>> source.begin, source.end, source.docid (5, 5, 'coconut.txt') >>> source.offsets (5,) Indexing: >>> source = ContiguousCharStringSource('coconut.txt', range(15, 22)) >>> for i in range(-len(source), len(source)): ... print ' source[%2d] = %r' % (i, source[i]) source[-6] = StringSource('coconut.txt', offsets=(15, 16)) source[-5] = StringSource('coconut.txt', offsets=(16, 17)) source[-4] = StringSource('coconut.txt', offsets=(17, 18)) source[-3] = StringSource('coconut.txt', offsets=(18, 19)) source[-2] = StringSource('coconut.txt', offsets=(19, 20)) source[-1] = StringSource('coconut.txt', offsets=(20, 21)) source[ 0] = StringSource('coconut.txt', offsets=(15, 16)) source[ 1] = StringSource('coconut.txt', offsets=(16, 17)) source[ 2] = StringSource('coconut.txt', offsets=(17, 18)) source[ 3] = StringSource('coconut.txt', offsets=(18, 19)) source[ 4] = StringSource('coconut.txt', offsets=(19, 20)) source[ 5] = StringSource('coconut.txt', offsets=(20, 21)) >>> source[len(source)] Traceback (most recent call last): . . . IndexError: StringSource index out of range >>> source[-len(source)-1] Traceback (most recent call last): . . . IndexError: StringSource index out of range Slicing: >>> def slice_test(source, *indices): ... """Print a table showing the result of slicing the given ... source, using each of the given indices as a start or end ... index for the slice.""" ... print ' |'+' '.join(str(j).center(5) for j in indices) ... print '-----+'+'------'*len(indices) ... for i in indices: ... print '%4s |' % i, ... for j in indices: ... if i is None and j is None: sliced_source = source[:] ... elif i is None: sliced_source = source[:j] ... elif j is None: sliced_source = source[i:] ... else: sliced_source = source[i:j] ... print '%2s:%-2s' % (sliced_source.begin, sliced_source.end), ... assert sliced_source.docid == 'coconut.txt' ... print ... >>> source = ContiguousCharStringSource('coconut.txt', range(15, 29)) >>> slice_test(source, None, 0, 1, len(source)-1, len(source), 100, ... 
-1, -len(source)+1, -len(source), -100) | None 0 1 12 13 100 -1 -12 -13 -100 -----+------------------------------------------------------------ None | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 0 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 1 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16 12 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27 13 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 100 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 -1 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27 -12 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16 -13 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 -100 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15 >>> source = ContiguousCharStringSource('coconut.txt', range(50, 54)) >>> slice_test(source, -4, -3, 0, -2, 1, -1, 2, 3, 4, 5) | -4 -3 0 -2 1 -1 2 3 4 5 -----+------------------------------------------------------------ -4 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53 -3 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53 0 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53 -2 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53 1 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53 -1 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53 2 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53 3 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 4 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 5 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 Sourced Strings --------------- The following helper function that checks a sourced string's characters to make sure that the string they come from is what it should be. It looks in ``check.documents[docid]`` for the text of the document named ``docid``. >>> def check(sourced_string): ... for char in sourced_string: ... if isinstance(char, SourcedString): ... document = check.documents[char.docid] ... source_char = document[char.begin:char.end] ... assert (char == source_char or ... (isinstance(source_char, str) and ... isinstance(char, unicode) and ... char.decode('utf-8') == source_char)) >>> check.documents = {} Constructing string tokens: >>> from nltk.data import * >>> s = ("Good muffins cost $3.88\nin New York. Please buy me\n" ... "two of them.\n\nThanks.") >>> check.documents['muffins.txt'] = s >>> doc = SourcedString(s, source='muffins.txt') SourcedString indexing: >>> for i in 0, 1, 50, 72: ... check(doc[i]); check(doc[-i]) ... assert doc[i] == s[i] ... assert doc[-i] == s[-i] >>> doc[8], s[8] ('f'@[8], 'f') >>> doc[-1], s[-1], s[72] ('.'@[72], '.', '.') >>> doc[-5], s[-5], s[68] ('a'@[68], 'a', 'a') >>> doc[74] Traceback (most recent call last): . . . IndexError: string index out of range >>> doc[-74] Traceback (most recent call last): . . . IndexError: string index out of range >>> in_new_york = doc[-49:-38]; in_new_york 'in New York'@[24:35] >>> doc[27], in_new_york[3], s[27] ('N'@[27], 'N'@[27], 'N') SourcedString slicing: >>> len(in_new_york) 11 >>> def test_slice(sstring, string, print_indices, start, stop): ... check(sstring) ... assert (sstring == string) ... if (start in print_indices and stop in print_indices): ... s_repr = re.sub(r'^(.{30}).*(.{15})$', r'\1...\2', ... repr(sstring)) ... print 's[%4s:%4s] = %r' % (start, stop, s_repr) >>> def test_slices(sstring, string, test_indices, print_indices): ... 
test_slice(sstring[:], string[:], print_indices, '', '') ... for i in test_indices: ... test_slice(sstring[i:], string[i:], print_indices, i, '') ... test_slice(sstring[:i], string[:i], print_indices, '', i) ... for start in test_indices: ... for stop in test_indices: ... test_slice(sstring[start:stop], string[start:stop], ... print_indices, start, stop) >>> test_slices(in_new_york, 'in New York', ... range(-12, 13)+[None,100,-100, -20, 20], ... ('', 0, 1, -1, 5)) s[ : ] = "'in New York'@[24:35]" s[ -1: ] = "'k'@[34]" s[ : -1] = "'in New Yor'@[24:34]" s[ 0: ] = "'in New York'@[24:35]" s[ : 0] = "''@[24:24]" s[ 1: ] = "'n New York'@[25:35]" s[ : 1] = "'i'@[24]" s[ 5: ] = "'w York'@[29:35]" s[ : 5] = "'in Ne'@[24:29]" s[ -1: -1] = "''@[34:34]" s[ -1: 0] = "''@[34:34]" s[ -1: 1] = "''@[34:34]" s[ -1: 5] = "''@[34:34]" s[ 0: -1] = "'in New Yor'@[24:34]" s[ 0: 0] = "''@[24:24]" s[ 0: 1] = "'i'@[24]" s[ 0: 5] = "'in Ne'@[24:29]" s[ 1: -1] = "'n New Yor'@[25:34]" s[ 1: 0] = "''@[25:25]" s[ 1: 1] = "''@[25:25]" s[ 1: 5] = "'n Ne'@[25:29]" s[ 5: -1] = "'w Yor'@[29:34]" s[ 5: 0] = "''@[29:29]" s[ 5: 1] = "''@[29:29]" s[ 5: 5] = "''@[29:29]" >>> check(in_new_york[:]) >>> assert (in_new_york[:] == ... 'in New York'[:]) >>> for i in range(-12, 13)+[None,100,-100, -20, 20]: ... check(in_new_york[:i]) ... check(in_new_york[i:]) ... assert (in_new_york[i:] == 'in New York'[i:]) ... assert (in_new_york[:i] == 'in New York'[:i]) Misc other tests: >>> doc[5:12] 'muffins'@[5:12] >>> doc[:4] 'Good'@[0:4] >>> doc[-7:] 'Thanks.'@[66:73] >>> doc[-7:-1] 'Thanks'@[66:72] >>> doc[-46:-38] 'New York'@[27:35] >>> tok = doc[-49:-38] >>> tok[:] 'in New York'@[24:35] >>> tok[:2] 'in'@[24:26] >>> tok[3:] 'New York'@[27:35] >>> tok[3:4] 'N'@[27] When a token slice is taken, and the step is not 1, a plain unicode string is returned: >>> tok[::-1] 'kroY weN ni' >>> tok[1:-1:2] 'nNwYr' Regular expressions can be used to search SourcedStrings: >>> import re >>> intoks = re.findall('in', doc) >>> print intoks ['in'@[9:11], 'in'@[24:26]] Two tokens with the same string contents compare equal even if their source/begin/end differ: >>> intoks[0] == intoks[1] True Sourced strings can also be compared for equality with simple strings: >>> intoks[0] == 'in' True Case manipulation: >>> tok.capitalize() 'In new york'@[24:35] >>> tok.lower() 'in new york'@[24:35] >>> tok.upper() 'IN NEW YORK'@[24:35] >>> tok.swapcase() 'IN nEW yORK'@[24:35] >>> tok.title() 'In New York'@[24:35] Stripping: >>> wstok = SourcedString(u' Test ', 'source') >>> wstok.lstrip() u'Test '@[3:10] >>> wstok.rstrip() u' Test'@[0:7] >>> wstok.strip() u'Test'@[3:7] Splitting: >>> doc.split() # doctest: +NORMALIZE_WHITESPACE ['Good'@[0:4], 'muffins'@[5:12], 'cost'@[13:17], '$3.88'@[18:23], 'in'@[24:26], 'New'@[27:30], 'York.'@[31:36], 'Please'@[38:44], 'buy'@[45:48], 'me'@[49:51], 'two'@[52:55], 'of'@[56:58], 'them.'@[59:64], 'Thanks.'@[66:73]] >>> doc.split(None, 5) # doctest: +NORMALIZE_WHITESPACE ['Good'@[0:4], 'muffins'@[5:12], 'cost'@[13:17], '$3.88'@[18:23], 'in'@[24:26], 'New York. Please buy me\ntwo of them.\n\nThanks.'@[27:73]] >>> doc.split('\n') # doctest: +NORMALIZE_WHITESPACE ['Good muffins cost $3.88'@[0:23], 'in New York. Please buy me'@[24:51], 'two of them.'@[52:64], ''@[65:65], 'Thanks.'@[66:73]] >>> doc.split('\n', 1) # doctest: +NORMALIZE_WHITESPACE ['Good muffins cost $3.88'@[0:23], 'in New York. 
Please buy me\ntwo of them.\n\nThanks.'@[24:73]] >>> doc.rsplit() # doctest: +NORMALIZE_WHITESPACE ['Good'@[0:4], 'muffins'@[5:12], 'cost'@[13:17], '$3.88'@[18:23], 'in'@[24:26], 'New'@[27:30], 'York.'@[31:36], 'Please'@[38:44], 'buy'@[45:48], 'me'@[49:51], 'two'@[52:55], 'of'@[56:58], 'them.'@[59:64], 'Thanks.'@[66:73]] >>> doc.rsplit(None, 5) # doctest: +NORMALIZE_WHITESPACE ['Good muffins cost $3.88\nin New York. Please buy'@[0:48], 'me'@[49:51], 'two'@[52:55], 'of'@[56:58], 'them.'@[59:64], 'Thanks.'@[66:73]] >>> doc.rsplit('\n') # doctest: +NORMALIZE_WHITESPACE ['Good muffins cost $3.88'@[0:23], 'in New York. Please buy me'@[24:51], 'two of them.'@[52:64], ''@[65:65], 'Thanks.'@[66:73]] >>> doc.rsplit('\n', 3) # doctest: +NORMALIZE_WHITESPACE ['Good muffins cost $3.88\nin New York. Please buy me'@[0:51], 'two of them.'@[52:64], ''@[65:65], 'Thanks.'@[66:73]] Adding adjacent string tokens gives new string tokens: >>> doc[:4] + doc[4:12] 'Good muffins'@[0:12] Adding empty strings to string tokens gives string tokens: >>> tok + '' 'in New York'@[24:35] >>> '' + tok 'in New York'@[24:35] All other add operations give basic strings: >>> 'not '+tok 'not in New York'@[...,24:35] >>> doc[:4] + doc[12:17] 'Good cost'@[0:4,12:17] Regexps: >>> sent = newt_doc.split('\n')[1] >>> re.sub('better', 'worse', sent) 'I got worse.' >>> SourcedStringRegexp('better').sub('worse', sent) 'I got worse.'@[27:33,...,39:40] >>> SourcedStringRegexp.patch_re_module() >>> re.sub('better', 'worse', sent) 'I got worse.'@[27:33,...,39:40] >>> SourcedStringRegexp.unpatch_re_module() >>> re.sub('better', 'worse', sent) 'I got worse.' Str/Unicode Interactions ------------------------ >>> x = SourcedString('byte string \xcc', 'str') >>> y = SourcedString(u'unicode string \ubbbb', 'unicode') Any operation that combines a byte string with a unicode string will first decode the byte string using the default encoding. As a result, all of the following operations raise an exception (since the string ``x`` can't be decoded using the ASCII encoding): >>> x+y Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y+x Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.__radd__(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.__radd__(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.find(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.find(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.lstrip(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.lstrip(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.rstrip(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.rstrip(x) Traceback (most recent call last): . . . 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.strip(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.strip(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.split(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.split(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.rsplit(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.rsplit(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.partition(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.partition(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.rpartition(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.rpartition(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.join(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.join(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128) >>> x.center(100, y[-1]) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.center(100, x[-1]) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128) >>> x.ljust(100, y[-1]) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.ljust(100, x[-1]) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128) >>> x.rjust(100, y[-1]) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.rjust(100, x[-1]) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128) >>> x.find(y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.find(x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.replace('x', y) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x.replace(y, 'x') Traceback (most recent call last): . . . 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.replace('x', x) Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> y.replace(x, 'x') Traceback (most recent call last): . . . UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128) >>> x = SourcedString('ascii byte string', 'str') >>> y = SourcedString(u'unicode string \ubbbb', 'unicode') But these will all work, because x is ASCII: >>> x+y u'ascii byte stringunicode string \ubbbb'@[0:17,0:16] >>> y+x u'unicode string \ubbbbascii byte string'@[0:16,0:17] >>> x.__radd__(y) u'unicode string \ubbbbascii byte string'@[0:16,0:17] >>> y.__radd__(x) u'ascii byte stringunicode string \ubbbb'@[0:17,0:16] >>> x.find(y) -1 >>> y.find(x) -1 >>> x.lstrip(y) 'ascii byte string'@[0:17] >>> y.lstrip(x) u'unicode string \ubbbb'@[0:16] >>> x.rstrip(y) 'ascii by'@[0:8] >>> y.rstrip(x) u'unicode string \ubbbb'@[0:16] >>> x.strip(y) 'ascii by'@[0:8] >>> y.strip(x) u'unicode string \ubbbb'@[0:16] >>> x.split(y) [u'ascii byte string'@[0:17]] >>> y.split(x) [u'unicode string \ubbbb'@[0:16]] >>> x.rsplit(y) [u'ascii byte string'@[0:17]] >>> y.rsplit(x) [u'unicode string \ubbbb'@[0:16]] >>> x.partition(y) ('ascii byte string'@[0:17], ''@[17:17], ''@[17:17]) >>> y.partition(x) (u'unicode string \ubbbb'@[0:16], u''@[16:16], u''@[16:16]) >>> x.rpartition(y) ('ascii byte string'@[0:17], ''@[17:17], ''@[17:17]) >>> y.rpartition(x) (u''@[0:0], u''@[0:0], u'unicode string \ubbbb'@[0:16]) >>> x.join(y) # doctest: +ELLIPSIS u'uascii byte stringnascii byte stringiascii byte stri...5:16] >>> y.join(x) # doctest: +ELLIPSIS u'aunicode string \ubbbbsunicode string \ubbbbcunicode...6:17] >>> x.center(100, y[-1]) # doctest: +ELLIPSIS u'\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubb...5:16] >>> y.center(100, x[-1]) # doctest: +ELLIPSIS u'ggggggggggggggggggggggggggggggggggggggggggunicode st...6:17] >>> x.ljust(100, y[-1]) # doctest: +ELLIPSIS u'ascii byte string\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbb...5:16] >>> y.ljust(100, x[-1]) # doctest: +ELLIPSIS u'unicode string \ubbbbggggggggggggggggggggggggggggggg...6:17] >>> x.rjust(100, y[-1]) # doctest: +ELLIPSIS u'\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubb...0:17] >>> y.rjust(100, x[-1]) # doctest: +ELLIPSIS u'gggggggggggggggggggggggggggggggggggggggggggggggggggg...0:16] >>> x.find(y) -1 >>> y.find(x) -1 >>> x.replace('x', y) u'ascii byte string'@[0:17] >>> x.replace(y, 'x') u'ascii byte string'@[0:17] >>> y.replace('x', x) u'unicode string \ubbbb'@[0:16] >>> y.replace(x, 'x') u'unicode string \ubbbb'@[0:16] Translate >>> table = [chr(i) for i in range(256)] >>> table[ord('e')] = '3' >>> table[ord('!')] = '|' >>> newt.translate(''.join(table)) 'n3wt|'@[21:26] >>> newt.translate(''.join(table), 'n3t') '3w|'@[22:24,25:26] >>> newt.decode().translate({'e':'3', '!':'*'}) u'n3wt*'@[21:26] >>> newt.decode().translate({'e':'3', '!':'*', 'w': None}) u'n3t*'@[21:23,24:26]