Question 1

One way to do it is to use findall with a regex that greedily matches things that can go between separators, e.g.:

>>> import re
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
The regex above matches one or more of:
non-comma, non-open-paren characters
strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren
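One quirk of this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature, depending on your use case.
Also note that regexes are not suitable for cases where nesting is a possibility. So, for example, this would split incorrectly: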
"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
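If you need to deal with nesting, your best bet is to partition the string into parens, commas, and everything else (essentially tokenizing it; this part can still be done with regexes), and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go (keeping track of the nesting level is what regexes cannot do on their own).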
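The tokenize-then-reassemble idea above can be sketched as follows. This is only a minimal illustration, not code from the original answer: the function name `split_top_level` is made up here, and it assumes that commas are the separators and that round parentheses are the only nesting characters.

```python
import re

def split_top_level(text):
    """Split text on commas that are not inside (possibly nested) parentheses.

    Hypothetical helper: the name and behavior are assumptions for illustration.
    """
    fields, current, depth = [], [], 0
    # Tokenize into single parens, single commas, and runs of anything else.
    for token in re.findall(r"[(),]|[^(),]+", text):
        if token == "(":
            depth += 1
        elif token == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        if token == "," and depth == 0:
            # A top-level comma closes the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(token)
    fields.append("".join(current).strip())
    return fields

nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
print(split_top_level(nested))
```

Unlike the findall approach, this sketch keeps empty fields, so adjacent separators produce an empty string.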

Question 2

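    s = re.split(r',\s*(?=[^)]*(?:\(|$))', x)

The lookahead matches everything up to the next open parenthesis or to the end of the string, if and only if there is no close parenthesis in between. That ensures the comma is not inside a set of parentheses.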
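As a quick check of the lookahead, here is the same pattern applied to the sample string used in the other answers. The constant name `TOP_LEVEL_COMMA` is just a label chosen for this sketch.

```python
import re

# Split on a comma (plus trailing whitespace) only when no ")" appears
# before the next "(" or before the end of the string.
TOP_LEVEL_COMMA = re.compile(r',\s*(?=[^)]*(?:\(|$))')

x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
fields = TOP_LEVEL_COMMA.split(x)
print(fields)
# ['Wilbur Smith (Billy, son of John)', 'Eddie Murphy (John)', 'Elvis Presley', 'Jane Doe (Jane Doe)']
```

Because the pattern consumes the whitespace after each comma, the fields come back without leading spaces.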

Question 3

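I think the best way to approach this is to use Python's built-in csv module. Because the csv module only allows a one-character quotechar, you would need to do a replace on your input to convert () to something like | or ". Then make sure you are using an appropriate dialect and off you go.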

Question 4

An attempt at a human-readable regex:

    import re

    regex = re.compile(r"""
        # name starts and ends on word boundary
        # no '(' or commas in the name
        (?P<name>\b[^(,]+\b)
        \s*
        # everything inside parentheses is a role
        (?:\(
          (?P<role>[^)]+)
        \))?  # role is optional
        """, re.VERBOSE)

    s = ("Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley,"
         "Jane Doe (Jane Doe)")
    print(re.findall(regex, s))

Output:

    [('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'), ('Elvis Presley', ''), ('Jane Doe', 'Jane Doe')]

Question 5

My answer does not use a regex. I think a simple character scanner with an "in_actor_name" state should work. Remember that the "in_actor_name" state is terminated either by ')' or by a comma seen while in that state.

My try:

    s = 'Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)'

    in_actor_name = True
    role = ''
    name = ''
    for c in s:
        if c == ')' or (c == ',' and in_actor_name):
            # End of an entry: emit it and reset the state.
            in_actor_name = True
            name = name.strip()
            if name:
                print("%s: %s" % (name, role))
            name = ''
            role = ''
        elif c == '(':
            # Everything up to the matching ')' is the role.
            in_actor_name = False
        else:
            if in_actor_name:
                name += c
            else:
                role += c
    # Emit a trailing name that has no role and no final comma.
    if name:
        print("%s: %s" % (name, role))

Output:

    Wilbur Smith: Billy, son of John
    Eddie Murphy: John
    Elvis Presley:
    Jane Doe: Jane Doe

Question 6

Here is a general technique I have used in the past for cases like this. Use the sub function of the re module with a function as the replacement argument. The function keeps track of opening and closing parens, brackets and braces, as well as single and double quotes, and performs a replacement only outside of such bracketed and quoted substrings. You can then replace the non-bracketed/quoted commas with another character that you are sure does not appear in the string (I use the ASCII/Unicode group separator, chr(29)), and then do a simple string split on that character. Here is the code:

    import re

    def srchrepl(srch, repl, string):
        """Replace non-bracketed/quoted occurrences of srch with repl in string"""
        resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                                + srch + r"""])|(?P<rbrkt>[)\]}])""")
        return resrchrepl.sub(_subfact(repl), string)

    def _subfact(repl):
        """Replacement function factory for regex sub method in srchrepl."""
        level = 0
        qtflags = 0
        def subf(mo):
            nonlocal level, qtflags
            sepfound = mo.group('sep')
            if sepfound:
                if level == 0 and qtflags == 0:
                    return repl
                else:
                    return mo.group(0)
            elif mo.group('lbrkt'):
                level += 1
                return mo.group(0)
            elif mo.group('quote') == "'":
                qtflags ^= 1            # toggle bit 1
                return "'"
            elif mo.group('quote') == '"':
                qtflags ^= 2            # toggle bit 2
                return '"'
            elif mo.group('rbrkt'):
                level -= 1
                return mo.group(0)
        return subf

If your version of Python does not have nonlocal, just change it to global and define level and qtflags at module level.

Here's how it's used:

    >>> GRPSEP = chr(29)
    >>> string = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
    >>> lst = srchrepl(',', GRPSEP, string).split(GRPSEP)
    >>> lst
    ['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']

Question 7

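This thread was very helpful to me. I was looking to split a string on commas located outside quotes, and I used this as a starting point. My final line of code was regEx = re.compile(r'(?:[^,"]|"[^"]*")+'). That did the trick. Thanks a ton.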
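For reference, a small sketch of how that quote-aware pattern behaves. The sample input `'a, "b, c", d'` is made up for illustration and is not from the thread.

```python
import re

# Same idea as the parenthesis version: a field is a run of characters that
# are neither a comma nor a quote, or a complete "..." quoted section.
regEx = re.compile(r'(?:[^,"]|"[^"]*")+')

sample = 'a, "b, c", d'  # hypothetical input
parts = [p.strip() for p in regEx.findall(sample)]
print(parts)
# ['a', '"b, c"', 'd']
```

As with the first answer, adjacent commas collapse into one separator, and an unterminated quote is not handled.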

Question 8

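I certainly agree with @Wogan above that using the csv module is a good approach. Having said that, if you still want to try a regex solution, give this a try, but you will have to adapt it to the Python dialect:

    string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)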
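One possible Python adaptation of that expression, offered as a sketch rather than as the answerer's own code: the pattern is passed to `re.split` as a raw string, so the quotes no longer need escaping. The constant name and the sample input are assumptions for illustration.

```python
import re

# Split on a comma only if the rest of the string consists of complete
# "..." pairs followed by text with no further quote, i.e. the comma
# itself is not inside quotes.
COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

line = 'a,"b,c",d'  # hypothetical input
pieces = COMMA_OUTSIDE_QUOTES.split(line)
print(pieces)
# ['a', '"b,c"', 'd']
```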

Question 9

Split by ")":

    >>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
    >>> s.split(")")
    ['Wilbur Smith (Billy, son of John', ', Eddie Murphy (John', ', Elvis Presley, Jane Doe (Jane Doe', '']
    >>> for i in s.split(")"):
    ...     print(i.split("("))
    ...
    ['Wilbur Smith ', 'Billy, son of John']
    [', Eddie Murphy ', 'John']
    [', Elvis Presley, Jane Doe ', 'Jane Doe']
    ['']

You can do further checks to get the names that do not come with ().
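The "further checks" could look something like the sketch below. The function name `parse_actors` is hypothetical, and the sketch assumes parentheses are never nested and that a role never contains ")".

```python
def parse_actors(text):
    """Return (name, role) pairs; names without parentheses get an empty role.

    Hypothetical helper written for illustration only.
    """
    result = []
    for chunk in text.split(")"):
        # Everything before "(" holds one or more names; after it is the role.
        head, _, role = chunk.partition("(")
        names = [n.strip() for n in head.split(",") if n.strip()]
        # Only the last name in the chunk owns the role that follows it.
        for name in names[:-1]:
            result.append((name, ""))
        if names:
            result.append((names[-1], role))
    return result

s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
for name, role in parse_actors(s):
    print(name, "->", role)
```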

Question 10

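None of the above answers is correct if there are any errors or noise in your data. It is easy to come up with a good solution if you know the data is correct every time. But what happens if there are formatting errors? What do you want to happen?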
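One possible answer to that question, assuming you would rather fail loudly than split malformed input silently, is to validate the parentheses before parsing. The function name `check_balanced` is made up for this sketch, and it checks only parenthesis balance, nothing else.

```python
def check_balanced(text):
    """Raise ValueError if the parentheses in text do not pair up.

    Hypothetical pre-check; whether to raise, skip, or repair is a design choice.
    """
    depth = 0
    for pos, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ')' at position %d" % pos)
    if depth:
        raise ValueError("%d unclosed '('" % depth)

check_balanced("Wilbur Smith (Billy, son of John), Elvis Presley")  # passes silently
```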