Azure Text Analytics client library for Python

Create a Cognitive Services or Language service resource

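The Language service supports both multi-service and single-service access. Create a Cognitive Services resource if you plan to access multiple cognitive services under a single endpoint/key. For Language service access only, create a Language service resource.
You can create the resource using the Azure Portal (option 1) or the Azure CLI (option 2). Below is an example of how to create a Language service resource using the CLI: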
# Create a new resource group to hold the Language service resource -
# if using an existing resource group, skip this step
az group create --name my-resource-group --location westus2
# Create text analytics
az cognitiveservices account create \
    --name text-analytics-resource \
    --resource-group my-resource-group \
    --kind TextAnalytics \
    --sku F0 \
    --location westus2 \
    --yes
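Interaction with the service using the client library begins with a client.
To create a client object, you need the Cognitive Services or Language service endpoint of your resource and a credential that allows you access: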
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
text_analytics_client = TextAnalyticsClient(endpoint="https://<resource-name>.cognitiveservices.azure.com/", credential=credential)
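Note that for some Cognitive Services resources the endpoint might look different from the code snippet above.
For example, https://<region>.api.cognitive.microsoft.com/.
Install the Azure Text Analytics client library for Python with pip: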
pip install azure-ai-textanalytics
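Note that 5.2.X targets the Azure Cognitive Service for Language APIs. These APIs include the text analysis and natural language processing features found in previous versions of the Text Analytics client library.
In addition, the service API has changed from semantic to date-based versioning. This version of the client library defaults to the latest supported API version, which currently is 2022-05-01.
This table shows the relationship between SDK versions and the API versions of the service that they support:
SDK version
Supported API version of service
The API version can be selected by passing the api_version keyword argument into the client.
For the latest Language service features, consider selecting the most recent beta API version. For production scenarios, the latest stable version is recommended. Setting an older version may result in reduced feature compatibility.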
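As a minimal sketch of pinning the version, the helper below passes api_version when constructing the client. The helper name create_client is hypothetical; the api_version keyword and the 2022-05-01 value come from the text above. The SDK imports sit inside the function only so that the helper can be defined before the package is installed.

```python
def create_client(endpoint: str, api_key: str, api_version: str = "2022-05-01"):
    """Build a TextAnalyticsClient pinned to an explicit service API version."""
    # Imported lazily so that defining this helper does not require the SDK.
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.textanalytics import TextAnalyticsClient

    return TextAnalyticsClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(api_key),
        api_version=api_version,  # omit this argument to use the library default
    )
```

Pinning the version explicitly keeps behavior stable when the library default moves to a newer API version after an upgrade.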
Authenticate the client
Get the endpoint
You can find the endpoint for your Language service resource using the Azure Portal or the Azure CLI:
# Get the endpoint for the Language service resource
az cognitiveservices account show --name "resource-name" --resource-group "resource-group-name" --query "properties.endpoint"
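Get the API key
You can get the API key from the Cognitive Services or Language service resource in the Azure Portal.
Alternatively, you can use the Azure CLI snippet below to get the API key of your resource.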
az cognitiveservices account keys list --name "resource-name" --resource-group "resource-group-name"
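Create a TextAnalyticsClient with an API key credential
Once you have the value of the API key, you can pass it as a string into an instance of AzureKeyCredential. Use the key as the credential parameter to authenticate the client: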
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
text_analytics_client = TextAnalyticsClient(endpoint="https://<resource-name>.cognitiveservices.azure.com/", credential=credential)
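Create a TextAnalyticsClient with an Azure Active Directory credential
To use an Azure Active Directory (AAD) token credential, provide an instance of the desired credential type obtained from the azure-identity library.
Note that regional endpoints do not support AAD authentication. Create a custom subdomain name for your resource in order to use this type of authentication.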
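One way to catch a regional endpoint before attempting AAD authentication is a simple hostname check. The sketch below is a heuristic only: it assumes the two hostname shapes shown in this document (<resource-name>.cognitiveservices.azure.com and <region>.api.cognitive.microsoft.com), and the function name classify_endpoint is hypothetical. Other hostname shapes are reported as unknown.

```python
from urllib.parse import urlparse

# Hostname suffixes taken from the two endpoint shapes shown in this document.
CUSTOM_SUBDOMAIN_SUFFIX = ".cognitiveservices.azure.com"
REGIONAL_SUFFIX = ".api.cognitive.microsoft.com"


def classify_endpoint(endpoint: str) -> str:
    """Return 'custom-subdomain', 'regional', or 'unknown' for an endpoint URL."""
    host = (urlparse(endpoint).hostname or "").lower()
    if host.endswith(CUSTOM_SUBDOMAIN_SUFFIX):
        return "custom-subdomain"
    if host.endswith(REGIONAL_SUFFIX):
        return "regional"
    return "unknown"


print(classify_endpoint("https://my-resource.cognitiveservices.azure.com/"))
print(classify_endpoint("https://westus2.api.cognitive.microsoft.com/"))
```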
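Authentication with AAD requires some initial setup:
Install azure-identity
Register a new AAD application
Grant access to the Language service by assigning the "Cognitive Services Language Reader" role to your service principal.
After setup, you can choose which type of credential from azure.identity to use.
As an example, DefaultAzureCredential can be used to authenticate the client:
Set the values of the client ID, tenant ID, and client secret of the AAD application as environment variables: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET
Use the returned token credential to authenticate the client: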
from azure.ai.textanalytics import TextAnalyticsClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
text_analytics_client = TextAnalyticsClient(endpoint="https://<resource-name>.cognitiveservices.azure.com/", credential=credential)
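Before relying on the environment-variable setup described above, it can help to confirm that all three variables are present. The sketch below uses only the standard library; the helper name missing_aad_variables is hypothetical. DefaultAzureCredential can also authenticate through other mechanisms, so a missing variable is a hint about this particular setup rather than proof that authentication will fail.

```python
import os

# Variable names taken from the setup steps described above.
REQUIRED_AAD_VARIABLES = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_aad_variables(environ=None):
    """Return the names of required AAD variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_AAD_VARIABLES if not env.get(name)]


# Example with an explicit mapping instead of the real process environment.
print(missing_aad_variables({"AZURE_CLIENT_ID": "example-client-id"}))
```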
TextAnalyticsClient
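The Text Analytics client library provides a TextAnalyticsClient to do analysis on batches of documents.
It provides both synchronous and asynchronous operations to access a specific use of text analysis, such as language detection or key phrase extraction.
A document is a single unit to be analyzed by the predictive models in the Language service.
The input for each operation is passed as a list of documents.
Each document can be passed as a string in the list, for example: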
documents = ["I hated the movie. It was so slow!", "The movie made it into my top ten favorites. What a great movie!"]
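Alternatively, if you want to pass in a per-item document id or language/country_hint, the documents can be passed as a list of DetectLanguageInput or TextDocumentInput objects, or as a dict-like representation of those objects:
documents = [
    {"id": "1", "language": "en", "text": "I hated the movie. It was so slow!"},
    {"id": "2", "language": "en", "text": "The movie made it into my top ten favorites. What a great movie!"},
]
See the service limitations for the input, including document length limits, maximum batch size, and supported text encoding.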
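The dict form and the batch-size limit can be combined in a small preprocessing step. The sketch below builds dict-like documents from plain strings and splits them into batches. Both helper names are hypothetical, and max_batch_size is a caller-supplied assumption: the actual limit depends on the operation and should be taken from the service limitations.

```python
def to_document_dicts(texts, language="en"):
    """Wrap plain strings in the dict-like document form with sequential ids."""
    return [
        {"id": str(index), "language": language, "text": text}
        for index, text in enumerate(texts, start=1)
    ]


def batched(documents, max_batch_size):
    """Yield consecutive slices of at most max_batch_size documents."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(documents), max_batch_size):
        yield documents[start:start + max_batch_size]


documents = to_document_dicts(["First review.", "Second review.", "Third review."])
for batch in batched(documents, max_batch_size=2):
    print([doc["id"] for doc in batch])
```

Each batch can then be passed to a client operation on its own, and the explicit ids make it possible to relate results back to the original texts across batches.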
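The return value for a single document can be a result or an error object.
Each operation returns a heterogeneous list containing a collection of result and error objects.
These results/errors are index-matched with the order of the provided documents.
A result, such as AnalyzeSentimentResult, is the result of a text analysis operation and contains a prediction or predictions about a document input.
The error object, DocumentError, indicates that the service had trouble processing the document and contains the reason it was unsuccessful.
Document error handling
You can filter for a result or error object in the list by using the is_error attribute. For a result object this is always False, and for a DocumentError it is True.
For example, to filter out all DocumentErrors you might use a list comprehension: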
response = text_analytics_client.analyze_sentiment(documents)
successful_responses = [doc for doc in response if not doc.is_error]
You can also use the kind attribute to filter between result types:
poller = text_analytics_client.begin_analyze_actions(documents, actions)
response = poller.result()
for result in response:
    if result.kind == "SentimentAnalysis":
        print(f"Sentiment is {result.sentiment}")
    elif result.kind == "KeyPhraseExtraction":
        print(f"Key phrases: {result.key_phrases}")
    elif result.is_error is True:
        print(f"Document error: {result.code}, {result.message}")
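Because results are index-matched with the input, errors can be traced back to the documents that caused them by pairing the two lists before filtering. The sketch below shows that pairing. It uses stand-in objects that only mimic the is_error, code, and message attributes, with made-up values; the helper name split_results is hypothetical.

```python
from types import SimpleNamespace


def split_results(documents, response):
    """Pair each input document with its result and separate successes from errors."""
    successes, failures = [], []
    for document, result in zip(documents, response):
        target = failures if result.is_error else successes
        target.append((document, result))
    return successes, failures


# Stand-in objects; a real response would come from a client call such as
# analyze_sentiment(documents).
documents = ["I hated the movie.", ""]
response = [
    SimpleNamespace(is_error=False, sentiment="negative"),
    SimpleNamespace(is_error=True, code="ExampleCode", message="example message"),
]

successes, failures = split_results(documents, response)
for document, error in failures:
    print(f"Failed document {document!r}: {error.code}, {error.message}")
```

Pairing before filtering matters: once errors are removed from the response, positions in the filtered list no longer line up with positions in the original documents list.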
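Long-running operations
Long-running operations consist of an initial request sent to the service to start an operation, followed by polling the service at intervals to determine whether the operation has completed or failed and, if it has succeeded, to get the result.
Methods that support healthcare analysis, custom text analysis, or multiple analyses are modeled as long-running operations.
The client exposes a begin_<method-name> method that returns a poller object. Callers should wait for the operation to complete by calling result() on the poller object returned from the begin_<method-name> method.
Sample code snippets that illustrate using long-running operations are provided in the examples that follow.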
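The start-then-poll pattern described above can be illustrated without the SDK. The sketch below is conceptual only: FakeOperation and wait_for_result are hypothetical stand-ins, not part of the client library, and the status strings are assumptions. With the real client, calling result() on the returned poller takes care of this loop.

```python
import time


class FakeOperation:
    """Hypothetical stand-in for a service-side operation; not part of the SDK."""

    def __init__(self, polls_until_done, result):
        self._remaining = polls_until_done
        self._result = result

    def status(self):
        # Report "running" for the first few polls, then "succeeded".
        if self._remaining > 0:
            self._remaining -= 1
            return "running"
        return "succeeded"

    def get_result(self):
        return self._result


def wait_for_result(operation, interval_seconds=0.01, max_polls=100):
    """Poll at fixed intervals until the operation succeeds, fails, or times out."""
    for _ in range(max_polls):
        status = operation.status()
        if status == "succeeded":
            return operation.get_result()
        if status == "failed":
            raise RuntimeError("operation failed")
        time.sleep(interval_seconds)
    raise TimeoutError(f"operation still running after {max_polls} polls")


print(wait_for_result(FakeOperation(polls_until_done=3, result="analysis complete")))
```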
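The following section provides several code snippets covering some of the most common Language service tasks, including:
Analyze sentiment
Recognize entities
Recognize linked entities
Recognize PII entities
Extract key phrases
Detect language
Healthcare entities analysis
Multiple analysis
Custom entity recognition
Custom single label classification
Custom multi label classification
Analyze sentiment
analyze_sentiment looks at its input text and determines whether its sentiment is positive, negative, neutral, or mixed. Its response includes per-sentence sentiment analysis and confidence scores.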
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = [
    "I did not like the restaurant. The food was somehow both too spicy and underseasoned. Additionally, I thought the location was too far away from the playhouse.",
    "The restaurant was decorated beautifully. The atmosphere was unlike any other restaurant I've been to.",
    "The food was yummy. :)"
]
response = text_analytics_client.analyze_sentiment(documents, language="en")
result = [doc for doc in response if not doc.is_error]
for doc in result:
    print(f"Overall sentiment: {doc.sentiment}")
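    print(
        f"Scores: positive={doc.confidence_scores.positive}; "
        f"neutral={doc.confidence_scores.neutral}; "
        f"negative={doc.confidence_scores.negative}\n"
    )
The returned response is a heterogeneous list of result and error objects: list[AnalyzeSentimentResult, DocumentError]
Please refer to the service documentation for a conceptual discussion of sentiment analysis. To see how to conduct more granular analysis of the opinions related to individual aspects of a text (such as attributes of a product or service), see the opinion mining sample listed under More sample code.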
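The example above prints only document-level sentiment. As a minimal sketch of reading the per-sentence breakdown, the helper below iterates over the sentences attribute of a successful result and returns each sentence's text and sentiment. The helper name sentence_breakdown is hypothetical, and the example input is a stand-in object with made-up values rather than a real service response.

```python
from types import SimpleNamespace


def sentence_breakdown(doc):
    """Return (sentence text, sentiment) pairs for one successful sentiment result."""
    return [(sentence.text, sentence.sentiment) for sentence in doc.sentences]


# Stand-in for a single result; a real one would come from analyze_sentiment.
example_doc = SimpleNamespace(
    sentiment="mixed",
    sentences=[
        SimpleNamespace(text="The food was yummy.", sentiment="positive"),
        SimpleNamespace(text="The location was too far away.", sentiment="negative"),
    ],
)

for text, sentiment in sentence_breakdown(example_doc):
    print(f"{sentiment:>8}: {text}")
```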
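Recognize entities
recognize_entities recognizes and categorizes entities in its input text as people, places, organizations, date/time, quantities, percentages, currencies, and more.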
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = [
    """
    Microsoft was founded by Bill Gates and Paul Allen. Its headquarters are located in Redmond. Redmond is a
    city in King County, Washington, United States, located 15 miles east of Seattle.
    """,
    "Jeff bought three dozen eggs because there was a 50% discount."
]
response = text_analytics_client.recognize_entities(documents, language="en")
result = [doc for doc in response if not doc.is_error]
for doc in result:
    for entity in doc.entities:
        print(f"Entity: {entity.text}")
        print(f"...Category: {entity.category}")
        print(f"...Confidence Score: {entity.confidence_score}")
        print(f"...Offset: {entity.offset}")
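The returned response is a heterogeneous list of result and error objects: list[RecognizeEntitiesResult, DocumentError]
Please refer to the service documentation for a conceptual discussion of named entity recognition and the supported types.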
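Since each recognized entity carries a category, a common follow-up step is to group entity text by category. The sketch below assumes only the text and category attributes used in the example above; the helper name group_by_category is hypothetical, and the sample entities and category labels are made-up stand-ins rather than real service output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_category(entities):
    """Map each category to the list of entity texts found for it, in input order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.category].append(entity.text)
    return dict(grouped)


# Stand-in entities; real ones would come from doc.entities on a result.
entities = [
    SimpleNamespace(text="Microsoft", category="Organization"),
    SimpleNamespace(text="Bill Gates", category="Person"),
    SimpleNamespace(text="Paul Allen", category="Person"),
]

print(group_by_category(entities))
```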
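Recognize linked entities
recognize_linked_entities recognizes and disambiguates the identity of each entity found in its input text (for example, determining whether an occurrence of the word Mars refers to the planet or to the Roman god of war). Recognized entities are associated with URLs to a well-known knowledge base, such as Wikipedia.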
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = [
    "Microsoft was founded by Bill Gates and Paul Allen. Its headquarters are located in Redmond.",
    "Easter Island, a Chilean territory, is a remote volcanic island in Polynesia."
]
response = text_analytics_client.recognize_linked_entities(documents, language="en")
result = [doc for doc in response if not doc.is_error]
for doc in result:
    for entity in doc.entities:
        print(f"Entity: {entity.name}")
        print(f"...URL: {entity.url}")
        print(f"...Data Source: {entity.data_source}")
        print("...Entity matches:")
        for match in entity.matches:
            print(f"......Entity match text: {match.text}")
            print(f"......Confidence Score: {match.confidence_score}")
            print(f"......Offset: {match.offset}")
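The returned response is a heterogeneous list of result and error objects: list[RecognizeLinkedEntitiesResult, DocumentError]
Please refer to the service documentation for a conceptual discussion of entity linking and the supported types.
Recognize PII entities
recognize_pii_entities recognizes and categorizes Personally Identifiable Information (PII) entities in its input text, such as Social Security numbers, bank account information, credit card numbers, and more.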
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = [
    """
    We have an employee called Parker who cleans up after customers. The employee's
    SSN is 859-98-0987, and their phone number is 555-555-5555.
    """
]
response = text_analytics_client.recognize_pii_entities(documents, language="en")
result = [doc for doc in response if not doc.is_error]
for idx, doc in enumerate(result):
    print(f"Document text: {documents[idx]}")
    print(f"Redacted document text: {doc.redacted_text}")
    for entity in doc.entities:
        print(f"...Entity: {entity.text}")
        print(f"......Category: {entity.category}")
        print(f"......Confidence Score: {entity.confidence_score}")
        print(f"......Offset: {entity.offset}")
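The returned response is a heterogeneous list of result and error objects: list[RecognizePiiEntitiesResult, DocumentError]
Please refer to the service documentation for the supported PII entity types.
Note: The Recognize PII Entities service is available in API version v3.1 and newer.
Extract key phrases
extract_key_phrases determines the main talking points in its input text. For example, for the input text "The food was delicious and there were wonderful staff", the API returns "food" and "wonderful staff".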
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = [
    "Redmond is a city in King County, Washington, United States, located 15 miles east of Seattle.",
    """
    I need to take my cat to the veterinarian. He has been sick recently, and I need to take him
    before I travel to South America for the summer.
    """
]
response = text_analytics_client.extract_key_phrases(documents, language="en")
result = [doc for doc in response if not doc.is_error]
for doc in result:
    print(doc.key_phrases)
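The returned response is a heterogeneous list of result and error objects: list[ExtractKeyPhrasesResult, DocumentError]
Please refer to the service documentation for a conceptual discussion of key phrase extraction.
Detect language
detect_language determines the language of its input text, including the confidence score of the predicted language.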
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = [
    """
    This whole document is written in English. In order for the whole document to be written
    in English, every sentence also has to be written in English, which it is.
    """,
    "Il documento scritto in italiano.",
    "Dies ist in deutsche Sprache verfasst."
]
response = text_analytics_client.detect_language(documents)
result = [doc for doc in response if not doc.is_error]
for doc in result:
    print(f"Language detected: {doc.primary_language.name}")
    print(f"ISO6391 name: {doc.primary_language.iso6391_name}")
    print(f"Confidence score: {doc.primary_language.confidence_score}\n")
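The returned response is a heterogeneous list of result and error objects: list[DetectLanguageResult, DocumentError]
Please refer to the service documentation for a conceptual discussion of language detection and of language and regional support.
Healthcare entities analysis
The long-running operation begin_analyze_healthcare_entities extracts entities recognized within the healthcare domain, identifies relationships between entities within the input document, and links them to known sources of information in various well-known databases, such as UMLS, CHV, and MSH.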
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = ["Subject is taking 100mg of ibuprofen twice daily"]
poller = text_analytics_client.begin_analyze_healthcare_entities(documents)
result = poller.result()
docs = [doc for doc in result if not doc.is_error]
print("Results of Healthcare Entities Analysis:")
for idx, doc in enumerate(docs):
    for entity in doc.entities:
        print(f"Entity: {entity.text}")
        print(f"...Normalized Text: {entity.normalized_text}")
        print(f"...Category: {entity.category}")
        print(f"...Subcategory: {entity.subcategory}")
        print(f"...Offset: {entity.offset}")
        print(f"...Confidence score: {entity.confidence_score}")
        if entity.data_sources is not None:
            print("...Data Sources:")
            for data_source in entity.data_sources:
                print(f"......Entity ID: {data_source.entity_id}")
                print(f"......Name: {data_source.name}")
        if entity.assertion is not None:
            print("...Assertion:")
            print(f"......Conditionality: {entity.assertion.conditionality}")
            print(f"......Certainty: {entity.assertion.certainty}")
            print(f"......Association: {entity.assertion.association}")
    for relation in doc.entity_relations:
        print(f"Relation of type: {relation.relation_type} has the following roles")
        for role in relation.roles:
            print(f"...Role '{role.name}' with entity '{role.entity.text}'")
    print("------------------------------------------")
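Note: Healthcare entities analysis is only available with API version v3.1 and newer.
Multiple analysis
The long-running operation begin_analyze_actions performs multiple analyses over one set of documents in a single request. Currently, any combination of the following Language APIs is supported in a single request:
Entities recognition
PII entities recognition
Linked entity recognition
Key phrase extraction
Sentiment analysis
Custom entity recognition
Custom single label classification
Custom multi label classification
Healthcare entities analysis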
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import (
    TextAnalyticsClient,
    RecognizeEntitiesAction,
    AnalyzeSentimentAction,
)
credential = AzureKeyCredential("<api_key>")
endpoint="https://<resource-name>.cognitiveservices.azure.com/"
text_analytics_client = TextAnalyticsClient(endpoint, credential)
documents = ["Microsoft was founded by Bill Gates and Paul Allen."]
poller = text_analytics_client.begin_analyze_actions(
    documents,
    display_name="Sample Text Analysis",
    actions=[
        RecognizeEntitiesAction(),
        AnalyzeSentimentAction()
    ],
)
# returns multiple actions results in the same order as the inputted actions
document_results = poller.result()
for doc, action_results in zip(documents, document_results):
    print(f"\nDocument text: {doc}")
    for result in action_results:
        if result.kind == "EntityRecognition":
            print("...Results of Recognize Entities Action:")
            for entity in result.entities:
                print(f"......Entity: {entity.text}")
                print(f".........Category: {entity.category}")
                print(f".........Confidence Score: {entity.confidence_score}")
                print(f".........Offset: {entity.offset}")
        elif result.kind == "SentimentAnalysis":
            print("...Results of Analyze Sentiment action:")
            print(f"......Overall sentiment: {result.sentiment}")
            print(f"......Scores: positive={result.confidence_scores.positive}; "
                  f"neutral={result.confidence_scores.neutral}; "
                  f"negative={result.confidence_scores.negative}\n")
        elif result.is_error is True:
            print(f"......Is an error with code '{result.code}' "
                  f"and message '{result.message}'")
    print("------------------------------------------")
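The returned response is an object encapsulating multiple iterables, each representing the results of an individual analysis.
Note: Multiple analysis is available in API version v3.1 and newer.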
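Because the response is nested (one list of action results per document), a quick tally by kind can confirm that every action produced a result for every document. The sketch below walks that nested structure using only the kind and is_error attributes shown in the example above. The helper name count_action_results is hypothetical, and the input consists of stand-in objects rather than a real response.

```python
from collections import Counter
from types import SimpleNamespace


def count_action_results(document_results):
    """Count results per kind across all documents, tallying errors separately."""
    counts = Counter()
    for action_results in document_results:
        for result in action_results:
            counts["error" if result.is_error else result.kind] += 1
    return counts


# Stand-in for poller.result(): one inner list of action results per document.
document_results = [
    [
        SimpleNamespace(is_error=False, kind="EntityRecognition"),
        SimpleNamespace(is_error=False, kind="SentimentAnalysis"),
    ],
    [
        SimpleNamespace(is_error=True, kind="DocumentError"),
        SimpleNamespace(is_error=False, kind="SentimentAnalysis"),
    ],
]

print(count_action_results(document_results))
```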
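Optional keyword arguments can be passed in at the client and per-operation level.
The azure-core reference documentation describes the available configurations for retries, logging, transport protocols, and more.
The Text Analytics client raises exceptions defined in Azure Core.
This library uses the standard logging library for logging.
Basic information about HTTP sessions (URLs, headers, etc.) is logged at INFO level.
Detailed DEBUG-level logging, including request/response bodies and unredacted headers, can be enabled on a client with the logging_enable keyword argument: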
import sys
import logging
from azure.identity import DefaultAzureCredential
from azure.ai.textanalytics import TextAnalyticsClient
# Create a logger for the 'azure' SDK
logger = logging.getLogger('azure')
logger.setLevel(logging.DEBUG)
# Configure a console output
handler = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler)
endpoint = "https://<resource-name>.cognitiveservices.azure.com/"
credential = DefaultAzureCredential()
# This client will log detailed information about its HTTP sessions, at DEBUG level
text_analytics_client = TextAnalyticsClient(endpoint, credential, logging_enable=True)
result = text_analytics_client.analyze_sentiment(["I did not like the restaurant. The food was too spicy."])
Similarly, logging_enable can enable detailed logging for a single operation, even when it is not enabled for the client:
result = text_analytics_client.analyze_sentiment(documents, logging_enable=True)
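More sample code
These code samples show common scenario operations with the Azure Text Analytics client library.
Authenticate the client with a Cognitive Services/Language service API key or a token credential from azure-identity:
sample_authentication.py (async version)
Analyze sentiment: sample_analyze_sentiment.py (async version)
Recognize entities: sample_recognize_entities.py (async version)
Recognize personally identifiable information: sample_recognize_pii_entities.py (async version)
Recognize linked entities: sample_recognize_linked_entities.py (async version)
Extract key phrases: sample_extract_key_phrases.py (async version)
Detect language: sample_detect_language.py (async version)
Healthcare entities analysis: sample_analyze_healthcare_entities.py (async version)
Multiple analysis: sample_analyze_actions.py (async version)
Custom entity recognition: sample_recognize_custom_entities.py (async version)
Custom single label classification: sample_single_label_classify.py (async version)
Custom multi label classification: sample_multi_label_classify.py (async version)
Opinion mining: sample_analyze_sentiment_with_opinion_mining.py (async version)
For more extensive documentation on Azure Cognitive Service for Language, see the Language service documentation on docs.microsoft.com.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit cla.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (for example, with a label or comment). Simply follow the instructions provided by the bot. You only need to do this once across all repositories that use the CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ; additional questions or comments can be sent to the project's Code of Conduct contact.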