sklearn.decomposition.LatentDirichletAllocation (scikit-learn Chinese community)

class sklearn.decomposition.LatentDirichletAllocation(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

Latent Dirichlet Allocation with online variational Bayes algorithm.

New in version 0.17.

Read more in the User Guide.

n_components int, optional (default=10)
Number of topics.
Changed in version 0.19: n_topics was renamed to n_components.

doc_topic_prior float, optional (default=None)
Prior of the document-topic distribution theta. If the value is None, it defaults to 1 / n_components. In [1] (Hoffman et al., 2010), this is called alpha.

topic_word_prior float, optional (default=None)
Prior of the topic-word distribution beta. If the value is None, it defaults to 1 / n_components. In [1] (Hoffman et al., 2010), this is called eta.

learning_method 'batch' | 'online', default='batch'
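Method used to update components_. Only used in the fit method. In general, if the data size is large, the online update will be much faster than the batch update.
Valid options:

'batch': Batch variational Bayes method. Use all training data in each EM update. The old components_ will be overwritten in each iteration.
'online': Online variational Bayes method. In each EM update, use a mini-batch of training data to update the components_ variable incrementally. The learning rate is controlled by the learning_decay and learning_offset parameters.

Changed in version 0.20: The default learning method is now 'batch'.

learning_decay float, optional (default=0.7)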
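A parameter that controls the learning rate in the online learning method. The value should be in the interval (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size is n_samples, the update method is the same as batch learning. In the literature, this is called kappa.

learning_offset float, optional (default=10.)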
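A (positive) parameter that downweights early iterations in online learning. It should be greater than 1.0. In the literature, this is called tau_0.

max_iter integer, optional (default=10)
The maximum number of iterations.

batch_size int, optional (default=128)
Number of documents to use in each EM iteration. Only used in online learning.

evaluate_every int, optional (default=-1)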
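How often to evaluate perplexity. Only used in the fit method. Set it to 0 or a negative number to not evaluate perplexity during training at all. Evaluating perplexity can help you check convergence of the training process, but it will also increase the total training time. Evaluating perplexity in every iteration might increase training time up to two-fold.

total_samples int, optional (default=1e6)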
Total number of documents. Only used in the partial_fit method.

perp_tol float, optional (default=1e-1)
Perplexity tolerance in batch learning. Only used when evaluate_every is greater than 0.

mean_change_tol float, optional (default=1e-3)
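Stopping tolerance for updating the document-topic distribution in the E-step.

max_doc_update_iter int (default=100)
Maximum number of iterations for updating the document-topic distribution in the E-step.

n_jobs int or None, optional (default=None)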
The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the Glossary for more details.

verbose int, optional (default=0)
Verbosity level.

random_state int, RandomState instance, default=None
Pass an int for reproducible results across multiple function calls. See the Glossary.

components_ array, [n_components, n_features]
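Variational parameters for the topic-word distribution. Since the complete conditional for the topic-word distribution is a Dirichlet, components_[i, j] can be viewed as a pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as a distribution over the words for each topic after normalization: model.components_ / model.components_.sum(axis=1)[:, np.newaxis].

n_batch_iter_ int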
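Number of iterations of the EM step.

n_iter_ int
Number of passes over the dataset.

bound_ float
Final perplexity score on the training set.

doc_topic_prior_ float
Prior of the document-topic distribution theta. If the value is None, it is 1 / n_components.

topic_word_prior_ float
Prior of the topic-word distribution beta. If the value is None, it is 1 / n_components.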
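
The normalization of components_ described above can be sketched as follows. This is a minimal illustration, assuming the same synthetic count matrix used in the class example; the variable names topic_word and top_words are chosen here for illustration and are not part of the scikit-learn API.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count matrix standing in for CountVectorizer output.
X, _ = make_multilabel_classification(random_state=0)

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# components_ holds pseudocounts with shape (n_components, n_features).
# Dividing each row by its sum gives one word distribution per topic.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# Column indices of the three most probable words for each topic
# (illustrative only; real text would map these back to a vocabulary).
top_words = np.argsort(topic_word, axis=1)[:, ::-1][:, :3]

print(topic_word.shape)
print(top_words)
```

Each row of topic_word should sum to 1, so rows can be compared across topics even though the raw pseudocounts differ in scale.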

References

[1] “Online Learning for Latent Dirichlet Allocation”, Matthew D. Hoffman, David M. Blei, Francis Bach, 2010

[2] “Stochastic Variational Inference”, Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013

[3] Matthew D. Hoffman’s onlineldavb code. Link: https://github.com/blei-lab/onlineldavb

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.datasets import make_multilabel_classification
>>> # This produces a feature matrix of token counts, similar to what
>>> # CountVectorizer would produce on text.
>>> X, _ = make_multilabel_classification(random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5,
...     random_state=0)
>>> lda.fit(X)
LatentDirichletAllocation(...)
>>> # get topics for some given samples:
>>> lda.transform(X[-2:])
array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
       [0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586  ]])
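
In the output above, each row sums to roughly 1, which suggests reading a row as the topic mixture of one document. The sketch below checks this on the full matrix and picks the most probable topic per document; doc_topic, row_sums, and dominant_topic are illustrative names, not library attributes.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# One row per document, one column per topic.
doc_topic = lda.transform(X)

# Row sums should be close to 1 if rows are normalized distributions.
row_sums = doc_topic.sum(axis=1)

# Index of the highest-weight topic for each document.
dominant_topic = doc_topic.argmax(axis=1)

print(doc_topic.shape)
print(row_sums[:3])
print(dominant_topic[:10])
```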

__init__(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

Initialize self. See help(type(self)) for accurate signature.

fit(self, X, y=None)

Learn a model for the data X with the variational Bayes method.

When learning_method is 'online', use mini-batch updates. Otherwise, use batch updates.
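
A minimal sketch of fitting in online mode follows. The helper learning_rate is hypothetical and not part of scikit-learn; it assumes the step-size schedule rho_t = (tau_0 + t) ** (-kappa) from Hoffman et al. (2010), with learning_offset playing the role of tau_0 and learning_decay the role of kappa. The batch_size and max_iter values are arbitrary choices for this small synthetic dataset.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation


def learning_rate(t, tau_0=10.0, kappa=0.7):
    """Hypothetical helper: assumed step size for mini-batch update t."""
    return (tau_0 + t) ** (-kappa)


# Later mini-batches get smaller weights, so early noisy updates fade out.
rates = [learning_rate(t) for t in range(1, 6)]

X, _ = make_multilabel_classification(random_state=0)

online_lda = LatentDirichletAllocation(
    n_components=5,
    learning_method="online",  # mini-batch updates instead of full-batch
    batch_size=20,             # documents per EM update
    learning_decay=0.7,        # kappa
    learning_offset=10.0,      # tau_0
    max_iter=5,                # passes over the data
    random_state=0,
)
online_lda.fit(X)

print(rates)
print(online_lda.components_.shape, online_lda.n_iter_)
```

With the default learning_method='batch', the same call would ignore batch_size, learning_decay, and learning_offset and recompute components_ from all documents in every iteration.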