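NLP word vectors: training word vectors with word2vec on the 20-category news text dataset and testing them (finding the words related to a given word)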

2024-08-03 07:05:29

Output

The 10 words in the training text most related to morning:
[('afternoon', 0.8329864144325256), ('weekend', 0.7690818309783936), ('evening', 0.7469204068183899),
('saturday', 0.7191835045814514), ('night', 0.7091601490974426), ('friday', 0.6764787435531616),
('sunday', 0.6380082368850708), ('newspaper', 0.6365975737571716), ('summer', 0.6268560290336609),
('season', 0.6137701272964478)]

The 10 words in the training text most related to email:
[('mail', 0.7432783842086792), ('contact', 0.6995242834091187), ('address', 0.6547545194625854),
('replies', 0.6502780318260193), ('mailed', 0.6334187388420105), ('request', 0.6262195110321045),
('sas', 0.6220622658729553), ('send', 0.6207413077354431), ('listserv', 0.617364227771759),
('compuserve', 0.5954489707946777)]
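
Each pair above is a word and its cosine similarity to the query word, sorted from most to least similar. The following is a minimal pure-Python sketch of that ranking. The `cosine` and `most_similar` helpers and the three-dimensional toy vectors are made up for illustration; they are not the trained vectors and not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=10):
    """Rank every other word by cosine similarity to `query` (hypothetical helper)."""
    q = vectors[query]
    scored = [(word, cosine(q, vec)) for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors invented for this sketch; real word2vec vectors have ~100 dimensions.
toy_vectors = {
    "morning": [0.9, 0.1, 0.0],
    "afternoon": [0.8, 0.2, 0.1],
    "email": [0.0, 0.9, 0.4],
    "mail": [0.1, 0.8, 0.5],
}

print(most_similar(toy_vectors, "morning", topn=2))
```

With these toy values, `afternoon` ranks first for `morning`, mirroring the shape of the real output: a list of `(word, score)` tuples in descending order of similarity.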

Design approach
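
The post does not spell out the pipeline. Since `Word2Vec` expects `sentences` as an iterable of token lists, one plausible first step is to turn each raw news text into lowercase token lists, one per sentence. The sketch below assumes simple regex-based splitting; `text_to_sentences` is a hypothetical helper, not the author's code.

```python
import re


def text_to_sentences(text):
    """Split raw text into sentences, each a list of lowercase alphabetic tokens."""
    sentences = []
    # Assumption: '.', '!' and '?' end a sentence; real news text may need a proper tokenizer.
    for raw in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+", raw.lower())
        if tokens:  # skip empty fragments left by trailing punctuation
            sentences.append(tokens)
    return sentences


sample = "Send me an email this morning. I will reply in the afternoon!"
sentences = text_to_sentences(sample)
print(sentences)
```

The resulting list of lists has the shape that the `sentences` parameter of the constructor documents.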

Core code
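
The constructor docstring in the listing that follows describes `ns_exponent` as shaping the negative sampling distribution. As a rough sketch of that description, each word's sampling weight can be taken as its count raised to `ns_exponent` and then normalized. The `negative_sampling_probs` helper and the word counts are invented for illustration; gensim itself builds a cumulative table rather than a dictionary.

```python
def negative_sampling_probs(counts, ns_exponent=0.75):
    """Return each word's sampling probability: count ** ns_exponent, normalized to sum to 1."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical corpus counts, chosen only to show the effect of the exponent.
counts = {"the": 1000, "morning": 100, "listserv": 10}

for exponent in (1.0, 0.75, 0.0, -0.5):
    probs = negative_sampling_probs(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

An exponent of 1.0 reproduces the raw frequencies, 0.0 gives every word the same probability, and a negative exponent makes the rare word the most likely draw, matching the docstring's description.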

class Word2Vec(BaseWordEmbeddingsModel):
    """Train, use and evaluate neural networks described in https://code.google.com/p/word2vec/.

    Once you're finished training a model (=no more updates, only querying)
    store and use only the :class:`~gensim.models.keyedvectors.KeyedVectors` instance in `self.wv` to reduce memory.

    The model can be stored/loaded via its :meth:`~gensim.models.word2vec.Word2Vec.save` and
    :meth:`~gensim.models.word2vec.Word2Vec.load` methods.

    The trained word vectors can also be stored/loaded from a format compatible with the
    original word2vec implementation via `self.wv.save_word2vec_format`
    and :meth:`gensim.models.keyedvectors.KeyedVectors.load_word2vec_format`.

    Some important attributes are the following:

    Attributes
    ----------
    wv : :class:`~gensim.models.keyedvectors.Word2VecKeyedVectors`
        This object essentially contains the mapping between words and embeddings. After training, it can be used
        directly to query those embeddings in various ways. See the module level docstring for examples.

    vocabulary : :class:`~gensim.models.word2vec.Word2VecVocab`
        This object represents the vocabulary (sometimes called Dictionary in gensim) of the model.
        Besides keeping track of all unique words, this object provides extra functionality, such as
        constructing a huffman tree (frequent words are closer to the root), or discarding extremely rare words.

    trainables : :class:`~gensim.models.word2vec.Word2VecTrainables`
        This object represents the inner shallow neural network used to train the embeddings. The semantics of the
        network differ slightly in the two available training modes (CBOW or SG) but you can think of it as a NN with
        a single projection and hidden layer which we train on the corpus. The weights are then used as our embeddings
        (which means that the size of the hidden layer is equal to the number of features `self.size`).

    """
    def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                 max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
                 sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
                 max_final_vocab=None):
        """

        Parameters
        ----------
        sentences : iterable of iterables, optional
            The `sentences` iterable can be simply a list of lists of tokens, but for larger corpora,
            consider an iterable that streams the sentences directly from disk/network.
            See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
            or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples.
            See also the `tutorial on data streaming in Python
            <https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/>`_.
            If you don't supply `sentences`, the model is left uninitialized -- use if you plan to initialize it
            in some other way.
        size : int, optional
            Dimensionality of the word vectors.
        window : int, optional
            Maximum distance between the current and predicted word within a sentence.
        min_count : int, optional
            Ignores all words with total frequency lower than this.
        workers : int, optional
            Use these many worker threads to train the model (=faster training with multicore machines).
        sg : {0, 1}, optional
            Training algorithm: 1 for skip-gram; otherwise CBOW.
        hs : {0, 1}, optional
            If 1, hierarchical softmax will be used for model training.
            If 0, and `negative` is non-zero, negative sampling will be used.
        negative : int, optional
            If > 0, negative sampling will be used, the int for negative specifies how many "noise words"
            should be drawn (usually between 5-20).
            If set to 0, no negative sampling is used.
        ns_exponent : float, optional
            The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion
            to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more
            than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper.
            More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that
            other values may perform better for recommendation applications.
        cbow_mean : {0, 1}, optional
            If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
        alpha : float, optional
            The initial learning rate.
        min_alpha : float, optional
            Learning rate will linearly drop to `min_alpha` as training progresses.
        seed : int, optional
            Seed for the random number generator. Initial vectors for each word are seeded with a hash of
            the concatenation of word + `str(seed)`. Note that for a fully deterministically-reproducible run,
            you must also limit the model to a single worker thread (`workers=1`), to eliminate ordering jitter
            from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires
            use of the `PYTHONHASHSEED` environment variable to control hash randomization).
        max_vocab_size : int, optional
            Limits the RAM during vocabulary building; if there are more unique
            words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM.
            Set to `None` for no limit.
        max_final_vocab : int, optional
            Limits the vocab to a target vocab size by automatically picking a matching min_count. If the specified
            min_count is more than the calculated min_count, the specified min_count will be used.
            Set to `None` if not required.
        sample : float, optional
            The threshold for configuring which higher-frequency words are randomly downsampled,
            useful range is (0, 1e-5).
        hashfxn : function, optional
            Hash function to use to randomly initialize weights, for increased training reproducibility.
        iter : int, optional
            Number of iterations (epochs) over the corpus.
        trim_rule : function, optional
            Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary,
            be trimmed away, or handled using the default (discard if word count < min_count).
            Can be None (min_count will be used, look to :func:`~gensim.utils.keep_vocab_item`),
            or a callable that accepts parameters (word, count, min_count) and returns either
            :attr:`gensim.utils.RULE_DISCARD`, :attr:`gensim.utils.RULE_KEEP` or :attr:`gensim.utils.RULE_DEFAULT`.
            The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the
            model.

            The input parameters are of the following types:
                * `word` (str) - the word we are examining
                * `count` (int) - the word's frequency count in the corpus
                * `min_count` (int) - the minimum count threshold.

        sorted_vocab : {0, 1}, optional
            If 1, sort the vocabulary by descending frequency before assigning word indexes.
            See :meth:`~gensim.models.word2vec.Word2VecVocab.sort_vocab()`.
        batch_words : int, optional
            Target size (in words) for batches of examples passed to worker threads (and
            thus cython routines). (Larger batches will be passed if individual
            texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
        compute_loss: bool, optional
            If True, computes and stores loss value which can be retrieved using
            :meth:`~gensim.models.word2vec.Word2Vec.get_latest_training_loss`.
        callbacks : iterable of :class:`~gensim.models.callbacks.CallbackAny2Vec`, optional
            Sequence of callbacks to be executed at specific stages during training.

        Examples
        --------
        Initialize and train a :class:`~gensim.models.word2vec.Word2Vec` model

        >>> from gensim.models import Word2Vec
        >>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
        >>> model = Word2Vec(sentences, min_count=1)

        """
        self.max_final_vocab = max_final_vocab
        self.callbacks = callbacks
        self.load = call_on_class_only
        self.wv = Word2VecKeyedVectors(size)
        self.vocabulary = Word2VecVocab(
            max_vocab_size=max_vocab_size, min_count=min_count, sample=sample, sorted_vocab=bool(sorted_vocab),
            null_word=null_word, max_final_vocab=max_final_vocab, ns_exponent=ns_exponent)
        self.trainables = Word2VecTrainables(seed=seed, vector_size=size, hashfxn=hashfxn)
        super(Word2Vec, self).__init__(
            sentences=sentences, workers=workers, vector_size=size, epochs=iter, callbacks=callbacks,
            batch_words=batch_words, trim_rule=trim_rule, sg=sg, alpha=alpha, window=window, seed=seed,
            hs=hs, negative=negative, cbow_mean=cbow_mean, min_alpha=min_alpha, compute_loss=compute_loss,
            fast_version=FAST_VERSION)

    def _do_train_job(self, sentences, alpha, inits):
        """Train the model on a single batch of sentences.

        Parameters
        ----------
        sentences : iterable of list of str
            Corpus chunk to be used in this training batch.
        alpha : float
            The learning rate used in this batch.
        inits : (np.ndarray, np.ndarray)
            Each worker thread's private work memory.

        Returns
        -------
        (int, int)
            2-tuple (effective word count after ignoring unknown words and sentence length trimming, total word count).

        """
        work, neu1 = inits
        tally = 0
        if self.sg:
            tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss)
        else:
            tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
        return tally, self._raw_word_count(sentences)

    def _clear_post_train(self):
        """Remove all L2-normalized word vectors from the model."""
        self.wv.vectors_norm = None

    def _set_train_params(self, **kwargs):
        if 'compute_loss' in kwargs:
            self.compute_loss = kwargs['compute_loss']
        self.running_training_loss = 0

    def train(self, sentences, total_examples=None, total_words=None,
        epochs=None, start_alpha=None, end_alpha=None, word_count=0,
        queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=()):
        """Update the model's neural weights from a sequence of sentences.

        Notes
        -----
        To support linear learning-rate decay from (initial) `alpha` to `min_alpha`, and accurate
        progress-percentage logging, either `total_examples` (count of sentences) or `total_words` (count of
        raw words in sentences) **MUST** be provided. If `sentences` is the same corpus
        that was provided to :meth:`~gensim.models.word2vec.Word2Vec.build_vocab` earlier,
        you can simply use `total_examples=self.corpus_count`.

        Warnings
        --------
        To avoid common mistakes around the model's ability to do multiple training passes itself, an
        explicit `epochs` argument **MUST** be provided. In the common and recommended case
        where :meth:`~gensim.models.word2vec.Word2Vec.train` is only called once, you can set `epochs=self.iter`.

        Parameters
        ----------
        sentences : iterable of list of str
            The `sentences` iterable can be simply a list of lists of tokens, but for larger corpora,
            consider an iterable that streams the sentences directly from disk/network.
            See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
            or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples.
            See also the `tutorial on data streaming in Python
            <https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/>`_.
        total_examples : int, optional
            Count of sentences. Used to decay the `alpha` learning rate.
        total_words : int, optional
            Count of raw words in sentences. Used to decay the `alpha` learning rate.
        epochs : int, optional
            Number of iterations (epochs) over the corpus.
        start_alpha : float, optional
            Initial learning rate. If supplied, replaces the starting `alpha` from the constructor,
            for this one call to `train()`.
            Use only if making multiple calls to `train()`, when you want to manage the alpha learning-rate yourself
            (not recommended).
        end_alpha : float, optional
            Final learning rate. Drops linearly from `start_alpha`.
            If supplied, this replaces the final `min_alpha` from the constructor, for this one call to `train()`.
            Use only if making multiple calls to `train()`, when you want to manage the alpha learning-rate yourself
            (not recommended).
        word_count : int, optional
            Count of words already trained. Set this to 0 for the usual
            case of training on all words in sentences.
        queue_factor : int, optional
            Multiplier for size of queue (number of workers * queue_factor).
        report_delay : float, optional
            Seconds to wait before reporting progress.
        compute_loss: bool, optional
            If True, computes and stores loss value which can be retrieved using
            :meth:`~gensim.models.word2vec.Word2Vec.get_latest_training_loss`.
        callbacks : iterable of :class:`~gensim.models.callbacks.CallbackAny2Vec`, optional
            Sequence of callbacks to be executed at specific stages during training.

        Examples
        --------
        >>> from gensim.models import Word2Vec
        >>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
        >>>
        >>> model = Word2Vec(min_count=1)
        >>> model.build_vocab(sentences)  # prepare the model vocabulary
        >>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)  # train word vectors
        (1, 30)

        """
        return super(Word2Vec, self).train(
            sentences, total_examples=total_examples, total_words=total_words,
            epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
            queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)

    def score(self, sentences, total_sentences=int(1e6), chunksize=100, queue_factor=2, report_delay=1):
        """Score the log probability for a sequence of sentences.
        This does not change the fitted model in any way (see :meth:`~gensim.models.word2vec.Word2Vec.train` for that).

        Gensim has currently only implemented score for the hierarchical softmax scheme,
        so you need to have run word2vec with `hs=1` and `negative=0` for this to work.

        Note that you should specify `total_sentences`; you'll run into problems if you ask to
        score more than this number of sentences but it is inefficient to set the value too high.

        See the `article by Matt Taddy: "Document Classification by Inversion of Distributed Language Representations"
        <https://arxiv.org/pdf/1504.07295.pdf>`_ and the
        `gensim demo <https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb>`_ for examples of
        how to use such scores in document classification.

        Parameters
        ----------
        sentences : iterable of list of str
            The `sentences` iterable can be simply a list of lists of tokens, but for larger corpora,
            consider an iterable that streams the sentences directly from disk/network.
            See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
            or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples.
        total_sentences : int, optional
            Count of sentences.
        chunksize : int, optional
            Chunksize of jobs
        queue_factor : int, optional
            Multiplier for size of queue (number of workers * queue_factor).
        report_delay : float, optional
            Seconds to wait before reporting progress.

        """
        if FAST_VERSION < 0:
            warnings.warn("C extension compilation failed, scoring will be slow. "
                "Install a C compiler and reinstall gensim for fastness.")
        logger.info(
            "scoring sentences with %i workers on %i vocabulary and %i features, "
            "using sg=%s hs=%s sample=%s and negative=%s",
            self.workers, len(self.wv.vocab), self.trainables.layer1_size, self.sg, self.hs,
            self.vocabulary.sample, self.negative)
        if not self.wv.vocab:
            raise RuntimeError("you must first build vocabulary before scoring new data")
        if not self.hs:
            raise RuntimeError(
                "We have currently only implemented score for the hierarchical softmax scheme, "
                "so you need to have run word2vec with hs=1 and negative=0 for this to work.")
        def worker_loop():
            """Compute log probability for each sentence, lifting lists of sentences from the jobs queue."""
            work = zeros(1, dtype=REAL)  # for sg hs, we actually only need one memory loc (running sum)
            neu1 = matutils.zeros_aligned(self.trainables.layer1_size, dtype=REAL)
            while True:
                job = job_queue.get()
                if job is None: # signal to finish
                    break
                ns = 0
                for sentence_id, sentence in job:
                    if sentence_id >= total_sentences:
                        break
                    if self.sg:
                        score = score_sentence_sg(self, sentence, work)
                    else:
                        score = score_sentence_cbow(self, sentence, work, neu1)
                    sentence_scores[sentence_id] = score
                    ns += 1

                progress_queue.put(ns) # report progress

        start, next_report = default_timer(), 1.0
        # buffer ahead only a limited number of jobs.. this is the reason we can't simply use ThreadPool :(
        job_queue = Queue(maxsize=queue_factor * self.workers)
        progress_queue = Queue(maxsize=(queue_factor + 1) * self.workers)
        workers = [threading.Thread(target=worker_loop) for _ in xrange(self.workers)]
        for thread in workers:
            thread.daemon = True # make interrupting the process with ctrl+c easier
            thread.start()

        sentence_count = 0
        sentence_scores = matutils.zeros_aligned(total_sentences, dtype=REAL)
        push_done = False
        done_jobs = 0
        jobs_source = enumerate(utils.grouper(enumerate(sentences), chunksize))
        # fill jobs queue with (id, sentence) job items
        while True:
            try:
                job_no, items = next(jobs_source)
                if (job_no - 1) * chunksize > total_sentences:
                    logger.warning(
                        "terminating after %i sentences (set higher total_sentences if you want more).",
                        total_sentences)
                    job_no -= 1
                    raise StopIteration()
                logger.debug("putting job #%i in the queue", job_no)
                job_queue.put(items)
            except StopIteration:
                logger.info("reached end of input; waiting to finish %i outstanding jobs", job_no - done_jobs + 1)
                for _ in xrange(self.workers):
                    job_queue.put(None)  # give the workers heads up that they can finish -- no more work!

                push_done = True
            try:
                while done_jobs < (job_no + 1) or not push_done:
                    ns = progress_queue.get(push_done) # only block after all jobs pushed
                    sentence_count += ns
                    done_jobs += 1
                    elapsed = default_timer() - start
                    if elapsed >= next_report:
                        logger.info(
                            "PROGRESS: at %.2f%% sentences, %.0f sentences/s",
                            100.0 * sentence_count, sentence_count / elapsed)
                        next_report = elapsed + report_delay  # don't flood log, wait report_delay seconds
                else:
                    break # loop ended by job count; really done

            except Empty:
                pass # already out of loop; continue to next push

        elapsed = default_timer() - start
        self.clear_sims()
        logger.info(
            "scoring %i sentences took %.1fs, %.0f sentences/s",
            sentence_count, elapsed, sentence_count / elapsed)
        return sentence_scores[:sentence_count]

    def clear_sims(self):
        """Remove all L2-normalized word vectors from the model, to free up memory.

        You can recompute them later again using the :meth:`~gensim.models.word2vec.Word2Vec.init_sims` method.

        """
        self.wv.vectors_norm = None

    def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict'):
        """Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format,
        where it intersects with the current vocabulary.

        No words are added to the existing vocabulary, but intersecting words adopt the file's weights, and
        non-intersecting words are left alone.

        Parameters
        ----------
        fname : str
            The file path to load the vectors from.
        lockf : float, optional
            Lock-factor value to be set for any imported word-vectors; the
            default value of 0.0 prevents further updating of the vector during subsequent
            training. Use 1.0 to allow further training updates of merged vectors.
        binary : bool, optional
            If True, `fname` is in the binary word2vec C format.
        encoding : str, optional
            Encoding of `text` for `unicode` function (python2 only).
        unicode_errors : str, optional
            Error handling behaviour, used as parameter for `unicode` function (python2 only).

        """
        overlap_count = 0
        logger.info("loading projection weights from %s", fname)
        with utils.smart_open(fname) as fin:
            header = utils.to_unicode(fin.readline(), encoding=encoding)
            vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
            if not vector_size == self.wv.vector_size:
                raise ValueError("incompatible vector size %d in file %s" % (vector_size, fname))
                # TOCONSIDER: maybe mismatched vectors still useful enough to merge (truncating/padding)?
            if binary:
                binary_len = dtype(REAL).itemsize * vector_size
                for _ in xrange(vocab_size): # mixed text and binary: read text first, then binary
                    word = []
                    while True:
                        ch = fin.read(1)
                        if ch == b' ':
                            break
                        if ch != b'\n': # ignore newlines in front of words (some binary files have)
                            word.append(ch)

                    word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
                    weights = fromstring(fin.read(binary_len), dtype=REAL)
                    if word in self.wv.vocab:
                        overlap_count += 1
                        self.wv.vectors[self.wv.vocab[word].index] = weights
                        # lock-factor: 0.0=no changes
                        self.trainables.vectors_lockf[self.wv.vocab[word].index] = lockf

            else:
                for line_no, line in enumerate(fin):
                    parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
                    if len(parts) != vector_size + 1:
                        raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
                    word, weights = parts[0], [REAL(x) for x in parts[1:]]
                    if word in self.wv.vocab:
                        overlap_count += 1
                        self.wv.vectors[self.wv.vocab[word].index] = weights
                        # lock-factor: 0.0=no changes
                        self.trainables.vectors_lockf[self.wv.vocab[word].index] = lockf

        logger.info("merged %d vectors into %s matrix from %s", overlap_count, self.wv.vectors.shape, fname)

    @deprecated("Method will be removed in 4.0.0, use self.wv.__getitem__() instead")
    def __getitem__(self, words):
        """Deprecated. Use `self.wv.__getitem__` instead.
        Refer to the documentation for :meth:`~gensim.models.keyedvectors.Word2VecKeyedVectors.__getitem__`.

        """
        return self.wv.__getitem__(words)

    @deprecated("Method will be removed in 4.0.0, use self.wv.__contains__() instead")
    def __contains__(self, word):
        """Deprecated. Use `self.wv.__contains__` instead.
        Refer to the documentation for :meth:`~gensim.models.keyedvectors.Word2VecKeyedVectors.__contains__`.

        """
        return self.wv.__contains__(word)

    def predict_output_word(self, context_words_list, topn=10):
        """Get the probability distribution of the center word given context words.

        Parameters
        ----------
        context_words_list : list of str
            List of context words.
        topn : int, optional
            Return `topn` words and their probabilities.

        Returns
        -------
        list of (str, float)
            `topn` length list of tuples of (word, probability).

        """
        if not self.negative:
            raise RuntimeError(
                "We have currently only implemented predict_output_word for the negative sampling scheme, "
                "so you need to have run word2vec with negative > 0 for this to work.")
        if not hasattr(self.wv, 'vectors') or not hasattr(self.trainables, 'syn1neg'):
            raise RuntimeError("Parameters required for predicting the output words not found.")
        word_vocabs = [self.wv.vocab[w] for w in context_words_list if w in self.wv.vocab]
        if not word_vocabs:
            warnings.warn("All the input context words are out-of-vocabulary for the current model.")
            return None
        word2_indices = [word.index for word in word_vocabs]
        l1 = np_sum(self.wv.vectors[word2_indices], axis=0)
        if word2_indices and self.cbow_mean:
            l1 /= len(word2_indices)
        # propagate hidden -> output and take softmax to get probabilities
        prob_values = exp(dot(l1, self.trainables.syn1neg.T))
        prob_values /= sum(prob_values)
        # return the most probable output words with their probabilities
        top_indices = matutils.argsort(prob_values, topn=topn, reverse=True)
        return [(self.wv.index2word[index1], prob_values[index1]) for index1 in top_indices]

    def init_sims(self, replace=False):
        """Deprecated. Use `self.wv.init_sims` instead.
        See :meth:`~gensim.models.keyedvectors.Word2VecKeyedVectors.init_sims`.

        """
        if replace and hasattr(self.trainables, 'syn1'):
            del self.trainables.syn1
        return self.wv.init_sims(replace)

    def reset_from(self, other_model):
        """Borrow shareable pre-built structures from `other_model` and reset hidden layer weights.

        Structures copied are:
            * Vocabulary
            * Index to word mapping
            * Cumulative frequency table (used for negative sampling)
            * Cached corpus length

        Useful when testing multiple models on the same corpus in parallel.

        Parameters
        ----------
        other_model : :class:`~gensim.models.word2vec.Word2Vec`
            Another model to copy the internal structures from.

        """
        self.wv.vocab = other_model.wv.vocab
        self.wv.index2word = other_model.wv.index2word
        self.vocabulary.cum_table = other_model.vocabulary.cum_table
        self.corpus_count = other_model.corpus_count
        self.trainables.reset_weights(self.hs, self.negative, self.wv)

    @staticmethod
    def log_accuracy(section):
        """Deprecated. Use `self.wv.log_accuracy` instead.
        See :meth:`~gensim.models.word2vec.Word2VecKeyedVectors.log_accuracy`.

        """
        return Word2VecKeyedVectors.log_accuracy(section)

    @deprecated("Method will be removed in 4.0.0, use self.wv.evaluate_word_analogies() instead")
    def accuracy(self, questions, restrict_vocab=30000, most_similar=None, case_insensitive=True):
        """Deprecated. Use `self.wv.accuracy` instead.
        See :meth:`~gensim.models.word2vec.Word2VecKeyedVectors.accuracy`.

        """
        most_similar = most_similar or Word2VecKeyedVectors.most_similar
        return self.wv.accuracy(questions, restrict_vocab, most_similar, case_insensitive)

    def __str__(self):
        """Human readable representation of the model's state.

        Returns
        -------
        str
            Human readable representation of the model's state, including the vocabulary size, vector size
            and learning rate.

        """
        return "%s(vocab=%s, size=%s, alpha=%s)" % (
            self.__class__.__name__, len(self.wv.index2word), self.wv.vector_size, self.alpha)

    def delete_temporary_training_data(self, replace_word_vectors_with_normalized=False):
        """Discard parameters that are used in training and scoring, to save memory.

        Warnings
        --------
        Use only if you're sure you're done training a model.

        Parameters
        ----------
        replace_word_vectors_with_normalized : bool, optional
            If True, forget the original (not normalized) word vectors and only keep
            the L2-normalized word vectors, to save even more memory.

        """
        if replace_word_vectors_with_normalized:
            self.init_sims(replace=True)
        self._minimize_model()

    def save(self, *args, **kwargs):
        """Save the model.
        This saved model can be loaded again using :func:`~gensim.models.word2vec.Word2Vec.load`,
        which supports online training and getting vectors for vocabulary words.

        Parameters
        ----------
        fname : str
            Path to the file.

        """
        # don't bother storing the cached normalized vectors, recalculable table
        kwargs['ignore'] = kwargs.get('ignore', ['vectors_norm', 'cum_table'])
        super(Word2Vec, self).save(*args, **kwargs)

    def get_latest_training_loss(self):
        """Get current value of the training loss.

        Returns
        -------
        float
            Current training loss.

        """
        return self.running_training_loss

    @deprecated(
        "Method will be removed in 4.0.0, keep just_word_vectors = model.wv "
        "to retain just the KeyedVectors instance")
    def _minimize_model(self, save_syn1=False, save_syn1neg=False, save_vectors_lockf=False):
        if save_syn1 and save_syn1neg and save_vectors_lockf:
            return
        if hasattr(self.trainables, 'syn1') and not save_syn1:
            del self.trainables.syn1
        if hasattr(self.trainables, 'syn1neg') and not save_syn1neg:
            del self.trainables.syn1neg
        if hasattr(self.trainables, 'vectors_lockf') and not save_vectors_lockf:
            del self.trainables.vectors_lockf
        self.model_trimmed_post_training = True

    @classmethod
    def load_word2vec_format(
            cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
            limit=None, datatype=REAL):
        """Deprecated. Use :meth:`gensim.models.KeyedVectors.load_word2vec_format` instead."""
        raise DeprecationWarning(
            "Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.")

    def save_word2vec_format(self, fname, fvocab=None, binary=False):
        """Deprecated. Use `model.wv.save_word2vec_format` instead.
        See :meth:`gensim.models.KeyedVectors.save_word2vec_format`.

        """
        raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")

    @classmethod
    def load(cls, *args, **kwargs):
        """Load a previously saved :class:`~gensim.models.word2vec.Word2Vec` model.

        See Also
        --------
        :meth:`~gensim.models.word2vec.Word2Vec.save`
            Save model.

        Parameters
        ----------
        fname : str
            Path to the saved file.

        Returns
        -------
        :class:`~gensim.models.word2vec.Word2Vec`
            Loaded model.

        """
        try:
            model = super(Word2Vec, cls).load(*args, **kwargs)
            # for backward compatibility for `max_final_vocab` feature
            if not hasattr(model, 'max_final_vocab'):
                model.max_final_vocab = None
                model.vocabulary.max_final_vocab = None
            return model
        except AttributeError:
            logger.info(
                'Model saved using code from earlier Gensim Version. '
                'Re-loading old model in a compatible way.')
            from gensim.models.deprecated.word2vec import load_old_word2vec
            return load_old_word2vec(*args, **kwargs)
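
The `save` and `load` methods above rely on two small patterns: `save` skips attributes that can be recomputed later (`ignore` defaults to `vectors_norm` and `cum_table`), and `load` backfills attributes that files written by older versions lack (`max_final_vocab`). The sketch below reproduces both patterns using only the standard library. `TinyModel`, its fields, and `roundtrip_demo` are hypothetical names made up for illustration and are not part of gensim; gensim's own persistence layer is more involved than this simplified pickle code.

```python
import os
import pickle
import tempfile


class TinyModel:
    """Hypothetical stand-in for a trained model; NOT a gensim class."""

    def __init__(self, vectors):
        self.vectors = vectors          # the state that must be persisted
        self.vectors_norm = None        # cache that can be recomputed from `vectors`
        self.max_final_vocab = None     # attribute assumed to exist only in newer versions

    def save(self, fname, ignore=("vectors_norm",)):
        # Persist everything except attributes listed in `ignore`.
        state = {k: v for k, v in self.__dict__.items() if k not in ignore}
        with open(fname, "wb") as fout:
            pickle.dump(state, fout)

    @classmethod
    def load(cls, fname):
        with open(fname, "rb") as fin:
            state = pickle.load(fin)
        model = cls.__new__(cls)        # bypass __init__ and restore the saved state
        model.__dict__.update(state)
        # Backfill attributes that an older or trimmed file does not contain.
        if not hasattr(model, "max_final_vocab"):
            model.max_final_vocab = None
        if not hasattr(model, "vectors_norm"):
            model.vectors_norm = None
        return model


def roundtrip_demo():
    """Save a model that mimics an 'old' file, then load it back."""
    model = TinyModel({"morning": [0.1, 0.2]})
    model.vectors_norm = {"morning": [0.447, 0.894]}  # cached values, dropped on save
    del model.max_final_vocab                         # simulate a file from an older version
    path = os.path.join(tempfile.mkdtemp(), "tiny.model")
    model.save(path)
    return TinyModel.load(path)
```

After `roundtrip_demo()`, the loaded object keeps its `vectors`, while `vectors_norm` and `max_final_vocab` come back as `None`: the first because it was deliberately left out of the file, the second because `load` filled it in. This keeps saved files smaller and lets newer code open older files, at the cost of recomputing the cache on first use.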
