Poisson Factorization

This is the documentation page for the Python package poismf, which produces approximate non-negative low-rank matrix factorizations of sparse counts matrices by maximizing Poisson likelihood minus a regularization term, the result of which can be used for e.g. implicit-feedback recommender systems or bag-of-words-based topic modeling.

For more information, visit the project’s GitHub page:

https://www.github.com/david-cortes/poismf

For the R version, see the CRAN page:

https://cran.r-project.org/package=poismf

Installation

The Python version of this package can be easily installed from PyPI

pip install poismf

(See the GitHub page for more details)

PoisMF

class poismf.PoisMF(k=50, method='tncg', l2_reg='auto', l1_reg=0.0, niter='auto', maxupd='auto', limit_step=True, initial_step=1e-07, early_stop=True, reuse_prev=False, weight_mult=1.0, random_state=1, reindex=True, copy_data=True, produce_dicts=False, use_float=True, handle_interrupt=True, nthreads=-1, n_jobs=None)[source]

Bases: object

Poisson Matrix Factorization

Fast and memory-efficient model for recommender systems based on Poisson factorization of sparse counts data (e.g. number of times a user played different songs), using gradient-based optimization procedures.

The model idea is to approximate:
\(\mathbf{X} \sim \text{Poisson}(\mathbf{A} \mathbf{B}^T)\)

Note

If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require reindexing if the IDs for users and items meet the following criteria:

  1. Are all integers.
  2. Start at zero.
  3. Don’t have any enumeration gaps, i.e. if there is a user ‘4’, user ‘3’ must also be there.

Note

Although the main idea behind this software is to produce sparse model/factor matrices, they are always taken in dense format when used inside this software, and as such, it might be faster to use these matrices through some other external library that would be able to exploit their sparsity.

Note

When using proximal gradient method, this model is prone to numerical instability, and can turn out to spit all NaNs or zeros in the fitted parameters. The TNCG method is not prone to such failed optimizations.

Parameters:
  • k (int) – Number of latent factors to use (dimensionality of the low-rank factorization). If k is very small (e.g. k=3), it’s recommended to use method='pg', otherwise it’s recommended to use method='tncg', and if using method='cg', it’s recommended to use large k (at least 100).

  • method (bool) –

    Optimization method to use as inner solver. Options are:
    • "tncg" : will use the conjugate gradient method from reference [2]. This is the slowest option, but tends to find better local optima, and if either run for many inner iterations (controlled by maxupd) or reusing previous solutions each time (controlled by reuse_prev), tends to produce sparse latent factor matrices. Note that when reusing previous solutions, fitting times are much faster and the quality of the results as evaluated by ranking-based recommendation quality metrics is almost as good, but solutions tend to be less sparse (see reference [1] for details). Unlike the other two, this solver is extremely unlikely to fail to produce results, and it is thus the recommended one.
    • "cg" : will use the conjugate gradient method from reference [3], which is faster than the one from reference [2], but tends not to reach as good local optima. Usually, with this method and the default hyperparameters, the latent factor matrices will be very sparse, but note that it can fail to produce results (in which case the obtained factors will be very poor quality without warning) when k is small (recommended to use k>=100 when using this solver).
    • "pg" : will use a proximal gradient method, which is a lot faster than the other two and more memory-efficient, but tends to only work with very large regularization values, and doesn’t find as good local optima, nor tends to result in sparse factors. Under this method, top-N recommendations tend to have little variation from one user to another.
  • l2_reg (float) – Strength of L2 regularization. It is recommended to use small values along with method='tncg', very large values along with method='pg', and medium to large values with method='cg'. If passing "auto", will set it to \(10^3\) for TNCG, \(10^4\) for CG, and \(10^9\) for PG.

  • l1_reg (float) – Strength of L1 regularization. Not recommended.

  • niter (int) – Number of outer iterations to perform. One iteration denotes an update over both matrices. If passing 'auto', will set it to 10 for TNCG and PG, or 30 for CG.

    Using more iterations usually leads to better results for CG, at the expense of longer fitting times. TNCG is more likely to converge to a local optimum with fewer outer iterations, with further iterations not changing the values of any single factor.

  • maxupd (int) – Maximum number of inner iterations for each user/item vector within. Note: for ‘method=TNCG’, this means maximum number of function evaluations rather than number of updates, so it should be higher. You might also want to try decreasing this while increasing niter. For method='pg', this will be taken as the actual number of updates, as it does not perform a line search like the other methods. If passing "auto", will set it to 15*k for method='tncg', 5 for method='cg', and 10 for method='pg'. If using method='cg', one might also want to try other combinations such as maxupd=1 and niter=100.

  • limit_step (bool) – When passing method='cg', whether to limit the step sizes in each update so as to drive at most one variable to zero each time, as prescribed in [2]. If running the procedure for many iterations, it’s recommended to set this to ‘True’. You also might set method='cg' plus maxupd=1 and limit_step=False to reduce the algorithm to simple projected gradient descent with a line search.

  • initial_step (float) – Initial step size to use for proximal gradient updates. Larger step sizes reach converge faster, but are more likely to result in failed optimization. Ignored when passing method='tncg' or method='cg', as those will perform a line seach instead.

  • early_stop (bool) – In the TNCG method, whether to stop before reaching the maximum number of iterations if the updates do not change the factors significantly or at all.

  • reuse_prev (bool) – In the TNCG method, whether to reuse the factors obtained in the previous iteration as starting point for each inner update. This has the effect of reaching convergence much quicker, but will oftentimes lead to slightly worse solutions.

    If passing False and maxupd is small, the obtained factors might not be sparse at all. If passing True, they will typically be less sparse than when passing False with large maxupd or than with method='cg'.

    Setting it to True has the side effect of potentially making the factors obtained when fitting the model different from the factors obtained after calling the predict_factors function with the same data the model was fit.

    For methods other than TNCG, this is always assumed True.

  • weight_mult (float > 0) – Extra multiplier for the weight of the positive entries over the missing entries in the matrix to factorize. Be aware that Poisson likelihood will implicitly put more weight on the non-missing entries already. Passing larger values will make the factors have larger values (which might be desirable), and can help with instability and failed optimization cases. If passing this, it’s recommended to try very large values (e.g. 10^2), and might require adjusting the other hyperparameters.

  • random_state (int, RandomState, or Generator) – Random seed to use to initialize model parameters. If passing a NumPy ‘RandomState’, will use it to draw a random integer as initial seed. If passing a NumPy ‘Generator’, will use it directly for drawing random numbers.

  • reindex (bool) – Whether to reindex data internally. Will be ignored if passing a sparse COO matrix to ‘fit’.

  • copy_data (bool) – Whether to make deep copies of the data in order to avoid modifying it in-place. Passing ‘False’ is faster, but might modify the ‘X’ inputs in-place if they are DataFrames.

  • produce_dicts (bool) – Whether to produce Python dictionaries for users and items, which are used to speed-up the prediction API of this package. You can still predict without them, but it might take some additional miliseconds (or more depending on the number of users and items).

  • use_float (bool) – Whether to use C float type (typically np.float32) instead of C double type (typically np.float64). Using float types is faster and uses less memory, but it has less numerical precision, which is problematic with this type of model.

  • handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).

  • nthreads (int) – Number of threads to use to parallelize computations. If passing a number lower than 1, will use the same formula as joblib does for calculating number of threads (which is n_cpus + 1 + n_jobs - i.e. pass -1 to use all available threads).

  • n_jobs (int) – Synonym for ‘nthreads’.

Variables:
  • A (array (nusers, k)) – User-factor matrix.
  • B (array (nitems, k)) – Item-factor matrix.
  • user_mapping (array (nusers,)) – ID of the user (as passed to .fit) corresponding to each row of A.
  • item_mapping (array (nitems,)) – ID of the item (as passed to .fit) corresponding to each row of B.
  • user_dict (dict (nusers)) – Dictionary with the mapping between user IDs (as passed to .fit) and rows of A.
  • item_dict (dict (nitems)) – Dictionary with the mapping between item IDs (as passed to .fit) and rows of B.
  • is_fitted (bool) – Whether the model has been fit to some data.

References

[1]Cortes, David. “Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations.” arXiv preprint arXiv:1811.01908 (2018).
[2](1, 2) Nash, Stephen G. “Newton-type minimization via the Lanczos method.” SIAM Journal on Numerical Analysis 21.4 (1984): 770-788.
[3]Li, Can. “A conjugate gradient type method for the nonnegative constraints optimization problems.” Journal of Applied Mathematics 2013 (2013).
fit(X)[source]

Fit Poisson Model to sparse counts data

Parameters:X (Pandas DataFrame (nobs, 3) or COO(m, n)) – Counts atrix to factorize, in sparse triplets format. Can be passed either as a SciPy sparse COO matrix (recommended) with users being rows and items being columns, or as a Pandas DataFrame, in which case it should have the following columns: ‘UserId’, ‘ItemId’, ‘Count’. Combinations of users and items not present are implicitly assumed to be zero by the model. The non-missing entries must all be greater than zero. If passing a COO matrix, will force ‘reindex’ to ‘False’.
Returns:self – This object
Return type:obj
fit_unsafe(A, B, Xcsr, Xcsc)[source]

Faster version for ‘fit’ with no input checks or castings

This is intended as a faster alternative to fit when a model is to be fit multiple times with different hyperparameters. It will not make any checks or conversions on the inputs, as it will assume they are all in the right format.

Passing the wrong types of inputs or passing inputs with mismatching shapes will crash the Python process. For most use cases, it’s recommended to use fit instead.

Note

Calling this will override produce_dicts and reindex (will set both to False).

Parameters:
  • A (array(dimA, k)) – The already-intialized user-factor matrices, as a NumPy array of type C dobule (this is usually np.float64) in row-major order (a.k.a. C contiguous). Should not be a view/subset of a larger array (flag ‘OWN_DATA’). Will be modified in-place.
  • B (array(dimB, k)) – The already-initialized item-factor matrices (see documentation about A for details on the format).
  • Xcsr (CSR(dimA, dimB)) – The ‘X’ matrix to factorize in sparse CSR format (from SciPy). Must have the indices and indptr as type C size_t (this is usually np.uint64). Note that setting them to this type might render the matrices unusable in SciPy’s own sparse module functions. The data part must be of type C double (this is usually np.float64) or C float (usually np.float32) depending on whether the object was constructed with use_float=True or use_float=False.
  • Xcsc (CSC(dimA, dimB)) – The ‘X’ matrix to factorize in sparse CSC format (from SciPy). See documentation about Xcsr for details.
Returns:

self – This object

Return type:

obj

predict(user, item)[source]

Predict expected count for combinations of users (rows) and items (columns)

Note

You can either pass an individual user and item, or arrays representing tuples (UserId, ItemId) with the combinatinons of users and items for which to predict (one entry per prediction).

Parameters:
  • user (array-like (npred,) or obj) – User(s) for which to predict each item.
  • item (array-like (npred,) or obj) – Item(s) for which to predict for each user. Each entry will be matched with the corresponding entry in user.
Returns:

pred – Expected counts for the requested user(row)/item(column) combinations.

Return type:

array(npred,)

predict_factors(X, l2_reg=None, l1_reg=None, weight_mult=None, maxupd=None)[source]

Get latent factors for a new user given her item counts

This is similar to obtaining topics for a document in LDA. See also method ‘transform’ for getting factors for multiple users/rows at a time.

Note

This function works with one user at a time, and will use the TNCG solver regardless of how the model was fit. Note that, since this optimization method may have different optimal hyperparameters than the other methods, it offers the option of varying those hyperparameters in here.

Note

The factors are initialized to the mean of each column in the fitted model.

Parameters:
  • X (DataFrame or tuple(2)) – Either a DataFrame with columns ‘ItemId’ and ‘Count’, indicating the non-zero item counts for a user for whom it’s desired to obtain latent factors, or a tuple with the first entry being the items/columns that have a non-zero count, and the second entry being the actual counts.
  • l2_reg (float) – Strength of L2 regularization to use for optimizing the new factors. If passing None, will take the value set in the model object.
  • l1_reg (float) – Strength of the L1 regularization. Not recommended. If passing None, will take the value set in the model object.
  • weight_mult (float) – Weight multiplier for the positive entries over the missing entries. If passing None, will take the value set in the model object.
  • maxupd (int > 0) – Maximum number of TNCG updates to perform. You might want to increase this value depending on the use-case. If passing None, will take the value set in the model object, clipped to a minimum of 1,000.
Returns:

latent_factors – Calculated latent factors for the user, given the input data

Return type:

array (k,)

topN(user, n=10, include=None, exclude=None, output_score=False)[source]

Rank top-N highest-predicted items for an existing user

Note

Even though the fitted model matrices might be sparse, they are always used in dense format here. In many cases it might be more efficient to produce the rankings externally through some library that would exploit the sparseness for much faster computations. The matrices can be access under self.A and self.B.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
  • n (int) – Number of top-N highest-predicted results to output.
  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’. Must not contain duplicated entries.
  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’. Must not contain duplicated entries.
  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to ‘fit’ was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_new(X, n=10, include=None, exclude=None, output_score=False, l2_reg=None, l1_reg=None, weight_mult=1.0, maxupd=None)[source]

Rank top-N highest-predicted items for a new user

Note

This function calculates the latent factors in the same way as predict_factors - see the documentation of predict_factors for details.

Just like topN, it does not exploit any potential sparsity in the fitted matrices and vectors, so it might be a lot faster to produce the recommendations externally (see the documentation for topN for details).

Note

The factors are initialized to the mean of each column in the fitted model.

Parameters:
  • X (DataFrame or tuple(2)) – Either a DataFrame with columns ‘ItemId’ and ‘Count’, indicating the non-zero item counts for a user for whom it’s desired to obtain latent factors, or a tuple with the first entry being the items/columns that have a non-zero count, and the second entry being the actual counts.
  • n (int) – Number of top-N highest-predicted results to output.
  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’. Must not contain duplicated entries.
  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’. Must not contain duplicated entries.
  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
  • l2_reg (float) – Strength of L2 regularization to use for optimizing the new factors. If passing None, will take the value set in the model object.
  • l1_reg (float) – Strength of the L1 regularization. Not recommended. If passing None, will take the value set in the model object.
  • weight_mult (float) – Weight multiplier for the positive entries over the missing entries. If passing None, will take the value set in the model object.
  • maxupd (int > 0) – Maximum number of TNCG updates to perform. You might want to increase this value depending on the use-case. If passing None, will take the value set in the model object, clipped to a minimum of 1,000.
Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to ‘fit’ was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

transform(X, y=None)[source]

Determine latent factors for new rows/users

Note

This function will use the same method and hyperparameters with which the model was fit. If using this for recommender systems, it’s recommended to use instead the function ‘predict_factors’ as it’s likely to be more precise.

Note

When using method='pg' (not recommended), results from this function and from ‘fit’ on the same datamight differ a lot.

Note

This function is prone to producing all zeros or all NaNs values.

Note

The factors are initialized to the mean of each column in the fitted model.

Parameters:
  • X (DataFrame(nnz, 3), CSR(n_new, nitems), or COO(n_new, nitems)) – New matrix for which to determine latent factors. The items/columns must be the same ones as passed to ‘fit’, while the rows correspond to new entries that were not passed to ‘fit’.

    • If passing a DataFrame, must have columns ‘UserId’, ‘ItemId’, ‘Count’. The ‘UserId’ column will be remapped, with the mapping returned as part of the output.
    • If passing a COO matrix, will be casted to CSR.
    • If passing a CSR matrix (recommended), will use it as-is and will not return a mapping as the output will match row-by-row with ‘X’.
  • y (None) – Ignored. Kept in place for compatibility with SciKit-Learn pipelining.

Returns:

  • A_new (array(n_new, k)) – The obtained latent factors for each row of ‘X’.
  • user_mapping (array(n_new)) – The mapping from ‘UserId’ in ‘X’ to rows of ‘A_new’. Only produced when passing ‘X’ as a DataFrame, in which case the output will be a tuple ‘(A_new, user_mapping_)’.

Indices and tables