Poisson factorization for sparse count matrices
For more information, visit the project's homepage: https://www.github.com/davidcortes/poismf

class poismf.PoisMF(k=50, method='tncg', l2_reg='auto', l1_reg=0.0, niter='auto', maxupd='auto', limit_step=True, initial_step=1e-7, early_stop=True, reuse_prev=False, weight_mult=1.0, random_state=1, reindex=True, copy_data=True, produce_dicts=False, use_float=False, handle_interrupt=True, nthreads=1)
Bases: object
Poisson Matrix Factorization
Fast and memory-efficient model for recommender systems based on Poisson factorization of sparse count data (e.g. number of times a user played different songs), using gradient-based optimization procedures.
Note
In order to obtain sparse latent factor matrices, you need to pass method='tncg' and a large niter, such as niter=50 or niter=100. The L1 regularization approach is not recommended, even though it might also produce relatively sparse results with the other optimization methods.
Note
If passing reindex=True, it will internally re-index all user and item IDs. Your data will not require re-indexing if the IDs for users and items meet the following criteria:
- Are all integers.
- Start at zero.
- Don't have any enumeration gaps, i.e. if there is a user '4', user '3' must also be there.
Note
When using the proximal gradient method, this model is prone to numerical instability, and can turn out to produce all-NaN or all-zero fitted parameters. The conjugate gradient and truncated Newton-CG methods are not prone to such failed optimizations.
Parameters:
- k (int) – Number of latent factors to use (dimensionality of the low-rank factorization). If k is small (e.g. k=5), it's recommended to use method='pg'. For large values (e.g. k=100), it's recommended to use method='tncg'. For medium values (e.g. k=50), it's recommended to use either method='tncg' or method='cg'.
- method (str) – Optimization method to use. Options are:
  - "tncg": will use a truncated Newton-CG method. This is the slowest option, but tends to find better local optima, and, if run for many iterations, tends to produce sparse latent factor matrices.
  - "cg": will use a Conjugate Gradient method, which is faster than the truncated Newton-CG, but tends not to reach the best local optima. Usually, with this method and the default hyperparameters, the latent factor matrices will not be very sparse.
  - "pg": will use a proximal gradient method, which is a lot faster than the other two and more memory-efficient, but tends to only work with very large regularization values, doesn't find as good local optima, and doesn't tend to result in sparse factors.
- l2_reg (float) – Strength of L2 regularization. It is recommended to use small values along with method='tncg', very large values along with method='pg', and medium to large values with method='cg'. If passing "auto", will set it to 10^3 for TNCG, 10^4 for CG, and 10^9 for PG.
- l1_reg (float) – Strength of L1 regularization. Not recommended.
- niter (int) – Number of alternating iterations to perform. One iteration denotes an update over both matrices. If passing 'auto', will set it to 10 for TNCG and PG, or 30 for CG. Using more iterations usually leads to better results for CG and TNCG, at the expense of longer fitting times.
- maxupd (int) – Maximum number of updates to each user/item vector within an iteration. Note: for method='tncg', this means the maximum number of function evaluations rather than the number of updates, so it should be higher. You might also want to try decreasing this while increasing niter. For method='pg', this will be taken as the actual number of updates, as it does not perform a line search like the other methods. If passing "auto", will set it to 5*k for method='tncg', 25 for method='cg', and 10 for method='pg'. If using method='cg', you might also try instead setting maxupd=1 and niter=100.
- limit_step (bool) – When passing method='cg', whether to limit the step sizes in each update so as to drive at most one variable to zero each time, as prescribed in [2]. If running the procedure for many iterations, it's recommended to set this to 'True'. You might also set method='cg' plus maxupd=1 and limit_step=False to reduce the algorithm to simple gradient descent with a line search.
- initial_step (float) – Initial step size to use for proximal gradient updates. Larger step sizes converge faster, but are more likely to result in failed optimization. Ignored when passing method='tncg' or method='cg', as those will perform a line search instead.
- early_stop (bool) – In the TNCG method, whether to stop before reaching the maximum number of iterations if the updates do not change the factors significantly or at all.
- reuse_prev (bool) – In the TNCG method, whether to reuse the factors obtained in the previous iteration as the starting point for the optimization procedure. This has the effect of reaching convergence much more quickly, but will oftentimes lead to slightly worse solutions. Setting it to True has the side effect of potentially making the factors obtained when fitting the model different from the factors obtained after calling the predict_factors function with the same data the model was fit to. For the other methods, this is always assumed to be True.
- weight_mult (float > 0) – Extra multiplier for the weight of the positive entries over the missing entries in the matrix to factorize. Be aware that the Poisson likelihood will implicitly put more weight on the non-missing entries already. Passing larger values will make the factors have larger values (which might be desirable), and can help with instability and failed optimization cases. If passing this, it's recommended to try very large values (e.g. 10^2), and it might require adjusting the other hyperparameters.
- random_state (int, RandomState, or Generator) – Random seed to use to initialize model parameters. If passing a NumPy 'RandomState', will use it to draw a random integer as the initial seed. If passing a NumPy 'Generator', will use it directly for drawing random numbers.
- reindex (bool) – Whether to re-index data internally. Will be ignored if passing a sparse COO matrix to 'fit'.
- copy_data (bool) – Whether to make deep copies of the data in order to avoid modifying it in-place. Passing 'False' is faster, but might modify the 'X' inputs in-place if they are DataFrames.
- produce_dicts (bool) – Whether to produce Python dictionaries for users and items, which are used to speed up the prediction API of this package. You can still predict without them, but it might take some additional milliseconds (or more, depending on the number of users and items).
- use_float (bool) – Whether to use C float type (typically np.float32) instead of C double type (typically np.float64). Using float types is faster and uses less memory, but it has less numerical precision, which is problematic with this type of model.
- handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing 'True'), or raise an interrupt exception without producing a fitted model object (when passing 'False').
- nthreads (int) – Number of threads to use to parallelize computations. If set to 0 or less, will use the maximum available on the computer.
Variables:
- A (array (nusers, k)) – User-factor matrix.
- B (array (nitems, k)) – Item-factor matrix.
- user_mapping (array (nusers,)) – ID of the user (as passed to .fit) corresponding to each row of A.
- item_mapping (array (nitems,)) – ID of the item (as passed to .fit) corresponding to each row of B.
- user_dict (dict (nusers)) – Dictionary with the mapping between user IDs (as passed to .fit) and rows of A.
- item_dict (dict (nitems)) – Dictionary with the mapping between item IDs (as passed to .fit) and rows of B.
- is_fitted (bool) – Whether the model has been fit to some data.
References
[1] Cortes, David. "Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations." arXiv preprint arXiv:1811.01908 (2018).
[2] Li, Can. "A conjugate gradient type method for the non-negative constraints optimization problems." Journal of Applied Mathematics 2013 (2013).
[3] Carlsson, Christer, et al. "User's guide for TN/TNBC: Fortran routines for nonlinear optimization." Mathematical Sciences Dept. Tech. Rep. 307, The Johns Hopkins University. 1984.
eval_llk(X_test, full_llk=False, include_missing=False)
Evaluate Poisson log-likelihood (plus constant) for a given dataset
Note
By default, this Poisson log-likelihood is calculated only for the combinations of users (rows) and items (columns) provided in X_test here, ignoring the missing entries. This is the usual use case for evaluating a validation or test set, but it can also be used for evaluating the training data with all missing entries included as zeros (see parameters for details). Note that this calculates a sum, not an average.
Note
If using more than 1 thread, the results might vary slightly between runs.
Parameters:
- X_test (Pandas DataFrame (nobs, 3) or COO (m, n)) – Input data on which to calculate log-likelihood, consisting of IDs and counts. If passing a DataFrame, must contain one row per non-zero observation, with columns 'UserId', 'ItemId', 'Count'. If passing a sparse COO matrix, must have the same number of rows and columns as the one that was passed to 'fit'. If using 'reindex=True', cannot be passed as a COO matrix.
- full_llk (bool) – Whether to add to the number a constant given by the data which doesn't depend on the fitted parameters. If passing 'False' (the default), there is some chance that the resulting log-likelihood will end up being positive; this is not an error, but is simply due to the omission of this constant. Passing 'True' might result in numeric overflow and low numerical precision.
- include_missing (bool) – If 'True', will calculate the Poisson log-likelihood for all entries (i.e. all combinations of users/items, the whole matrix 'X'), taking the missing ones as zero-valued. If passing 'False', will calculate the Poisson log-likelihood only for the non-missing entries passed in 'X_test'; this is usually the desired behavior when evaluating a test dataset.
Returns: llk – Obtained log-likelihood (higher is better).
Return type: float

fit(X)
Fit Poisson model to sparse count data
Parameters:
- X (Pandas DataFrame (nobs, 3) or COO (m, n)) – Counts matrix to factorize, in sparse triplets format. Can be passed either as a SciPy sparse COO matrix (recommended), with users as rows and items as columns, or as a Pandas DataFrame, in which case it should have the following columns: 'UserId', 'ItemId', 'Count'. Combinations of users and items not present are implicitly assumed to be zero by the model. The non-missing entries must all be greater than zero. If passing a COO matrix, will force 'reindex' to 'False'.
Returns: self – This object
Return type: obj

fit_unsafe(A, B, Xcsr, Xcsc)
Faster version of 'fit' with no input checks or casting
This is intended as a faster alternative to fit when a model is to be fit multiple times with different hyperparameters. It will not make any checks or conversions on the inputs, as it will assume they are all in the right format. Passing the wrong types of inputs, or passing inputs with mismatching shapes, will crash the Python process. For most use cases, it's recommended to use fit instead.
Note
Calling this will override produce_dicts and reindex (will set both to False).
Parameters:
- A (array (dimA, k)) – The already-initialized user-factor matrix, as a NumPy array of type C double (this is usually np.float64) in row-major order (a.k.a. C-contiguous). Should not be a view/subset of a larger array (flag 'OWN_DATA'). Will be modified in-place.
- B (array (dimB, k)) – The already-initialized item-factor matrix (see documentation about A for details on the format).
- Xcsr (CSR (dimA, dimB)) – The 'X' matrix to factorize in sparse CSR format (from SciPy). Must have the indices and indptr as type C size_t (this is usually np.uint64). Note that setting them to this type might render the matrices unusable in SciPy's own sparse module functions. The data part must be of type C double (this is usually np.float64) or C float (usually np.float32), depending on whether the object was constructed with use_float=True or use_float=False.
- Xcsc (CSC (dimA, dimB)) – The 'X' matrix to factorize in sparse CSC format (from SciPy). See documentation about Xcsr for details.
Returns: self – This object
Return type: obj

predict(user, item)
Predict expected counts for combinations of users (rows) and items (columns)
Note
You can either pass an individual user and item, or arrays representing tuples (UserId, ItemId) with the combinations of users and items for which to predict (one entry per prediction).
Parameters:
- user (array-like (npred,) or obj) – User(s) for which to predict each item.
- item (array-like (npred,) or obj) – Item(s) for which to predict for each user. Each entry will be matched with the corresponding entry in user.
Returns: pred – Expected counts for the requested user(row)/item(column) combinations.
Return type: array(npred,)

predict_factors(X, l2_reg=None, l1_reg=None, weight_mult=None, maxupd=None)
Get latent factors for a new user given her item counts
This is similar to obtaining topics for a document in LDA. See also the method 'transform' for getting factors for multiple users/rows at a time.
Note
This function works with one user at a time, and will use the truncated Newton-CG approach regardless of how the model was fit. Note that, since this optimization method will likely have different optimal hyperparameters than the other methods, it offers the option of varying those hyperparameters here.
Note
The factors are initialized to the mean of each column in the fitted model.
Parameters:
- X (DataFrame or tuple(2)) – Either a DataFrame with columns 'ItemId' and 'Count', indicating the non-zero item counts for a user for whom it's desired to obtain latent factors, or a tuple with the first entry being the items/columns that have a non-zero count, and the second entry being the actual counts.
- l2_reg (float) – Strength of L2 regularization to use for optimizing the new factors. If passing None, will take the value set in the model object.
- l1_reg (float) – Strength of the L1 regularization. Not recommended. If passing None, will take the value set in the model object.
- weight_mult (float) – Weight multiplier for the positive entries over the missing entries. If passing None, will take the value set in the model object.
- maxupd (int > 0) – Maximum number of Newton-CG updates to perform. You might want to increase this value depending on the use case. If passing None, will take the value set in the model object.
Returns: latent_factors – Calculated latent factors for the user, given the input data
Return type: array (k,)

topN(user, n=10, include=None, exclude=None, output_score=False)
Rank top-N highest-predicted items for an existing user
Parameters:
- user (int or obj) – User for which to rank the items. If the 'X' passed to 'fit' was a DataFrame, must match with the entries in its 'UserId' column, otherwise should match with the rows of 'X'.
- n (int) – Number of top-N highest-predicted results to output.
- include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If the 'X' passed to 'fit' was a DataFrame, must match with the entries in its 'ItemId' column, otherwise should match with the columns of 'X'. Can only pass one of 'include' or 'exclude'. Must not contain duplicated entries.
- exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If the 'X' passed to 'fit' was a DataFrame, must match with the entries in its 'ItemId' column, otherwise should match with the columns of 'X'. Can only pass one of 'include' or 'exclude'. Must not contain duplicated entries.
- output_score (bool) – Whether to output the scores in addition to the IDs. If passing 'False', will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
Returns:
- items (array (n,)) – The top-N highest-predicted items for this user. If the 'X' data passed to 'fit' was a DataFrame, will contain the item IDs from its column 'ItemId', otherwise will be integers matching to the columns of 'X'.
- scores (array (n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_new(X, n=10, include=None, exclude=None, output_score=False, l2_reg=100000.0, l1_reg=0.0, weight_mult=1.0, maxupd=100)
Rank top-N highest-predicted items for a new user
Note
This function calculates the latent factors in the same way as predict_factors; see the documentation of predict_factors for details.
Note
The factors are initialized to the mean of each column in the fitted model.
Parameters:
- X (DataFrame or tuple(2)) – Either a DataFrame with columns 'ItemId' and 'Count', indicating the non-zero item counts for a user for whom it's desired to obtain latent factors, or a tuple with the first entry being the items/columns that have a non-zero count, and the second entry being the actual counts.
- n (int) – Number of top-N highest-predicted results to output.
- include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If the 'X' passed to 'fit' was a DataFrame, must match with the entries in its 'ItemId' column, otherwise should match with the columns of 'X'. Can only pass one of 'include' or 'exclude'. Must not contain duplicated entries.
- exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If the 'X' passed to 'fit' was a DataFrame, must match with the entries in its 'ItemId' column, otherwise should match with the columns of 'X'. Can only pass one of 'include' or 'exclude'. Must not contain duplicated entries.
- output_score (bool) – Whether to output the scores in addition to the IDs. If passing 'False', will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- l2_reg (float) – Strength of L2 regularization to use for optimizing the new factors.
- l1_reg (float) – Strength of the L1 regularization. Not recommended.
- weight_mult (float) – Weight multiplier for the positive entries over the missing entries.
- maxupd (int > 0) – Maximum number of Newton-CG updates to perform. You might want to increase this value depending on the use case.
Returns:
- items (array (n,)) – The top-N highest-predicted items for this user. If the 'X' data passed to 'fit' was a DataFrame, will contain the item IDs from its column 'ItemId', otherwise will be integers matching to the columns of 'X'.
- scores (array (n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

transform(X, y=None)
Determine latent factors for new rows/users
Note
This function will use the same method and hyperparameters with which the model was fit. If using this for recommender systems, it's recommended to instead use the function 'predict_factors', even though the new factors obtained with that function might not end up being on the same scale as the original factors in 'self.A'.
Note
When using the proximal gradient method, results from this function and from 'fit' on the same data might differ a lot. If this is a problem, it's recommended to use conjugate gradient instead.
Note
This function is prone to producing all-zero or all-NaN values.
Note
The factors are initialized to the mean of each column in the fitted model.
Parameters:
- X (DataFrame (nnz, 3), CSR (n_new, nitems), or COO (n_new, nitems)) – New matrix for which to determine latent factors. The items/columns must be the same ones as passed to 'fit', while the rows correspond to new entries that were not passed to 'fit'.
  - If passing a DataFrame, must have columns 'UserId', 'ItemId', 'Count'. The 'UserId' column will be remapped, with the mapping returned as part of the output.
  - If passing a COO matrix, will be cast to CSR.
  - If passing a CSR matrix (recommended), will use it as-is and will not return a mapping, as the output will match row-by-row with 'X'.
- y (None) – Ignored. Kept in place for compatibility with scikit-learn pipelines.
Returns:
- A_new (array (n_new, k)) – The obtained latent factors for each row of 'X'.
- user_mapping (array (n_new,)) – The mapping from 'UserId' in 'X' to rows of 'A_new'. Only produced when passing 'X' as a DataFrame, in which case the output will be a tuple '(A_new, user_mapping_)'.