Ellie Li

Master's | Data Science, WPI

Selected Projects

Talent.AI

Internal Job Recommendation System

Duration:
Sep 2018 - Dec 2018
Tools:
Python, nltk, gensim, TensorFlow, ElasticSearch
Goal:
Automatically extract skills from job descriptions and resumes using Natural Language Processing (NLP) techniques, then match resumes to jobs and recommend openings. The goal is to reduce the resource cost of the hiring process and relieve pressure on hiring managers across the organization.
Sponsor:
United Technologies (UTC)
Team:
3 data scientists from UTC HR analytics + 5 students from WPI data science
Challenges:
  • There is no existing dictionary to identify which words or phrases in job descriptions are professional skills. (solution: combined TextRank, TF-IDF, NER, and Random Forest to filter out professional skills)
  • Long computation time. (solution: ElasticSearch)
  • Hard to define the number and names of the skill sets generated by K-means. (solution: removed the clustering step and introduced KNN to retrieve the skills most similar to a target skill)
Contribution:
  • Designed algorithms to clean up irrelevant noisy words and extract skills from over 20,000 UTC’s job descriptions
  • Trained Word2Vec with UTC’s job descriptions to obtain n-grams semantic embedding for extracted skills
  • Adopted KNN to find each skill’s synonyms to generate extended skill sets and built and stored inverted index through ElasticSearch
  • Automated the job recommendation system allowing users to upload resumes and receive job recommendations
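The KNN synonym step above can be sketched as follows. This is a minimal toy example with hypothetical skill names and hand-made three-dimensional vectors; in the real project the embeddings came from a Word2Vec model trained on UTC's job descriptions, and the index lived in ElasticSearch.

```python
from math import sqrt

# Toy skill embeddings (hypothetical vectors; the real ones came from a
# Word2Vec model trained on UTC job descriptions).
skills = {
    "python":        [0.9, 0.1, 0.0],
    "programming":   [0.8, 0.2, 0.1],
    "java":          [0.7, 0.3, 0.0],
    "negotiation":   [0.0, 0.1, 0.9],
    "communication": [0.1, 0.0, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_skills(target, k=2):
    """Return the k skills most similar to `target` (KNN by cosine similarity)."""
    others = [(s, cosine(skills[target], v))
              for s, v in skills.items() if s != target]
    return [s for s, _ in sorted(others, key=lambda p: p[1], reverse=True)[:k]]
```

With these toy vectors, `nearest_skills("python")` returns the two technical skills rather than the soft skills, which is exactly the "extended skill set" behavior the project relied on.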
Feedback:

"Very cool. Thank you for sharing. The first match sounds like it aligned best with me!"

“I think this is a very powerful tool! I think with the resume I sent to you I would most definitely apply to this job as it touches on a lot of the skills that I listed…"

"All are jobs that would fit my background, but I personally know I wouldn’t want to work in those roles."

"I think the first 3 JD match my profile but not so much the fourth one."

"I went through the 5 recommendations and have some feedback : 1 - Poor Match, 2 - Good Match, 3 - Poor Match, 4 - Worst Match, 5 - Best Match."

  WHAT DID I LEARN?

  • Clarify the client's goal before doing anything, then stick to it.
  • Be ready to make changes during implementation. In this project, we initially planned to cluster the skill sets and label each class. However, when we showed the clustering results to our sponsors, they found it impossible to give each skill class a proper name, so they changed their requirements and we changed our methodology accordingly.
  • Stay “whelmed” - neither overwhelmed nor underwhelmed. We used to say yes whenever clients came up with new ideas and wanted us to try them, because we wanted to be agreeable. The rewarding part is that when you are open to new challenges, people reach out to you; but these 'bonus' tasks can disrupt the original timeline. It is better to spend some meeting time discussing the feasibility of a new idea and then decide whether to invest time in it.

Rap Maker

An LSTM model with an information retrieval component for rap lyrics generation

Duration:
Mar 2019 - Apr 2019
Tools:
Python, nltk, gensim, keras, tkinter, GitHub
Team:
4 students from WPI data science
Goal:
We first collected and cleaned the lyrics of existing hip-hop songs into a single data set. Next, we extracted keywords and ran sentiment analysis on each piece of lyrics, recording both to build an inverted index. Finally, we trained an LSTM model to generate new rap lyrics based on the user's query.
Challenges:
  • Handling the rhyme scheme (solution: collected the last two letters of each line in the training set when building the LSTM)
  • Automatic keyword extension must respect both grammar and the emotional context (solution: added an information retrieval component)
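The rhyme-scheme trick above can be sketched in a few lines. This is a simplified illustration with made-up lyric lines; `rhyme_token` is a hypothetical helper name, not the project's actual function.

```python
def rhyme_token(line):
    """Extract the last two letters of a lyric line as a crude rhyme feature,
    mirroring the per-line suffixes collected for the LSTM training set."""
    # Keep only letters and spaces so punctuation doesn't pollute the suffix.
    cleaned = "".join(ch for ch in line.lower()
                      if ch.isalpha() or ch == " ").strip()
    return cleaned[-2:] if len(cleaned) >= 2 else cleaned

lyrics = ["Started from the bottom now we here",
          "Money in my pocket, never fear"]
tokens = [rhyme_token(l) for l in lyrics]
```

Feeding these two-letter suffixes to the model alongside the words lets it learn which line endings tend to rhyme.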
Contribution:
  • Designed the framework of our model
  • Wrote the Python script for data preprocessing, including keyword extraction (TextRank, TF-IDF, LDA) and sentiment analysis
  • Implemented the query and searching function
  • Designed and built the UI
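As a rough sketch of the TF-IDF keyword extraction mentioned above (one of the three methods we combined), here is a minimal from-scratch version on toy lyric snippets; the document set, tokenization, and `top_keywords` helper are all illustrative simplifications.

```python
import math
from collections import Counter

# Toy "lyrics" corpus; the real corpus was the cleaned hip-hop lyric data set.
docs = [
    "love and heartbreak in the city lights",
    "money and hustle in the city streets",
    "love conquers heartbreak every time",
]

def top_keywords(doc_idx, docs, k=2):
    """Rank one document's words by TF-IDF score, highest first."""
    tokenized = [d.split() for d in docs]
    n = len(docs)
    tf = Counter(tokenized[doc_idx])
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in tokenized if word in d)  # document frequency
        scores[word] = (count / len(tokenized[doc_idx])) * math.log(n / df)
    return [w for w, _ in sorted(scores.items(), key=lambda p: -p[1])[:k]]
```

Words that appear in only one document ("lights") score above words shared across the corpus ("city", "the"), which is the property that makes TF-IDF useful for keyword extraction.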
More Info:
evakli11.github.io/CS525

  WHAT DID I LEARN?

  • Good lines of communication. Everyone needs to know what the goal is and have some kind of agreed-upon idea of how the goal will be accomplished. Having everyone review the framework before the actual implementation can save the team a lot of time because it can help the team prioritize some steps and identify if we are on the right track.
  • Positive interdependence. Groups often divide the workload among its members to move closer to the final goal. I have learned that if one person does not complete their individual part, the whole group suffers, and therefore it is important that each person fulfill their role according to the group's established timeline.
  • For presenting an NLP project, a more interactive format works better because it gives the audience space to engage.

Air Discount!

Database System

Duration:
Oct 2017 - Dec 2017
Tools:
SQLite, Python, Django, HTML, CSS, JavaScript
Team:
3 students from WPI data science
Goal:
Our aim was to build a website, backed by a database, where people can search for air tickets. Users search by departure city, arrival city, and departure time to get matching flights. When a search returns multiple results, users can sort them by departure time, arrival time, or ticket price.
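The search-and-sort behavior described above can be sketched with SQLite directly. The table name, columns, and sample rows below are hypothetical; the real project used a richer 3NF schema behind Django.

```python
import sqlite3

# Hypothetical single-table schema; the real design was normalized to 3NF.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE flights (
    id INTEGER PRIMARY KEY,
    depart_city TEXT, arrive_city TEXT,
    depart_time TEXT, arrive_time TEXT,
    price REAL)""")
conn.executemany(
    "INSERT INTO flights (depart_city, arrive_city, depart_time, arrive_time, price) "
    "VALUES (?, ?, ?, ?, ?)",
    [("Boston", "Chicago", "2017-12-01 09:00", "2017-12-01 11:30", 220.0),
     ("Boston", "Chicago", "2017-12-01 14:00", "2017-12-01 16:30", 180.0)])

def search(depart, arrive, date, order_by="price"):
    """Find flights by cities and date, sorted by one of the allowed columns."""
    # Whitelist the sort column; user data goes through `?` placeholders.
    assert order_by in ("depart_time", "arrive_time", "price")
    cur = conn.execute(
        f"SELECT depart_time, price FROM flights "
        f"WHERE depart_city = ? AND arrive_city = ? AND depart_time LIKE ? "
        f"ORDER BY {order_by}",
        (depart, arrive, date + "%"))
    return cur.fetchall()
```

Because column names cannot be bound as SQL parameters, the sort key is checked against a whitelist while the city and date values use placeholders, which keeps the query safe from injection.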
Challenges:
  • Everyone in our group was new to web development; we were unfamiliar with web design tools and with linking the front end to the database. (solution: we learn fast!)
  • As the discussion deepened, more and more ideas emerged. (solution: we agreed to start with the simplest version)
Contribution:
  • Designed the ER diagram and clarified the 3NF Relations
  • Built the database in SQLite
  • Took charge of the front-end implementation
Demo:
evakli11.github.io/airdiscount/

  WHAT DID I LEARN?

  • By finishing this project, I learned a more advanced way to use a database system.
  • I got used to working under pressure. My schedule went crazy the first week because of all the new tools, but that forced me to focus on what I was learning, assess my priorities, and come up with a plan. Overall, I think of pressure as a form of motivation rather than an obstacle.
  • It is important to define the scope before implementation. Otherwise, new ideas keep coming up during discussion, and we got stuck in the database design.

Who’s Paying?

A Joint Classification-Regression Model of Predicting Spenders and Revenue for Google Store

Duration:
Nov 2018 - Dec 2018
Tools:
Python, PyTorch, scikit-learn
Team:
3 students from data science + 1 student from computer science
Goal:
The ‘Google Analytics Customer Revenue Prediction’ is a Kaggle competition to predict the revenue generated per customer from data of the Google Merchandise Store (GStore). We aimed to implement a machine learning system that accurately predicts customer-generated revenue.
Challenges:
  • Some customers visit the GStore multiple times, which produces sequential data. (solution: introduced an RNN model to improve predictions for customers with more than one visit)
  • The target variable is heavily skewed: fewer than 1.3% of customer visits generate non-zero revenue. (solution: applied a log transform and undersampling to the dataset, and added a classification module before regression to filter out non-revenue customers)
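The log transform and undersampling from the second challenge can be sketched as below. The revenue values are made up to mimic the skew (mostly zeros); the real pipeline operated on the Kaggle data set.

```python
import math
import random

random.seed(0)

# Hypothetical skewed revenue data: most visits generate zero revenue.
revenues = [0.0] * 98 + [125.0, 4300.0]

# Log-transform targets to tame the heavy right tail; log1p keeps zeros at 0.
log_targets = [math.log1p(r) for r in revenues]

# Undersample the majority (zero-revenue) class to balance the training set.
zeros = [t for t in log_targets if t == 0.0]
nonzeros = [t for t in log_targets if t > 0.0]
sampled_zeros = random.sample(zeros, len(nonzeros))
balanced = sampled_zeros + nonzeros
```

After this step the classifier trains on a 50/50 split instead of a 98/2 split, which keeps it from trivially predicting "no revenue" for every visit.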
Contribution:
  • Wrote the Python script for the pre-classifier part of our pre-classified regression model to do visit-based prediction, trying five classic classification algorithms: Decision Tree, Random Forest, SVM, Logistic Regression, and KNN
  • Tried stacked regression algorithms to improve the performance of regression parts
  • Implemented the Vanilla RNN via PyTorch to do revenue prediction as baseline model of our customer based prediction
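The overall shape of the pre-classified regression model above can be sketched as follows. The feature tuples, threshold classifier, and linear regressor are trivial stand-ins for illustration only; the real pipeline used the five classifiers listed above plus stacked regressors.

```python
def pre_classified_predict(visits, classifier, regressor):
    """Predict revenue per visit: a classifier first filters out visits
    predicted to generate no revenue; the regressor runs only on the rest."""
    return [regressor(v) if classifier(v) else 0.0 for v in visits]

# Hypothetical per-visit features: (pageviews, hits).
visits = [(1, 2), (30, 80), (2, 3), (45, 120)]

# Stand-ins for the trained classification and regression modules.
is_spender = lambda v: v[0] >= 10   # plays the role of DT/RF/SVM/LogReg/KNN
revenue_of = lambda v: 1.5 * v[1]   # plays the role of the regression module

preds = pre_classified_predict(visits, is_spender, revenue_of)
```

Splitting the problem this way means the regressor never has to learn the mass of zeros, which is the point of putting the classification module first.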
More Info:
evakli11.github.io/Google-Store-Revenue-Prediction

  WHAT DID I LEARN?

  • Keep a healthy relationship within the team. For example, couch feedback on teammates' work in positive terms to avoid defensiveness when a misunderstanding or gap exists, and offer modification suggestions kindly.
  • Be careful about the distribution of the data. In real-world problems, datasets are rarely normally distributed and tend to be skewed.

Scheduling Assistant

Automate the Scheduling Process of Campus Event with Artificial Intelligence

Duration:
Mar 2018 - Apr 2018
Team:
2 students from WPI data science + 1 student from WPI computer science + 1 student from WPI robotics
Goal:
This project is a scheduling assistant for a campus organization known as Engineering Ambassadors (EA). In EA, student-led events are held each week where participants are scheduled based on their availability. The person in charge of scheduling must satisfy a number of requirements, and doing this by hand can take up to a week, so this project aims to automate the process with hill-climbing and genetic algorithms.
Tool:
Python
Challenges:
  • Too many constraints. These constraints add complexity to the scheduling because the schedule must be fair and each event has required roles.
  • We were given a corner case: 30 students and only 20 events, yet our constraints prevent any one student from taking charge of too many roles or events.
Contribution:
  • Defined the data structure of whole project and drew the class diagram
  • Wrote the Python script of hill-climbing algorithm
  • Documented the report of our outcomes
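The hill-climbing script mentioned above can be sketched generically. The scheduling problem here is a toy (6 events, 3 students, fairness-only scoring); the real project had role requirements and many more constraints, and `hill_climb`, `score`, and `neighbor` are illustrative names.

```python
import random

random.seed(1)

def hill_climb(initial, neighbor, score, iters=200):
    """Generic hill climbing: propose a random neighbor each iteration and
    move to it only if it strictly improves the score."""
    current, best = initial, score(initial)
    for _ in range(iters):
        cand = neighbor(current)
        s = score(cand)
        if s > best:
            current, best = cand, s
    return current, best

# Toy scheduling: assign 6 events to 3 students, penalizing unfair workloads.
def score(assignment):
    loads = [assignment.count(s) for s in range(3)]
    return -sum((l - 2) ** 2 for l in loads)  # perfectly fair schedule -> 0

def neighbor(assignment):
    """Reassign one random event to a random student."""
    a = list(assignment)
    a[random.randrange(len(a))] = random.randrange(3)
    return a

best, s = hill_climb([0] * 6, neighbor, score)  # start: one student does all
```

Starting from the worst schedule (one student assigned every event, score -24), the climber quickly converges toward a balanced assignment, though as the lessons below note, hill climbing offers no guarantee of a perfect schedule on the real constraint set.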
More Info:
evakli11.github.io/schedulingAssistant/

  WHAT DID I LEARN?

  • Establishing good coding habits will enhance design factors like modularity, and your code will be easier to understand. It's important when collaborating with others.
  • Neither algorithm was wholly successful at creating a perfect schedule, because of how complex the data was and how much time it took to iterate through and test the algorithms. However, the schedules created were improvements from a completely random schedule.

  WHAT WOULD I HAVE DONE DIFFERENTLY?

For evaluation, we currently use only counts such as the number of people with too many or too few hours and the number of people with too many roles filled, but these do not capture how well a schedule actually works. For example, a schedule where 10 people each exceed their expected hours by 2 is clearly better than one where 10 people each exceed them by 10. By improving the strategy for evaluating schedules, we could collect more data for tuning the heuristic and produce even better outcomes.